Unclean Data: Low Quality vs. Untidy

Unclean data has two kinds of problems: low quality and untidiness. The corresponding terms are Low Quality Data/Dirty Data and Untidy Data/Messy Data, respectively.

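By analogy, picture a dirty, cluttered room: dirty data (Low Quality Data/Dirty Data) is like the trash, dust, and banana peels in the room; messy data (Untidy Data/Messy Data) is like the belongings, clothes, and books left lying around in the wrong places.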

Low Quality Data/Dirty Data

Low quality data (Low Quality Data/Dirty Data) usually corresponds to content issues.

low quality data = dirty data = content issues

For example:
inaccurate data,
corrupted data,
duplicate data
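
To make the three kinds of content issues concrete, here is a minimal sketch that flags each of them in a tiny record set. The records, the field names, and the plausible height range (50 to 250 cm) are all invented for illustration; a real check would depend on the data set at hand.

```python
# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": 1, "name": "Ann", "height_cm": "165"},
    {"id": 2, "name": "Bob", "height_cm": "17.8"},  # inaccurate: implausible height
    {"id": 3, "name": "Cy", "height_cm": "1#7?"},   # corrupted: not a number
    {"id": 1, "name": "Ann", "height_cm": "165"},   # duplicate of the first row
]


def find_content_issues(rows, low=50.0, high=250.0):
    """Return (row_index, issue) pairs for duplicate, corrupted, and inaccurate rows."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # A row counts as a duplicate when every field matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            issues.append((i, "duplicate"))
            continue
        seen.add(key)
        # A height that cannot be parsed as a number is treated as corrupted.
        try:
            height = float(row["height_cm"])
        except ValueError:
            issues.append((i, "corrupted"))
            continue
        # A parsable height outside the assumed plausible range is treated as inaccurate.
        if not low <= height <= high:
            issues.append((i, "inaccurate"))
    return issues


issues = find_content_issues(records)
print(issues)
```

Note that the structure of the table is fine here (one row per person, one column per variable); only the content of individual cells and rows is wrong, which is what makes these dirty-data problems rather than messy-data problems.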

Sources of Dirty Data

  • We’re going to have user entry errors.
  • In some situations, we won’t have any data coding standards, or where we do have standards they’ll be poorly applied, causing problems in the resulting data.
  • We might have to integrate data where different schemas have been used for the same type of item.
  • We’ll have legacy data systems, where data was encoded when disk and memory constraints were much more restrictive than they are now. Over time systems evolve. Needs change, and data changes.
  • Some of our data won’t have the unique identifiers it should.
  • Other data will be lost in transformation from one format to another.
  • And then, of course, there’s always programmer error.
  • And finally, data might have been corrupted in transmission or storage by cosmic rays or other physical phenomena. So hey, one that’s not our fault.
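
Two of the sources above, missing coding standards and integrating different schemas, can be sketched in a few lines. Everything in this example is hypothetical: the two source systems, their field names, and the assumed standard that gender is stored as "F" or "M".

```python
# Two hypothetical sources describing the same kind of item with different schemas.
crm_rows = [{"cust_id": "7", "sex": "F"}, {"cust_id": "8", "sex": "male"}]
shop_rows = [{"customerId": 9, "gender": "Female"}]

# Assumed coding standard: gender is stored as "F" or "M"; unknown codes become None.
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def normalize(row, id_field, gender_field):
    """Map one source row onto the shared schema (customer_id, gender)."""
    raw = str(row[gender_field]).strip().lower()
    return {"customer_id": int(row[id_field]), "gender": GENDER_CODES.get(raw)}


unified = (
    [normalize(r, "cust_id", "sex") for r in crm_rows]
    + [normalize(r, "customerId", "gender") for r in shop_rows]
)
print(unified)
```

Mapping unknown codes to None instead of guessing keeps the problem visible, so it can be counted and fixed at the source later.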

Untidy Data/Messy Data

Untidy data (Untidy Data/Messy Data) usually corresponds to structural issues.

untidy data = messy data = structural issues

Anything that is not tidy data is untidy data. So what is tidy data?

Tidy data requirements:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table
by Hadley Wickham
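
As a rough illustration of the first two requirements, the sketch below reshapes a small untidy table in which the column headers "2019" and "2020" are really values of a variable (year). The table, the column names, and the melt helper are hypothetical and written in plain Python; libraries such as pandas offer a comparable reshaping function.

```python
# Hypothetical untidy table: "2019" and "2020" are values of the variable year,
# not variable names, so one row holds two observations.
wide = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def melt(rows, id_field, var_name, value_name):
    """Turn every non-id column into its own row: one observation per row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            tidy_rows.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return tidy_rows


tidy = melt(wide, id_field="country", var_name="year", value_name="cases")
# The former headers are strings; store year as a number now that it is a variable.
for row in tidy:
    row["year"] = int(row["year"])

for row in tidy:
    print(row)
```

After reshaping, each of the three variables (country, year, cases) has its own column, and each country-year observation has its own row.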

(For more on data tidiness issues, see this note.)

Sources of Messy Data

Messy data is usually the result of poor data planning or a lack of awareness of the benefits of tidy data.
