Data Quality

Data quality problems include values that are

  • missing
  • inconsistent
  • invalid
  • implausible

Data preparation workflow

  • 1: How to use data profiling methods to
    Characterise data and provide high-level insights
    Investigate data quality so it may be cleaned

  • The data preparation workflow has three steps

  • Firstly, Discover
    Which data sources are available, and at what level of detail
    What spatio-temporal coverage they offer, and at what cost

  • Secondly, Wrangle
    Read in data, reformat, transform, link

  • Thirdly, Profile
    Rigorous investigation of data quality
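
As a rough illustration of the Wrangle step, the sketch below reads two small tables, reformats a date, transforms a unit, and links the tables on a shared key. Everything in it is an assumption made for the example: the column names, the station codes, the day-first date format, and the Fahrenheit readings are hypothetical and do not come from a real data source.

```python
import csv
import io
from datetime import datetime

# Hypothetical inputs standing in for two source files.
READINGS_CSV = "station,date,temp_f\nA1,03/01/2024,50.0\nB2,04/01/2024,41.0\n"
STATIONS_CSV = "station,name\nA1,Harbour\nB2,Airport\n"


def read_rows(text):
    """Read in: parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Lookup table used to link readings to station names.
stations = {row["station"]: row["name"] for row in read_rows(STATIONS_CSV)}

tidy = []
for row in read_rows(READINGS_CSV):
    tidy.append({
        "station": row["station"],
        # Link: attach the station name; None if the key has no match.
        "station_name": stations.get(row["station"]),
        # Reformat: day-first text (assumed) to an ISO 8601 date string.
        "date": datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat(),
        # Transform: Fahrenheit to Celsius.
        "temp_c": round((float(row["temp_f"]) - 32) * 5 / 9, 1),
    })
```

Using `stations.get` rather than indexing keeps unmatched keys visible as `None` instead of stopping the run, so they can be counted during profiling.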

Subset of Data preparation

  • I: Look at your data
    Number of rows
    Example values
    Data format
    Data types
    How is it encoded?

    1. Why the encoding matters
      If the data contains anything beyond the most basic English (ASCII) text, other people may not be able to read it correctly unless you state the character encoding
    2. File size & number of rows
    3. Check the data types
      Check the format yourself
      Don’t rely on heuristics
      Don’t assume that all your data files use the same format, even if the files come from one source
    4. Example values
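
The checks above can be sketched with the Python standard library. This is a minimal sketch, not a complete profiler: the sample bytes, the column names, and the helper names `decode`, `value_kind`, and `first_look` are all hypothetical. It reports the encoding that worked, the size, the row count, and, per column, the kinds of values found plus a few examples.

```python
import csv
import io

# Hypothetical sample: the bytes of a small CSV file encoded as UTF-8.
RAW = "id,city,temp_c\n1,Zürich,12.5\n2,Kraków,9\n3,Oslo,n/a\n".encode("utf-8")


def decode(raw, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order and report which one worked.

    latin-1 accepts any byte sequence, so it only proves the bytes are
    readable, not that the text is right: still look at the values.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def value_kind(text):
    """Classify a single cell as 'int', 'float', or 'str'."""
    for kind, parser in (("int", int), ("float", float)):
        try:
            parser(text)
            return kind
        except ValueError:
            pass
    return "str"


def first_look(raw, n_examples=2):
    """Summarise encoding, size, row count, value kinds, and example values."""
    text, encoding = decode(raw)
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = {}
    for name in (rows[0].keys() if rows else []):
        values = [row[name] for row in rows]
        columns[name] = {
            "kinds": sorted({value_kind(v) for v in values}),
            "examples": values[:n_examples],
        }
    return {
        "encoding": encoding,
        "size_bytes": len(raw),
        "n_rows": len(rows),
        "columns": columns,
    }


summary = first_look(RAW)
```

A column that reports more than one kind, such as `temp_c` here, is a prompt to read the raw values yourself rather than trust an automatic type guess.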
  • II: Read your data correctly: watch out for special values
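
One common kind of special value is a placeholder that stands in for a missing measurement. The sketch below assumes a hypothetical list of such markers (`MISSING_MARKERS`); in practice the list has to come from the data documentation or from inspecting the file. It shows how one unhandled placeholder distorts a simple mean.

```python
# Hypothetical markers; the real list must come from the data documentation.
MISSING_MARKERS = {"", "na", "n/a", "null", "-999"}


def parse_reading(text, missing_markers=MISSING_MARKERS):
    """Return a float, or None when the cell holds a missing-value marker."""
    cleaned = text.strip()
    if cleaned.lower() in missing_markers:
        return None
    return float(cleaned)


raw_readings = ["12.5", "-999", "N/A", " 7 ", ""]

parsed = [parse_reading(v) for v in raw_readings]
valid = [v for v in parsed if v is not None]
clean_mean = sum(valid) / len(valid)

# What happens if -999 is read as a real measurement.
naive = [12.5, -999.0, 7.0]
naive_mean = sum(naive) / len(naive)
```

Here `clean_mean` is 9.75, while `naive_mean` is -326.5 because the placeholder was treated as data.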

  • III: Is all the data there?

  • 1: Missing values
    The statistical terminology is confusing
    Visualisation helps show where values are missing

  • 1.1: Missing at random (MAR)
    – Missingness is related to other variables
    – The term is misleading!

  • 1.2: Missing completely at random (MCAR)
    – Haphazard
    – Unrelated to values of variable, or other variables

  • 1.3: Missing not at random (MNAR)
    – Related to values of the variable itself
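
The three mechanisms can be illustrated with a small simulation. This is a sketch on synthetic data: the age and income model, the thresholds, and the missingness probabilities are assumptions chosen only to make the effect visible. It compares the mean of the values that remain observed under each mechanism with the mean of the full data.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Synthetic population: income rises with age (an assumption for illustration).
people = []
for _ in range(10_000):
    age = rng.uniform(20, 70)
    income = 1000 + 50 * age + rng.gauss(0, 300)
    people.append((age, income))


def observed_mean(people, p_missing):
    """Mean income after dropping each value with probability p_missing(age, income)."""
    kept = [income for age, income in people
            if rng.random() >= p_missing(age, income)]
    return sum(kept) / len(kept)


full_mean = sum(income for _, income in people) / len(people)

# MCAR: the chance of being missing ignores every variable.
mcar_mean = observed_mean(people, lambda age, income: 0.3)
# MAR: the chance depends on another, observed variable (age).
mar_mean = observed_mean(people, lambda age, income: 0.8 if age > 50 else 0.0)
# MNAR: the chance depends on the value that goes missing (income).
mnar_mean = observed_mean(people, lambda age, income: 0.8 if income > 3500 else 0.0)
```

In this setup the MCAR mean stays close to `full_mean`, while the MAR and MNAR means are pulled downward. In the data itself all three cases look the same: a value is simply absent.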

  • 2: Coverage (e.g. temporal or geographic)

  • 2.1: Temporal coverage

  • 2.2: Spatial coverage
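
Both kinds of coverage can be checked by comparing what was observed with what was expected. The sketch below assumes daily data and a known list of regions; the dates, the region codes, and the helper name `missing_days` are hypothetical.

```python
from datetime import date, timedelta


def missing_days(observed, start, end):
    """Return every day in [start, end] that has no observation."""
    seen = set(observed)
    n_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(n_days))
    return [day for day in all_days if day not in seen]


# Temporal coverage: which days in the expected range are absent?
observed_days = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4), date(2024, 1, 7)]
gaps = missing_days(observed_days, date(2024, 1, 1), date(2024, 1, 7))

# Spatial coverage: which expected regions never appear in the data?
expected_regions = {"NSW", "VIC", "QLD", "TAS"}  # hypothetical region codes
observed_regions = {"NSW", "VIC", "TAS"}
uncovered = sorted(expected_regions - observed_regions)
```

With these inputs `gaps` holds 3, 5 and 6 January 2024, and `uncovered` is `["QLD"]`.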

  • 3: Duplicates
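
A minimal sketch of a duplicate check follows. It first counts exact repeats, then repeats that differ only in formatting. The records and the choice of name plus e-mail as the matching key are hypothetical; which fields identify a record depends on the data.

```python
from collections import Counter

# Hypothetical records: (id, name, email).
records = [
    ("1001", "Ada Lovelace", "ada@example.com"),
    ("1002", "Alan Turing", "alan@example.com"),
    ("1001", "Ada Lovelace", "ada@example.com"),   # exact duplicate
    ("1003", "ada lovelace ", "ADA@example.com"),  # same person, different formatting
]


def duplicates(items):
    """Return each item that occurs more than once, with its count."""
    counts = Counter(items)
    return {item: n for item, n in counts.items() if n > 1}


def match_key(record):
    """Normalise case and whitespace in the fields assumed to identify a person."""
    _, name, email = record
    return (" ".join(name.lower().split()), email.strip().lower())


exact = duplicates(records)
by_key = duplicates(match_key(r) for r in records)
```

The exact check finds the record with id 1001 twice; the normalised key also catches the third, differently formatted copy.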

  • IV: Rigorously check data quality

  • How to write data validation rules
    1.1: Subject-matter specialists typically use free text to describe valid values and explain how to clean them
    1.2: Data scientists may need to write validation & cleaning rules as pseudocode
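
One way to move from free-text descriptions to something executable is to write each rule as a named check. This is a sketch under assumptions: the field names, the allowed codes, and the 0 to 120 age range are hypothetical examples of what a specialist might specify, covering one missing, one implausible, one invalid, and one inconsistent case.

```python
from datetime import date

# Each rule is (description, check); check returns True when the record passes.
RULES = [
    ("age is present",
     lambda r: r.get("age") is not None),
    ("age is plausible (0-120)",
     lambda r: r.get("age") is None or 0 <= r["age"] <= 120),
    ("sex uses an allowed code",
     lambda r: r.get("sex") in {"F", "M", "X"}),
    ("discharge is not before admission",
     lambda r: r["admitted"] <= r["discharged"]),
]


def validate(record, rules=RULES):
    """Return the descriptions of all rules the record fails."""
    return [name for name, check in rules if not check(record)]


good = {"age": 34, "sex": "F",
        "admitted": date(2024, 1, 3), "discharged": date(2024, 1, 5)}
bad = {"age": None, "sex": "female",
       "admitted": date(2024, 1, 3), "discharged": date(2024, 1, 1)}
odd = {"age": 212, "sex": "M",
       "admitted": date(2024, 1, 3), "discharged": date(2024, 1, 3)}

good_failures = validate(good)
bad_failures = validate(bad)
odd_failures = validate(odd)
```

Keeping the description next to the check means the list of failures can be handed back to the specialist in their own words.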
