DataMining(2)_Data Preprocessing

Data Quality: Why Preprocess the Data?

Accuracy: are the values correct or wrong, accurate or not?
Completeness: are values missing (not recorded, unavailable)?
Consistency: were some copies modified but others not, leaving dangling references?
Timeliness: are the data updated in a timely way?
Believability: how far can the data be trusted to be correct?
Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

  1. Data cleaning
    Data in the real world is dirty: lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission error
    Incomplete (Missing) Data
    Noisy Data, which can be handled by:
      Binning
      Regression
      Clustering
      Combined computer and human inspection
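
The two cleaning ideas above, filling in missing values and smoothing noisy values by binning, can be sketched in a few lines of Python. This is a minimal illustration, not the only approach: mean imputation and equal-frequency bins smoothed by their means are assumptions chosen here, and the function names and the sample price list are hypothetical.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values.

    Mean imputation is one common choice (an assumption here); a median
    or a model-based estimate could be used instead.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning followed by smoothing with bin means.

    Values are sorted, split into consecutive bins of `bin_size` items,
    and every value is replaced by the mean of its bin.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Illustrative data only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(fill_missing_with_mean([10, None, 20, 30]))
print(smooth_by_bin_means(prices, 3))
```

With bins of three values, the sorted prices fall into the bins (4, 8, 15), (21, 21, 24) and (25, 28, 34), whose means are 9, 22 and 29, so each value is replaced by one of those three numbers.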

  2. Data integration
    Combines data from multiple sources into a coherent store
    Handling Redundancy in Data Integration
    Correlation Analysis
    1) Nominal data
    2) Numeric data
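
The formulas for this section did not survive in the notes, so the sketch below uses the measures most commonly applied in correlation analysis: the chi-square statistic for two nominal attributes, and covariance and the Pearson correlation coefficient for two numeric attributes. That these are the intended measures is an assumption, and the function names and sample numbers are illustrative.

```python
from math import sqrt


def chi_square(observed):
    """Chi-square statistic for a contingency table (list of rows).

    Expected counts assume the two nominal attributes are independent:
    expected = row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (count - expected) ** 2 / expected
    return stat


def covariance(a, b):
    """Population covariance: mean product of deviations from the means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def pearson(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))


# Illustrative 2x2 contingency table and two numeric attributes.
table = [[250, 200], [50, 1000]]
a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
print(round(chi_square(table), 2))
print(round(covariance(a, b), 2), round(pearson(a, b), 3))
```

A large chi-square value suggests the two nominal attributes are related, and a Pearson coefficient near +1 or -1 suggests one numeric attribute is largely redundant given the other. Both are hints for redundancy handling, not proof of causality.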

  3. Data reduction
    Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
    Data reduction strategies
    Dimensionality reduction, e.g., removing unimportant attributes
      Wavelet transforms
      Principal Components Analysis (PCA)
      Feature subset selection, feature creation
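
As a rough illustration of dimensionality reduction, the sketch below finds the first principal component of a small data set and projects each row onto it, turning d attributes into one. It is a pure-Python sketch under stated assumptions: power iteration stands in for a full eigendecomposition, the data must have nonzero variance, and the function name and sample points are hypothetical. In practice a numerical library would normally be used.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (unit direction, 1-D projections) for the top principal component."""
    n, d = len(rows), len(rows[0])

    # Center each attribute on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue, provided
    # the starting vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Each row is reduced to a single coordinate along that direction.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


# Illustrative 2-D points lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, reduced = first_principal_component(points)
print(direction)
print(reduced)
```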

    Numerosity reduction (some simply call it data reduction)
      Regression and log-linear models
      Histograms, clustering, sampling
      Data cube aggregation
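
Two of the numerosity reduction techniques listed above, histograms and sampling, can be sketched as follows. The choices here are assumptions made for illustration: the histogram uses equal-width buckets, the sample is a simple random sample without replacement, and the function names and data are hypothetical.

```python
import random


def equal_width_histogram(values, num_bins):
    """Summarise values as (bucket_low, bucket_high, count) triples.

    Assumes max(values) > min(values) so that the bucket width is nonzero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bucket.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_bins)]


def simple_random_sample(values, k, seed=0):
    """Simple random sample without replacement of k items.

    The fixed seed only makes the illustration repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(values, k)


# Illustrative data only.
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10]
for low, high, count in equal_width_histogram(data, 2):
    print(low, high, count)
print(simple_random_sample(data, 3))
```

Either result replaces the full list with something smaller: the histogram stores one count per bucket, and the sample stores k of the original values.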

    Data compression

  4. Data transformation and data discretization
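
The notes give no detail for this task, so the sketch below shows three operations that usually fall under it: min-max normalization, z-score normalization, and equal-width discretization. Which techniques the notes had in mind is an assumption, and the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max].

    Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def discretize_equal_width(values, num_bins):
    """Map each numeric value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    # Clamp so that the maximum value falls in the last interval.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


# Illustrative data only.
incomes = [12000, 73600, 98000]
print(min_max_normalize(incomes))
print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
print(discretize_equal_width([0, 5, 10], 2))
```

Min-max normalization preserves the relative spacing of values inside a fixed range, while z-score normalization is less sensitive to where the exact minimum and maximum happen to fall. Discretization replaces numeric values with a small number of interval labels.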
