Statistics

CH1 Data mining

Major data mining tasks

  1. Classification and regression

    • Classification predicts categorical attribute values;
    • regression predicts numerical attribute values
  2. Cluster analysis

Given a set of objects, each having a set of attributes, and a
similarity measure among them, find clusters (i.e., groups) such
that

  • objects in one cluster are more similar to one another
  • objects in separate clusters are less similar to one another
  Unlike classification, clustering analyzes objects without
  consulting a known class label.
  3. Association analysis

Given a transactional database, find the sets of objects that
frequently appear within the same transactions.
This task is also called frequent pattern mining.
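
To make the task concrete, here is a minimal brute-force sketch in Python. The function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy transactions are all assumptions made for illustration; the notes do not prescribe a particular algorithm, and real miners avoid enumerating every candidate.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets of up to max_size items and keep the frequent ones.

    min_support is an absolute transaction count (hypothetical parameter).
    This is plain enumeration, not an optimized mining algorithm.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicates within a transaction
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Toy transactional database (made-up data for illustration only).
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
frequent = frequent_itemsets(transactions, min_support=2)
```

With `min_support=2`, the pair `("beer", "diapers")` is kept because it appears in two transactions, while `("beer", "bread")` appears only once and is dropped.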

Various data repositories

  • relational data
  • data warehouses
  • transactional data
  • graph data
  • sequence data
  • time series
  • spatial data
  • text & multimedia data

CH2a Data preprocessing

-noisy
-inconsistent
-redundant

Data preprocessing tasks

  • types of attributes
    • Categorical
      - nominal: provide enough information to distinguish one object from another
      Example: zip codes, employee ID numbers, eye color, gender
      - binary: assume only two values (e.g., yes/no, true/false, 0/1)
      - ordinal: provide enough information to order objects
      Example: grades, {good, better, best}
    • Numeric (continuous)
  • descriptive data summarization
    gives the overall picture of the data
    involves
    • measuring the central tendency
      • mean
        The mean is sensitive to extreme values
      • weighted mean
      • Trimmed mean: disregards the low and high extremes
      • a measure that is not sensitive to extreme values is the
        median, which represents the middle value of an ordered set
        of observations
      • mode: the value that occurs most frequently in the set
      • midrange: average of the largest and smallest values in the
        data
    • measuring the dispersion
      - range: difference between the largest and smallest value
      - kth percentile: value x_i with the property that k percent of
      the data are smaller than x_i (what percentile is the median?)
      - quartiles: 25th percentile (denoted by Q1), 50th percentile
      (the median), and 75th percentile (denoted by Q3)
      - interquartile range:
      IQR = Q3 - Q1
      - five number summary: consists of minimum, Q1, median, Q3,
      maximum
      - standard deviation σ: square root of the variance σ^2
    • graphical display of descriptive summaries
      • boxplots
      • histograms
      • scatter plots
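
The measures of central tendency and dispersion listed above can be sketched with Python's standard `statistics` module. The helper names `trimmed_mean` and `summarize` and the sample values are assumptions for illustration. Quartile conventions also differ between textbooks and tools; the `"inclusive"` method used here is only one of them.

```python
import statistics


def trimmed_mean(values, k):
    """Mean after dropping the k lowest and k highest values."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for the number of values")
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])


def summarize(values):
    """Return common central-tendency and dispersion measures."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    median = statistics.median(ordered)
    return {
        "mean": statistics.mean(ordered),
        "median": median,
        "mode": statistics.mode(ordered),
        "midrange": (ordered[0] + ordered[-1]) / 2,
        "range": ordered[-1] - ordered[0],
        "five_number": (ordered[0], q1, median, q3, ordered[-1]),
        "iqr": q3 - q1,
        "std": statistics.pstdev(ordered),  # population standard deviation
    }


# Made-up sample with one extreme value (100).
data = [2, 4, 4, 5, 7, 9, 10, 12, 100]
summary = summarize(data)
trimmed = trimmed_mean(data, k=1)
```

On this sample the mean is 17 while the median is 7, which shows how a single extreme value pulls the mean; dropping one value from each end gives a trimmed mean of about 7.29.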
  1. Data cleaning
    fill in missing values
    e.g., Occupation = ""
    smooth out noisy data, i.e., data containing errors or outliers; common causes:
    faulty data collection instruments
    human or computer error at data entry
    errors in data transmission

    outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
    e.g., Salary = "-10"
    correct inconsistencies in the data
    e.g., Age = "42", Birthday = "03/07/2010"
    e.g., discrepancy between duplicate records
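
The 1.5 × IQR rule for outliers can be sketched as follows. The function name `iqr_outliers`, its `k` parameter, and the salary values (in thousands) are assumptions for illustration; the quartile method is again one convention among several.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up salaries, including an impossible negative entry.
salaries = [-10, 30, 32, 35, 36, 38, 40, 41, 45]
outliers = iqr_outliers(salaries)
```

Here Q1 = 32 and Q3 = 40, so the fences are 20 and 52, and only the value -10 is flagged.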

Given N tuples, are numerical attributes A and B correlated?
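
The notes do not spell out the measure at this point. One standard choice, assumed for this sketch, is the Pearson correlation coefficient: the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sum of products of deviations (numerator of the coefficient).
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_a * var_b)
```

A result near +1 indicates that A and B rise together, a result near -1 that one falls as the other rises, and a result near 0 that there is no linear relationship.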


  2. Data integration
    Data integration combines data from multiple sources into a coherent data store

Entity identification problem
Do two objects from different data sources refer to the same entity?
Example: Is the record that has customer_id = 234 (from one source) equivalent to the one that has cust_num = 234 (from the other source)?
Metadata can help: for each attribute, look at the name, meaning, data type, range of permitted values, etc.

data value conflicts
For the same entity, attribute values from different sources may differ, e.g., weight measured in kilograms or pounds

data redundancy

  3. Data transformation
    (Goal: modify the data in order to improve data mining performance)
  4. Data reduction

attribute/feature construction

normalization: attribute values are scaled to fall within a smaller, specified range

min-max normalization

z-score normalization
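
The notes name the two normalization methods without giving formulas. The sketch below uses the usual textbook definitions, which are assumed here: min-max maps the observed minimum and maximum linearly onto a target range, and z-score subtracts the mean and divides by the standard deviation. The function and parameter names are hypothetical, and the population standard deviation is an assumed choice.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    old_min, old_max = min(values), max(values)
    if old_min == old_max:
        raise ValueError("cannot rescale a constant attribute")
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]


def z_score_normalize(values):
    """Center values on the mean and scale by the population std deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant attribute")
    return [(v - mean) / std for v in values]


# Made-up attribute values for illustration.
incomes = [10, 20, 30]
scaled = min_max_normalize(incomes)
standardized = z_score_normalize(incomes)
```

After min-max normalization the values span exactly the target range; after z-score normalization they have mean 0 and standard deviation 1, which makes the result less dependent on the observed extremes.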

Data reduction
