CH1 Data mining

Major data mining tasks

Classification and regression
- Classification predicts categorical attribute values
- Regression predicts numerical attribute values
Cluster analysis

Given a set of objects, each having a set of attributes, and a
similarity measure among them, find clusters (i.e., groups) such
that

objects in one cluster are more similar to one another
objects in separate clusters are less similar to one another
unlike classification, clustering analyzes objects without
consulting a known class label

Association analysis

Given a transactional database, find the sets of objects that
frequently appear within the same transactions
also called frequent pattern mining
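
To make the definition concrete, here is a minimal brute-force sketch that counts how often each item and each pair of items appears within the same transaction. It is not an optimized algorithm such as Apriori; the function name `frequent_itemsets`, the `min_support` threshold, and the basket data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets of up to `max_size` items and keep the frequent ones.

    An itemset is frequent if it appears in at least `min_support`
    transactions. This is plain enumeration, suitable only for tiny data.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # drop duplicates, fix item order
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical market-basket data: one list of items per transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

frequent = frequent_itemsets(baskets, min_support=3)
for itemset, n in sorted(frequent.items()):
    print(itemset, n)
```

With a support threshold of 3, the pair ("beer", "diapers") is kept because it occurs in three baskets, while ("beer", "bread") occurs in only two and is dropped.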

Various data repositories

relational data
data warehouses
transactional data
graph data
sequence data
time series
spatial data
text & multimedia data

CH2a Data preprocessing

Real-world data is often
- noisy
- inconsistent
- redundant

Data preprocessing tasks

types of attributes
- Categorical
  - nominal: provide enough information to distinguish one object from another
    Example: zip codes, employee ID numbers, eye color, gender
  - binary: assume only two values (e.g., yes/no, true/false, 0/1)
  - ordinal: provide enough information to order objects
    Example: grades, {good, better, best}
- Numeric (continuous)
descriptive data summarization
gives the overall picture of the data
involves
- measuring the central tendency
  - mean
    The mean is sensitive to extreme values
  - weighted mean
  - trimmed mean: the mean computed after discarding the low and high extremes
  - median: the middle value of an ordered set of observations;
    unlike the mean, it is not sensitive to extreme values
  - mode: the value that occurs most frequently in the set
  - midrange: average of the largest and smallest values in the
    data
- measuring the dispersion
  - range: difference between the largest and smallest value
  - kth percentile: value x_i with the property that k percent of
  the data are smaller than x_i (the median is the 50th percentile)
  - quartiles: 25th percentile (denoted by Q1), 50th percentile
  (the median), and 75th percentile (denoted by Q3)
  - interquartile range:
  IQR = Q3 - Q1
  - five number summary: consists of minimum, Q1, median, Q3,
  maximum
  - standard deviation σ: square root of the variance σ^2
- graphical display of descriptive summaries
  - boxplots
  - histograms
  - scatter plots
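
The measures of central tendency and dispersion listed above can be sketched with Python's standard `statistics` module. The salary values, the helper names `trimmed_mean` and `five_number_summary`, and the 10% trimming fraction are assumptions chosen for illustration. Quartile conventions also differ between textbooks and tools, so the `"inclusive"` method used here is only one reasonable choice.

```python
import statistics


def trimmed_mean(values, fraction=0.1):
    """Mean after dropping `fraction` of the values from each end."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # how many values to cut per end
    return statistics.mean(ordered[k:len(ordered) - k])


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical salaries in thousands; 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Central tendency
mean = statistics.mean(salaries)
median = statistics.median(salaries)
modes = statistics.multimode(salaries)  # every most-frequent value
midrange = (min(salaries) + max(salaries)) / 2

# Dispersion
value_range = max(salaries) - min(salaries)
minimum, q1, q2, q3, maximum = five_number_summary(salaries)
iqr = q3 - q1
variance = statistics.pvariance(salaries)  # population variance
std_dev = statistics.pstdev(salaries)      # its square root

print(mean, median, modes, midrange, trimmed_mean(salaries))
print(value_range, (minimum, q1, q2, q3, maximum), iqr, std_dev)
```

On this data the single extreme salary pulls the mean (58) above the median (54), which shows why the median and the trimmed mean are preferred when extreme values are present.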

Data cleaning
- fill in missing values
  e.g., Occupation = ""
- smooth out noise (errors or outliers), which may come from
  - faulty data collection instruments
  - human or computer error at data entry
  - errors in data transmission
- outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
  e.g., Salary = "-10"
- correct inconsistencies in the data
  e.g., Age = "42", Birthday = "03/07/2010"
  e.g., discrepancy between duplicate records
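
As a rough sketch of the IQR rule for outliers, the function below flags values that lie more than 1.5 x IQR below Q1 or above Q3. The function name `iqr_outliers`, the sample data, and the `"inclusive"` quartile method are assumptions; other quartile conventions may move the fences slightly.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
suspects = iqr_outliers(salaries)
print(suspects)
```

A flagged value is only a candidate for cleaning: it may be a genuine extreme observation and not an error, so it usually deserves a manual check before being smoothed or removed.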

Given N tuples, are numerical attributes A and B correlated?
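
One common way to answer this question is the Pearson correlation coefficient, assumed here as the measure because the original formula is not available in the text. The sketch below computes it from scratch; the function name `pearson_r` and the sample values are illustrative only.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Returns a value in [-1, 1]: positive means A and B tend to increase
    together, negative means one tends to decrease as the other increases,
    and a value near 0 means no linear relationship.
    """
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    co_deviation = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    if spread_a == 0 or spread_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return co_deviation / math.sqrt(spread_a * spread_b)


# Hypothetical attribute values for N = 4 tuples.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

A coefficient close to 1 or -1 suggests that one of the two attributes may be redundant, since it can largely be derived from the other.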

Data integration
Data integration combines data from multiple sources into a coherent data store

Entity identification problem
Do two objects from different data sources refer to the same entity?
Example: Is the record that has customer id = 234 (from one source) equivalent to that where cust num = 234 (from the other source)?
Metadata can help: e.g., for each attribute, look at the name, meaning, data type, range of values permitted, etc.

data value conflicts
For the same entity, attribute values from different sources may differ, e.g., weight measured in kilograms or pounds
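
A small sketch of how both issues might be handled before merging: source-specific field names are mapped to a common schema, and weights are converted to one unit. The field names `customer_id`, `cust_num`, and `weight`, the helper `harmonize`, and the record values are all hypothetical.

```python
KG_PER_LB = 0.45359237  # kilograms per international pound


def harmonize(record, rename, weight_unit):
    """Rename source-specific fields and express the weight in kilograms."""
    unified = {rename.get(key, key): value for key, value in record.items()}
    weight = unified.pop("weight")
    unified["weight_kg"] = weight * KG_PER_LB if weight_unit == "lb" else weight
    return unified


# Hypothetical records describing the same customer in two sources.
source_a = {"customer_id": 234, "weight": 70.0}  # weight in kilograms
source_b = {"cust_num": 234, "weight": 154.3}    # weight in pounds

a = harmonize(source_a, rename={}, weight_unit="kg")
b = harmonize(source_b, rename={"cust_num": "customer_id"}, weight_unit="lb")

# Equal keys make the records a candidate match, not a proven one.
candidate_match = a["customer_id"] == b["customer_id"]
weights_agree = abs(a["weight_kg"] - b["weight_kg"]) < 0.1
print(candidate_match, weights_agree)
```

The rename mapping plays the role of metadata here: it records that two differently named attributes carry the same meaning. Comparing further attributes, such as the converted weight, gives extra evidence that the two records refer to the same entity.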

data redundancy

Data transformation
(Goal: modify the data in order to improve data mining performance)
- min-max normalization
- z-score normalization
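
Min-max and z-score normalization are two common transformations of a numeric attribute. The sketch below uses their usual textbook definitions: min-max rescales values linearly into a target range, and z-score subtracts the mean and divides by the standard deviation. The function names, the default range [0, 1], the use of the population standard deviation, and the sample data are assumptions.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("min-max normalization needs at least two distinct values")
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score_normalize(values):
    """Center values on the mean and scale by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("z-score normalization is undefined for constant data")
    return [(v - mean) / std for v in values]


# Hypothetical attribute values.
print(min_max_normalize([10, 20, 30]))
print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
```

Min-max normalization depends entirely on the observed minimum and maximum, so a single outlier compresses all other values. Z-score normalization is often the safer choice when the true bounds are unknown or outliers are present.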
Data reduction
