DataMining(3)_Classification and Prediction

What is classification? What is prediction?

Classification: predicts categorical class labels (discrete or nominal). It constructs a model from a training set whose records carry known values (class labels) in a classifying attribute, then uses that model to classify new data.
Prediction: models continuous-valued functions, i.e., predicts unknown or missing numeric values.

Issues regarding classification and prediction

Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Accuracy
classifier accuracy: how well the model predicts class labels
predictor accuracy: how well the model estimates the values of the predicted attribute
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures: goodness of rules, e.g., decision tree size or the compactness of classification rules

Classification by decision tree induction

General Structure of Hunt’s Algorithm
Let D_t be the set of training records that reach a node t.
General procedure:
If D_t contains only records that belong to the same class y_t, then t is a leaf node labeled as y_t.
If D_t is an empty set, then t is a leaf node labeled by the default class, y_d.
If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (sketched below).
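A minimal Python sketch of Hunt's procedure, assuming categorical attributes and a multiway split on each value of the chosen attribute; choose_attribute is a hypothetical selector (not defined in these notes) that would apply one of the splitting measures discussed next.

```python
from collections import Counter

def hunt(records, labels, attributes, default_class):
    """Grow a decision tree over the training records D_t that reach this node.
    records: list of dicts (attribute name -> value); labels: one class per record."""
    # Case 2: D_t is empty -> leaf labeled with the default class y_d
    if not records:
        return {"leaf": default_class}
    # Case 1: all records share one class y_t (or no attributes remain) -> leaf
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Case 3: mixed classes -> apply an attribute test and recurse on each subset
    attr = choose_attribute(records, labels, attributes)  # hypothetical: e.g. max Gain(A)
    majority = Counter(labels).most_common(1)[0][0]
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in {r[attr] for r in records}:
        idx = [i for i, r in enumerate(records) if r[attr] == value]
        children[value] = hunt([records[i] for i in idx],
                               [labels[i] for i in idx],
                               remaining, majority)
    return {"attribute": attr, "children": children}
```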

Issues
Determine how to split the records

  1. How to specify the attribute test condition?
  2. How to determine the best split?

Determine when to stop splitting

To determine the best split, compare the impurity before and after splitting. Expected information (entropy) needed to classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
where p_i is the probability that an arbitrary tuple in D belongs to class i.
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
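As a concrete illustration, here is a small Python sketch of these three formulas; the function names and the example split are mine, not from the original notes.

```python
import math
from collections import Counter

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the m classes in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_a(partitions):
    """Info_A(D): entropy after splitting D into v partitions, weighted by |D_j|/|D|."""
    n = sum(len(p) for p in partitions)
    return sum((len(p) / n) * info(p) for p in partitions)

def gain(labels, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_a(partitions)

# Toy example: D has 9 'yes' and 5 'no'; attribute A splits D into 3 partitions
D = ["yes"] * 9 + ["no"] * 5
parts = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(gain(D, parts))  # ≈ 0.247 bits gained by branching on A
```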

Information Gain: Gain_split = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)
where parent node p is split into k partitions and n_i is the number of records in partition i (this is Gain(A) above, written in partition notation)

Gain Ratio: GainRatio_split = Gain_split / SplitINFO
SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}
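A matching sketch for the gain ratio, reusing gain() from the snippet above (names again illustrative):

```python
import math

def split_info(partitions):
    """SplitINFO = -sum((n_i/n) * log2(n_i/n)) over the k partitions."""
    n = sum(len(p) for p in partitions)
    return -sum((len(p) / n) * math.log2(len(p) / n) for p in partitions)

def gain_ratio(labels, partitions):
    """GainRatio_split = Gain_split / SplitINFO."""
    return gain(labels, partitions) / split_info(partitions)
```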

Gini index: gini(D) = 1 - \sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D and n is the number of classes
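And a self-contained sketch of the Gini index:

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum(p_j^2), with p_j the relative frequency of class j in D."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes"] * 9 + ["no"] * 5))  # ≈ 0.459 for a node with a 9/5 class mix
```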

The three measures generally return good results, but each has a bias:
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index:
biased towards multivalued attributes
has difficulty when the number of classes is large
tends to favor tests that result in equal-sized partitions with purity in both partitions
