Constructing a Decision Tree with the ID3 Algorithm: Introduction, Concepts and Examples

Quoted from: http://kelvinzh.spaces.live.com/blog/cns!D49A2DF3B9825B01!944.entry (Part 1)

             http://kelvinzh.spaces.live.com/blog/cns!D49A2DF3B9825B01!953.entry (Part 2)

Part1:

ID3 Overview

In 1983, Professor J. Ross Quinlan of the University of Sydney introduced a simple decision tree learning algorithm called Iterative Dichotomiser 3 (ID3). ID3 is a heuristic method. The basic idea is to grow a decision tree by a top-down, greedy search through the training data: at each tree node it evaluates the remaining attributes and selects the one that best separates the instances. Top-down means the tree is built from the root downward; greedy means that at each node ID3 commits to the locally best attribute and never backtracks to reconsider earlier choices.

The intention of ID3 is to produce a relatively small decision tree.

 

Basic ID3 Concepts

How does ID3 select the best attribute? To answer this question, a metric called Information Gain is introduced. Information Gain is defined in terms of Entropy, and the preference for attributes with high gain is an expression of Occam's Razor: prefer the smallest tree that fits the data. Entropy measures the impurity (uncertainty) of a collection of examples. Given a collection S whose members fall into m possible classes X1, X2, ..., Xm:

Entropy(S) = ∑i -p(Xi) log2 p(Xi)

Where p(Xi) is the proportion of S belonging to class Xi.

To simplify matters, assume that at each node every instance falls into one of two categories, Positive (P) and Negative (N). Then:

Entropy(S) = -p(P) log2 p(P) - p(N) log2 p(N)

Example

Suppose S is split into two subsets, Left and Right: Left contains equal numbers of positive and negative instances, while in Right one class accounts for about two thirds of the instances and the other for one third.
Then the corresponding entropies are

Entropy(Left) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1

Entropy(Right) = -0.67 log2 0.67 - 0.33 log2 0.33 ≈ 0.92

Note that Entropy measures the impurity of a collection of training examples. Entropy is 0 if all members of S belong to the same class (i.e., S is (n+, 0-) or (0+, n-)); in other words, the data is perfectly classified.
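As a quick numeric check of the two figures above, here is a small Python helper (the function name is my own, not from the quoted post) that evaluates the two-class entropy formula:

```python
import math

def binary_entropy(p_pos, p_neg):
    """Entropy of a two-class collection, given the two class proportions."""
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)  # 0*log2(0) := 0

print(binary_entropy(0.5, 0.5))    # Left subset:  1.0
print(binary_entropy(2/3, 1/3))    # Right subset: ~0.918, i.e. the 0.92 above
print(binary_entropy(1.0, 0.0))    # pure subset:  0.0
```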

Information Gain is defined as the expected reduction in Entropy obtained by partitioning the examples according to a given attribute A [1]:

Gain(S, A) = Entropy(S) - ∑v∈Values(A) (|S_v| / |S|) * Entropy(S_v)

Where:

Values(A) is the set of all possible values of attribute A

S_v is the subset of S for which attribute A has value v

|S_v| is the number of elements in S_v

|S| is the number of elements in S

 

Example:

Suppose we have a real estate data set S containing 14 examples, each described by 4 attributes, one of which is Traffic. Traffic can take 2 possible values: Convenient and Inconvenient. The classification of each of the 14 examples indicates whether to buy the property or not. Among the 14 outcomes there are 9 YES (buy) and 5 NO (do not buy). There are 8 occurrences of Traffic = Convenient and 6 occurrences of Traffic = Inconvenient.

Attribute    Possible Values
Traffic      I (Inconvenient), C (Convenient)
Decision     p (yes / to buy), n (no / not to buy)

For Traffic = Convenient, 6 of the examples are YES and 2 are NO. For Traffic = Inconvenient, 3 are YES and 3 are NO.

Traffic     C   I   C   C   C   I   I   C   C   C   I   I   C   I
Decision    n   n   p   p   p   n   p   n   p   p   p   p   p   n

Therefore

      Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

      Entropy(S_Convenient) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811

      Entropy(S_Inconvenient) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.00

      Gain(S, Traffic) = Entropy(S) - (8/14) * Entropy(S_Convenient) - (6/14) * Entropy(S_Inconvenient)

                       = 0.940 - (8/14) * 0.811 - (6/14) * 1.00

                       = 0.048
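These figures are easy to reproduce; the following throwaway Python check (not part of the original post) prints the same result:

```python
import math

def h2(pos, neg):
    """Two-class entropy from class counts (assumes both counts are non-zero here)."""
    total = pos + neg
    return sum(-(c / total) * math.log2(c / total) for c in (pos, neg))

e_all  = h2(9, 5)   # Entropy(S)              ≈ 0.940
e_conv = h2(6, 2)   # Entropy(S_Convenient)   ≈ 0.811
e_inc  = h2(3, 3)   # Entropy(S_Inconvenient) = 1.000
gain   = e_all - (8/14) * e_conv - (6/14) * e_inc
print(round(gain, 3))   # 0.048
```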

 

ID3 Algorithm:

The ID3 algorithm can be summarized as follows: [2]

  1. Take all unused attributes and compute their entropy with respect to the training samples
  2. Choose the attribute for which the resulting entropy is minimum (equivalently, the Information Gain is maximum)
  3. Make a node containing that attribute
  4. Repeat the above steps recursively on each branch until all attributes are used or every branch is perfectly classified

If we take a step-by-step view of the ID3 algorithm, it can be decomposed as follows:

ID3 (Examples, Target_Attribute, Attributes)

- Create a root node Root for the tree, containing the whole training set as its subset.

- If all examples are positive, return the single-node tree Root, with label = +

- If all examples are negative, return the single-node tree Root, with label = -

- If the list of predicting attributes is empty, then return the single-node tree Root, with label = the most common value of the target attribute in the examples

- Otherwise begin

   o    A ← the attribute that best classifies the examples

   o    Decision tree attribute for Root ← A

   o    For each possible value, vi, of A:

        -  Add a new tree branch below Root, corresponding to the test A = vi.

        -  Let Examples(vi) be the subset of examples that have the value vi for A.

        -  If Examples(vi) is empty, then below this new branch add a leaf node with label = the most common target value in the examples.

        -  Else, below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes - {A}).

- End

- Return Root [2]

A more detailed explanation of the ID3 algorithm, including pseudo code, is provided by Professor Ernest Davis of New York University at http://cs.nyu.edu/faculty/davise/ai/id3.pdf.
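To make the recursion above concrete, here is a minimal, self-contained Python sketch (the function names, the nested-dictionary tree representation and the hard-coded rows are my own illustration, not code from the quoted post); it is run on the 14-example real-estate data set listed in Part 2 below:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes of -p * log2(p)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    gain = entropy([r[target] for r in rows])
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

def id3(rows, attributes, target):
    """Return a leaf label, or a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # all examples share one label -> leaf
        return labels[0]
    if not attributes:                   # no attributes left -> majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):
        branch_rows = [r for r in rows if r[best] == value]
        tree[best][value] = id3(branch_rows, rest, target)
    return tree

# The 14 real-estate examples H1..H14 listed in Part 2 below.
COLUMNS = ["Location", "Price", "Size", "Traffic", "Decision"]
DATA = [
    ("Good",    "High",     "Small", "Convenient",   "n"),   # H1
    ("Good",    "High",     "Small", "Inconvenient", "n"),   # H2
    ("Average", "High",     "Small", "Convenient",   "p"),   # H3
    ("Bad",     "Moderate", "Small", "Convenient",   "p"),   # H4
    ("Bad",     "Low",      "Large", "Convenient",   "p"),   # H5
    ("Bad",     "Low",      "Large", "Inconvenient", "n"),   # H6
    ("Average", "Low",      "Large", "Inconvenient", "p"),   # H7
    ("Good",    "Moderate", "Small", "Convenient",   "n"),   # H8
    ("Good",    "Low",      "Large", "Convenient",   "p"),   # H9
    ("Bad",     "Moderate", "Large", "Convenient",   "p"),   # H10
    ("Good",    "Moderate", "Large", "Inconvenient", "p"),   # H11
    ("Average", "Moderate", "Small", "Inconvenient", "p"),   # H12
    ("Average", "High",     "Large", "Convenient",   "p"),   # H13
    ("Bad",     "Moderate", "Small", "Inconvenient", "n"),   # H14
]
rows = [dict(zip(COLUMNS, values)) for values in DATA]

print(id3(rows, ["Location", "Price", "Size", "Traffic"], "Decision"))
```

Running it prints a tree with Location tested at the root and Size tested under the Location = Good branch, matching the derivation in Part 2.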

 

Part2:

An Example of Decision Tree generation using ID3 Algorithm:

Now let us take a closer look at the earlier example about the decision to buy a real estate property. In addition to Traffic, we introduce three more attributes: Price, Location and Size.

Attribute    Possible Values
Location     Good, Average, Bad
Price        High, Moderate, Low
Size         Small, Large
Traffic      Inconvenient, Convenient
Decision     p (yes / to buy), n (no / not to buy)

All details of the training data set are listed below:

Property NO   Location   Price      Size    Traffic        Decision
H1            Good       High       Small   Convenient     n
H2            Good       High       Small   Inconvenient   n
H3            Average    High       Small   Convenient     p
H4            Bad        Moderate   Small   Convenient     p
H5            Bad        Low        Large   Convenient     p
H6            Bad        Low        Large   Inconvenient   n
H7            Average    Low        Large   Inconvenient   p
H8            Good       Moderate   Small   Convenient     n
H9            Good       Low        Large   Convenient     p
H10           Bad        Moderate   Large   Convenient     p
H11           Good       Moderate   Large   Inconvenient   p
H12           Average    Moderate   Small   Inconvenient   p
H13           Average    High       Large   Convenient     p
H14           Bad        Moderate   Small   Inconvenient   n

According to the previous explanation, we first create a rootNode containing the whole training set as its subset, and then compute its Entropy:

Decision   Property NO
P          H3, H4, H5, H7, H9, H10, H11, H12, H13
N          H1, H2, H6, H8, H14

Entropy(rootNode.subset) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

For this node (rootNode) there are 4 unused attributes, so we calculate the Information Gain of each (a short numeric check follows the list).

Gain(S, Traffic) = Entropy(S) - (8/14) Entropy(S_Convenient) - (6/14) Entropy(S_Inconvenient) = 0.048

Gain(S, Size) = 0.151

Gain(S, Price) = 0.029

Gain(S, Location) = 0.246
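The three gains that are only stated above can be verified from the per-value class counts in the table; a throwaway Python check (not part of the original post):

```python
import math

def h2(pos, neg):
    """Two-class entropy from class counts (0 * log2 0 treated as 0)."""
    total = pos + neg
    return sum(-(c / total) * math.log2(c / total) for c in (pos, neg) if c)

e_s = h2(9, 5)                       # Entropy(S) ≈ 0.940

# Size:     Small -> 3 p / 4 n,  Large -> 6 p / 1 n
gain_size     = e_s - (7/14) * h2(3, 4) - (7/14) * h2(6, 1)
# Price:    High -> 2 p / 2 n,  Moderate -> 4 p / 2 n,  Low -> 3 p / 1 n
gain_price    = e_s - (4/14) * h2(2, 2) - (6/14) * h2(4, 2) - (4/14) * h2(3, 1)
# Location: Good -> 2 p / 3 n,  Average -> 4 p / 0 n,  Bad -> 3 p / 2 n
gain_location = e_s - (5/14) * h2(2, 3) - (4/14) * h2(4, 0) - (5/14) * h2(3, 2)

# Prints roughly 0.152 0.029 0.247; the 0.151 and 0.246 quoted above differ
# in the last digit only because intermediate entropies were rounded first.
print(round(gain_size, 3), round(gain_price, 3), round(gain_location, 3))
```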

We select the attribute with the highest Information Gain, Location, as the first splitting criterion. Next, we need to decide which attribute to test below that split.

Again, calculate the Information Gain of the remaining 3 attributes under the condition Location = Good.

Entropy(S_Good) = 0.970

Gain(S_Good, Size) = 0.970

Gain(S_Good, Price) = 0.570

Gain(S_Good, Traffic) = 0.019

Hence Size is selected to create the next node under the Location = Good branch. This process repeats until all data is perfectly classified or all attributes have been used.
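These three gains can be checked in the same way, using only the five Location = Good rows (H1, H2 and H8 are n; H9 and H11 are p); again a throwaway script, not from the original post:

```python
import math

def h2(pos, neg):
    """Two-class entropy from class counts (0 * log2 0 treated as 0)."""
    total = pos + neg
    return sum(-(c / total) * math.log2(c / total) for c in (pos, neg) if c)

e_good = h2(2, 3)                                          # Entropy(S_Good) ≈ 0.971

# Size:    Small -> 0 p / 3 n,  Large -> 2 p / 0 n   (both branches pure)
gain_size    = e_good - (3/5) * h2(0, 3) - (2/5) * h2(2, 0)
# Price:   High -> 0 p / 2 n,  Moderate -> 1 p / 1 n,  Low -> 1 p / 0 n
gain_price   = e_good - (2/5) * h2(0, 2) - (2/5) * h2(1, 1) - (1/5) * h2(1, 0)
# Traffic: Convenient -> 1 p / 2 n,  Inconvenient -> 1 p / 1 n
gain_traffic = e_good - (3/5) * h2(1, 2) - (2/5) * h2(1, 1)

# Prints roughly 0.971 0.571 0.02; the 0.970 / 0.570 / 0.019 above come from
# rounding Entropy(S_Good) down to 0.970 before subtracting.
print(round(gain_size, 3), round(gain_price, 3), round(gain_traffic, 3))
```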

 

Shortcomings of ID3:

Although ID3 is widely used and considered easy, straightforward and efficient, it still has some shortcomings:

1. Attributes with more possible values tend to have higher Information Gain, so the selection is biased toward them.

2. Each node tests only one attribute, which means correlations or relationships between attributes are ignored.

3. ID3 prefers discrete attribute values. Continuous-valued attributes can still be handled by first transforming them into discrete ones (a sketch of such a transformation follows this list), but this is often considered impractical.

4. ID3 does not have a good mechanism for dealing with noise and missing values.
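For point 3, one common workaround is to bin a continuous attribute into a small set of discrete values before running ID3. A tiny sketch with made-up price thresholds (the function name and the cut-off values are purely illustrative):

```python
def discretize_price(price_in_thousands):
    """Map a hypothetical continuous price onto the discrete Low / Moderate / High
    values used by the real-estate example (thresholds are invented for illustration)."""
    if price_in_thousands < 300:
        return "Low"
    if price_in_thousands < 600:
        return "Moderate"
    return "High"

print([discretize_price(p) for p in (250, 480, 910)])   # ['Low', 'Moderate', 'High']
```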

  

 

References:

  1. T. Mitchell, "Decision Tree Learning", in T. Mitchell, Machine Learning, The McGraw-Hill Companies, Inc., 1997, pp. 52-78.
  2. Wikipedia, "ID3 algorithm", http://en.wikipedia.org/wiki/ID3_algorithm, accessed 18/09/2007.
  3. P. Winston, "Learning by Building Identification Trees", in P. Winston, Artificial Intelligence, Addison-Wesley Publishing Company, 1992, pp. 423-442.

 
