Quotes: http://kelvinzh.spaces.live.com/blog/cns!D49A2DF3B9825B01!944.entry part1
http://kelvinzh.spaces.live.com/blog/cns!D49A2DF3B9825B01!953.entry part2
Part 1:
In 1983, Professor J. Ross Quinlan of the University of Sydney introduced a simple decision tree learning algorithm called Iterative Dichotomiser 3 (ID3). ID3 is a heuristic method. The basic idea is to generate a decision tree by a top-down, greedy search through the training data: at each tree node, it seeks the attribute that best separates the instances. Top-down means the tree is built from the root downward, recursively partitioning the training set; greedy means that at each node ID3 evaluates the remaining attributes, commits to the one that looks best at that point, and never backtracks to reconsider earlier choices.
The intention of ID3 is to produce a relatively small decision tree.
How does ID3 select the best attribute? To answer this question, a new metric called Information Gain is introduced. The preference for small trees that it produces reflects Occam's Razor, and the metric itself is built on the concept of Entropy. Entropy measures the amount of information (equivalently, the impurity) in a collection of examples. Given a collection S containing m possible outcomes (X1, X2, ..., Xm):
Entropy(S) = Σi -p(Xi) * Log2 p(Xi), summed over i = 1, ..., m
where p(Xi) is the proportion of S belonging to class Xi.
To simplify for better understanding, let us assume that at each node all instances can be partitioned into two categories, namely Positive (P) and Negative (N). Then:
Entropy(S) = -p(P) * Log2 p(P) - p(N) * Log2 p(N)
Example
Suppose S can be described using the following diagram (omitted here): a Left collection with class proportions 0.5 and 0.5, and a Right collection with class proportions 0.67 and 0.33.
Then the corresponding Entropies are
Entropy(Left)= -0.5Log20.5 - 0.5Log20.5 = 1
Entropy(Right)= -0.67Log20.67 - 0.33Log20.33 = 0.92
Note that Entropy measures the impurity of a collection of training examples. Thus, entropy is 0 if all members of S belong to the same class (i.e., S is (n+, 0-) or (0+, n-)); in other words, the data is perfectly classified.
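As a quick check on the numbers above, here is a minimal Python sketch (not from the original post; the function name is my own) that computes entropy from a list of class proportions. The Right collection's proportions 0.67 and 0.33 are taken as exactly 2/3 and 1/3:

```python
from math import log2

def entropy(proportions):
    """Entropy of a collection given its class proportions.

    Terms with proportion 0 contribute nothing (0 * Log2 0 is taken as 0).
    """
    return sum(-p * log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))               # 1.0  -> the "Left" collection above
print(round(entropy([2/3, 1/3]), 2))     # 0.92 -> the "Right" collection above
print(entropy([1.0, 0.0]))               # 0.0  -> a perfectly classified collection (n+, 0-)
```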
Information Gain is defined as the expected reduction in Entropy [1].
Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) * Entropy(Sv), summed over every value v in Values(A)
Where:
Values(A) is the set of all possible values of attribute A
Sv is the subset of S for which attribute A has value v
|Sv| is the number of elements in Sv
|S| is the number of elements in S
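The formula translates directly into code. Below is a small Python sketch (my own naming, not from the original post) in which each example is a dict of attribute values and the class labels are kept in a parallel list:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = sum over classes of -p(Xi) * Log2 p(Xi), from class counts."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(example[attribute] for example in examples):
        sv = [label for example, label in zip(examples, labels)
              if example[attribute] == value]
        gain -= (len(sv) / n) * entropy(sv)
    return gain
```

The worked example below applies exactly this computation to the Traffic attribute.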
Example:
Suppose we have a real estate data set S containing 14 examples with 4 attributes, one of which is Traffic Convenience. It can take 2 possible values: Convenient and Inconvenient. The classification of each of the 14 examples indicates whether or not to buy the property. Among the 14 outcomes, 9 are YES (buy) and 5 are NO (do not buy). Suppose there are 8 occurrences of Traffic = Convenient and 6 occurrences of Traffic = Inconvenient.
Attribute | Possible Values
Traffic   | I (Inconvenient) | C (Convenient)
Decision  | p (yes/to buy)   | n (no/not to buy)
For Traffic = Convenient, 6 of the examples are YES and 2 are NO. For Traffic = Inconvenient, 3 are YES and 3 are NO.
Traffic  | C | I | C | C | C | I | I
Decision | n | n | p | p | p | n | p

Traffic  | C | C | C | I | I | C | I
Decision | n | p | p | p | p | p | n
Therefore
Entropy(S) = - (9/14)*Log2 (9/14) - (5/14)*Log2 (5/14) = 0.940
Entropy(SConvenient) = - (6/8)*Log2 (6/8) - (2/8)*Log2 (2/8) = 0.811
Entropy(SInconvenient) = - (3/6)*Log2 (3/6) - (3/6)*Log2 (3/6) = 1.00
Gain(S, Traffic) = Entropy(S)-(8/14)*Entropy(SConvenient) -(6/14)*Entropy(SInconvenient)
= 0.940 - (8/14)*0.811 - (6/14)*1.00
= 0.048
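As a sanity check on the arithmetic, the same numbers can be reproduced with a few lines of Python (a sketch of my own, using the counts quoted above):

```python
from math import log2

def entropy(pos, neg):
    """Two-class entropy from counts; a zero count contributes nothing."""
    total = pos + neg
    return sum(-(c / total) * log2(c / total) for c in (pos, neg) if c)

e_s      = entropy(9, 5)   # 0.940 -> the whole set: 9 YES, 5 NO
e_conv   = entropy(6, 2)   # 0.811 -> Traffic = Convenient: 6 YES, 2 NO
e_inconv = entropy(3, 3)   # 1.000 -> Traffic = Inconvenient: 3 YES, 3 NO
print(round(e_s - (8/14) * e_conv - (6/14) * e_inconv, 3))  # 0.048
```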
The ID3 algorithm can be summarized as follows [2]. Taking a step-by-step view, it can be decomposed into the following:
ID3 (Examples, Target_Attribute, Attributes)
- Create a root node Root for the tree, containing the whole training set as its subset.
- If all examples are positive, return the single-node tree Root, with label = +.
- If all examples are negative, return the single-node tree Root, with label = -.
- If the list of predicting attributes is empty, return the single-node tree Root, with label = the most common value of the target attribute in the examples.
- Otherwise begin:
  - A ← the attribute that best classifies the examples (highest Information Gain).
  - Decision tree attribute for Root ← A.
  - For each possible value vi of A:
    - Add a new tree branch below Root, corresponding to the test A = vi.
    - Let Examples(vi) be the subset of examples that have the value vi for A.
    - If Examples(vi) is empty, then below this new branch add a leaf node with label = the most common target value in the examples.
    - Else, below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes - {A}).
- End.
- Return Root. [2]
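The pseudocode maps quite directly onto a short recursive function. Here is a minimal Python sketch of that recursion (the function names, the dict-based representation of examples, and the returned tree structure are my own choices, not part of the original post):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, target, attribute):
    """Expected reduction in entropy from splitting on `attribute`."""
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

def id3(examples, target, attributes, values):
    """Build a decision tree following the pseudocode above.

    examples:   list of dicts mapping attribute names to values
    target:     name of the target attribute (the class label)
    attributes: predicting attributes still available at this node
    values:     dict mapping each attribute name to its possible values
    Returns either a class label (a leaf) or {attribute: {value: subtree}}.
    """
    labels = [e[target] for e in examples]
    # All examples share one label: a single-node tree with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # No predicting attributes left: label with the most common target value.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # A <- the attribute that best classifies the examples (highest gain).
    best = max(attributes, key=lambda a: information_gain(examples, target, a))
    branches = {}
    for value in values[best]:            # one branch per possible value of A
        subset = [e for e in examples if e[best] == value]
        if not subset:
            # No examples with this value: leaf with the most common label.
            branches[value] = Counter(labels).most_common(1)[0][0]
        else:
            remaining = [a for a in attributes if a != best]
            branches[value] = id3(subset, target, remaining, values)
    return {best: branches}
```

Run on the 14-example property data set from Part 2 (with target = "Decision"), this sketch should reproduce the choices described there: Location at the root, then Size under Location = Good.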
Another detailed explanation of the ID3 algorithm is provided by Professor Ernest Davis of New York University on his website. Some pseudocode can be found at http://cs.nyu.edu/faculty/davise/ai/id3.pdf.
Part 2:
Now let us take a closer look at the previous example about the decision to buy a property. In addition to Traffic Convenience, we introduce another three attributes, namely Price, Location and Size.
Attribute | Possible Values
Location  | Good | Average | Bad
Price     | High | Moderate | Low
Size      | Small | Large
Traffic   | Inconvenient | Convenient
Decision  | p (yes/to buy) | n (no/not to buy)
All details of the training data set are listed below:
Property No. | Location | Price    | Size  | Traffic      | Decision
H1           | Good     | High     | Small | Convenient   | n
H2           | Good     | High     | Small | Inconvenient | n
H3           | Average  | High     | Small | Convenient   | p
H4           | Bad      | Moderate | Small | Convenient   | p
H5           | Bad      | Low      | Large | Convenient   | p
H6           | Bad      | Low      | Large | Inconvenient | n
H7           | Average  | Low      | Large | Inconvenient | p
H8           | Good     | Moderate | Small | Convenient   | n
H9           | Good     | Low      | Large | Convenient   | p
H10          | Bad      | Moderate | Large | Convenient   | p
H11          | Good     | Moderate | Large | Inconvenient | p
H12          | Average  | Moderate | Small | Inconvenient | p
H13          | Average  | High     | Large | Convenient   | p
H14          | Bad      | Moderate | Small | Inconvenient | n
According to the previous explanation, we first create a rootNode containing the whole training set as its subset, then compute its Entropy:
Decision     | P                                      | N
Property No. | H3, H4, H5, H7, H9, H10, H11, H12, H13 | H1, H2, H6, H8, H14
Entropy(rootNode.subset) = -(9/14)log2(9/14) - (5/14)log2(5/14)=0.940
For this node (rootNode), there are 4 unused attributes. Calculate the Information Gain for each of these 4:
Gain(S,Traffic) = Entropy(S)-(8/14)Entropy(SConvenient) - (6/14)Entropy(SInconvenient) =0.048
Gain(S,Size) =0.151
Gain(S, Price)=0.029
Gain(S, Location) = 0.246
Select the attribute with the highest Information Gain, which is Location, as the first tree-splitting criterion. Next, we need to decide which attribute to pick below it.
Again, calculate the Information Gain of the remaining 3 attributes under the condition Location = Good.
Entropy(SGood)=0.970
Gain(SGood, Size) = 0.970
Gain(SGood, Price) = 0.570
Gain(SGood, Traffic) = 0.019
Hence Size is selected to create the node. This process repeats until all data is perfectly classified or all the attributes are used.
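For readers who want to verify the walk-through, here is a self-contained Python sketch (my own helper names; the data is copied from the table above) that recomputes the gains at the root and within the Location = Good branch:

```python
from collections import Counter
from math import log2

# The 14 training examples from the table above:
# (Location, Price, Size, Traffic, Decision)
DATA = [
    ("Good",    "High",     "Small", "Convenient",   "n"),  # H1
    ("Good",    "High",     "Small", "Inconvenient", "n"),  # H2
    ("Average", "High",     "Small", "Convenient",   "p"),  # H3
    ("Bad",     "Moderate", "Small", "Convenient",   "p"),  # H4
    ("Bad",     "Low",      "Large", "Convenient",   "p"),  # H5
    ("Bad",     "Low",      "Large", "Inconvenient", "n"),  # H6
    ("Average", "Low",      "Large", "Inconvenient", "p"),  # H7
    ("Good",    "Moderate", "Small", "Convenient",   "n"),  # H8
    ("Good",    "Low",      "Large", "Convenient",   "p"),  # H9
    ("Bad",     "Moderate", "Large", "Convenient",   "p"),  # H10
    ("Good",    "Moderate", "Large", "Inconvenient", "p"),  # H11
    ("Average", "Moderate", "Small", "Inconvenient", "p"),  # H12
    ("Average", "High",     "Large", "Convenient",   "p"),  # H13
    ("Bad",     "Moderate", "Small", "Inconvenient", "n"),  # H14
]
COLUMNS = {"Location": 0, "Price": 1, "Size": 2, "Traffic": 3}

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute):
    idx = COLUMNS[attribute]
    g = entropy([row[-1] for row in rows])
    for value in set(row[idx] for row in rows):
        subset = [row[-1] for row in rows if row[idx] == value]
        g -= (len(subset) / len(rows)) * entropy(subset)
    return g

# Gains at the root node; Location is the largest, as in the text
# (the values agree with the text's figures up to rounding in the last digit).
for attribute in COLUMNS:
    print(attribute, round(gain(DATA, attribute), 3))

# Gains within the Location = Good branch; Size is the largest.
good = [row for row in DATA if row[0] == "Good"]
for attribute in ("Size", "Price", "Traffic"):
    print(attribute, round(gain(good, attribute), 3))
```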
Although ID3 is widely used and considered easy, straightforward and efficient, the method still has some shortcomings:
1. Attributes with more possible values tend to have higher Information Gain (a small sketch illustrating this follows the list).
2. Each node tests only one attribute, which means that correlations or relationships between attributes are ignored.
3. ID3 prefers discrete attribute values. Although ID3 can still be applied to continuous-valued attributes by using particular data transformation methods, this is generally considered impractical.
4. ID3 does not have a good mechanism for dealing with noise and missing values.
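Point 1 is easy to demonstrate: an attribute that takes a distinct value on every example (for instance the property number itself) splits the data into pure single-element subsets and therefore achieves the maximum possible gain, Entropy(S), even though it is useless for prediction. Here is a small Python sketch of my own, reusing the Traffic and Decision values from Part 1:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    g = entropy(labels)
    for v in set(values):
        subset = [label for value, label in zip(values, labels) if value == v]
        g -= (len(subset) / len(labels)) * entropy(subset)
    return g

# Traffic and Decision for the 14 examples (H1..H14), as in Part 1.
traffic  = ["C", "I", "C", "C", "C", "I", "I", "C", "C", "C", "I", "I", "C", "I"]
decision = ["n", "n", "p", "p", "p", "n", "p", "n", "p", "p", "p", "p", "p", "n"]
house_id = ["H%d" % i for i in range(1, 15)]   # a unique value per example

print(round(gain(traffic, decision), 3))   # 0.048 - a modest but genuine split
print(round(gain(house_id, decision), 3))  # 0.940 - maximal gain, yet useless for prediction
```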