Ever wondered how a loan application gets accepted or rejected? Ever given a thought to how a sales team realizes that future demand for a product will increase, so the warehouse needs to be kept stocked? The answer is as simple as it gets: classification using the Decision Tree algorithm.
Many factors influence the output; in our case the output is whether or not to accept the loan application. These factors are known as attributes. The next question that pops up is how we decide which of, say, 10 attributes best suits splitting the tree into subsets. To achieve that we apply the concept of the least-impurity feature, which in layman's terms means the feature with the fewest irregularities in terms of its contribution to the output.
Parts of a Decision Tree
A decision tree is a tree-like flowchart structure consisting of a root node (decision node), leaf nodes, and decision rules.
Decision Node: a node that partitions the data based on an attribute.
Decision Rule: a branch representing the state of the attribute corresponding to a decision node.
Leaf Node: a node derived from a decision node that has only an incoming branch and no outgoing branches; it holds the outcome.
How Does the Algorithm Work?
- Using the Gini impurity measure, find the best attribute.
- Set it as the root node, along with the subset of records from the dataset associated with it.
- Repeat the above process for the child nodes until either of the conditions below is satisfied:
1. The majority of the records in the subset belong to the same class.
2. No more attributes are left to split on.
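The steps above can be sketched in plain Python. This is a minimal illustration under assumptions, not the article's exact implementation: records are dicts of attribute values, the stop condition is simplified to node purity, and all names (`gini`, `split_gini`, `build`, `predict`) are invented for this sketch.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node holding the given class labels."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(rows, labels, attr):
    """Weighted average Gini impurity after splitting on `attr`."""
    n = len(rows)
    total = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        total += len(subset) / n * gini(subset)
    return total

def build(rows, labels, attrs):
    """Recursively pick the lowest-impurity attribute until a stop condition."""
    label, count = Counter(labels).most_common(1)[0]
    # stop: the node is pure (the article's "majority" condition, simplified)
    # or no attributes are left to split on
    if not attrs or count == len(labels):
        return label
    best = min(attrs, key=lambda a: split_gini(rows, labels, a))
    node = {"attribute": best, "branches": {}}
    for value in {row[best] for row in rows}:
        sub = [(row, lab) for row, lab in zip(rows, labels) if row[best] == value]
        sub_rows, sub_labels = zip(*sub)
        node["branches"][value] = build(list(sub_rows), list(sub_labels), attrs - {best})
    return node

def predict(tree, row):
    """Walk the tree from the root until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# toy usage on a hypothetical single-attribute dataset
rows = [{"cough": "yes"}, {"cough": "yes"}, {"cough": "no"}]
tree = build(rows, ["pos", "pos", "neg"], {"cough"})
print(predict(tree, {"cough": "no"}))  # → neg
```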
Example:
Consider a dataset of 331 patient records with Covid symptoms as attributes. We aim to classify each patient as Covid positive or negative.
In this example we want to create a tree that uses Sneeze, Cough, Fever, and Bodyache status to predict whether or not a patient has Covid. We start by looking at how well Cough alone predicts Covid: we take all 331 patients and map whether or not each one has a cough against whether or not each one has Covid.
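This mapping is just a contingency count. A hypothetical miniature version (the sample below is made up for illustration; it is not the article's 331-patient dataset):

```python
from collections import Counter

# hypothetical (cough, covid) records -- illustrative only
records = [("yes", "pos"), ("yes", "pos"), ("yes", "neg"),
           ("no", "neg"), ("no", "neg"), ("no", "pos")]

# count each (cough, covid) combination, e.g. how many patients
# with a cough are Covid positive
tally = Counter(records)
print(tally[("yes", "pos")])  # → 2
```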
Similarly, we calculate how well Sneeze, Fever, and Bodyache predict the disease.
As we can see, none of the leaf nodes is 100% 'Yes to Covid', nor is any of them 100% 'No to Covid', so they are all considered impure. To determine the best attribute we need to single out the one with minimal impurity. There are many ways to measure impurity, but in this case we can use the Gini split method.
Now let's consider the calculation of the Gini impurity for the left leaf node of the Cough decision tree shown in the images above.
For the left node, the Gini impurity
= 1 - a² - b²
= 1 - (105/(105+39))² - (39/(39+105))²
where
a = probability of the 'Yes' (Covid positive) patients in the node
b = probability of the 'No' (Covid negative) patients in the node
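The formula can be checked in a couple of lines of Python. The counts below (105 'Yes', 39 'No') are the ones that reproduce the stated left-node impurity of 0.395; the function name is invented for this sketch:

```python
def gini_node(yes, no):
    """Gini impurity of a node with `yes` positive and `no` negative records."""
    total = yes + no
    a = yes / total  # probability of 'Yes'
    b = no / total   # probability of 'No'
    return 1 - a ** 2 - b ** 2

print(round(gini_node(105, 39), 3))  # → 0.395
```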
Gini impurity of left node = 0.395
Gini impurity of right node = 0.336
Now that we have calculated the Gini impurity for both leaf nodes, we can calculate the total Gini impurity of using Cough to separate patients with and without the disease.
Gini impurity for Cough = weighted average of the Gini impurities of the leaf nodes
= (144/(144+139))*0.395 + (139/(139+144))*0.336
= 0.364
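The weighted average itself is a one-line computation over leaf sizes and per-leaf impurities. A small sketch (the helper name is invented, and with the rounded per-leaf values quoted above the result agrees with the stated total only approximately):

```python
def weighted_gini(sizes, impurities):
    """Weighted average of per-leaf Gini impurities, weighted by leaf size."""
    total = sum(sizes)
    return sum(n / total * g for n, g in zip(sizes, impurities))

# leaf sizes and impurities for the Cough split, as quoted above
print(round(weighted_gini([144, 139], [0.395, 0.336]), 3))
```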
Using the same method we can calculate the Gini impurity for Sneeze, Fever, and Bodyache:
Gini impurity for Cough = 0.364
Gini impurity for Sneeze = 0.360
Gini impurity for Fever = 0.381
Gini impurity for Bodyache = 0.374
Since Sneeze has the lowest impurity, we conclude that the root decision node should be the Sneeze attribute.
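Picking the splitting attribute is then just a matter of taking the minimum:

```python
# total (weighted) Gini impurity of each candidate attribute, from above
impurities = {"Cough": 0.364, "Sneeze": 0.360, "Fever": 0.381, "Bodyache": 0.374}

# the attribute with the lowest impurity becomes the decision node
best = min(impurities, key=impurities.get)
print(best)  # → Sneeze
```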
Next we find the attribute that best fits the left node of the Sneeze decision node. There are (37+127) = 164 patients in the left node of the Sneeze tree. We now calculate the Gini impurity for the remaining attributes Cough, Fever, and Bodyache, which split these patients into Covid (37 patients) and non-Covid (127 patients).
Gini impurity of Cough = 0.3
Gini impurity of Fever = 0.290
Gini impurity of Bodyache = 0.310
Thus we conclude that the left child of the Sneeze node should be set to the Fever attribute.
Next we find the attribute that best fits the left node of the Fever decision node. There are (24+25) = 49 patients in the left node of the Fever tree. We now calculate the Gini impurity for the remaining attributes Cough and Bodyache, which split these patients into Covid (24 patients) and non-Covid (25 patients).
Gini impurity of Cough = 0.20
Gini impurity of Bodyache = 0.29
Thus we conclude that the left child of the Fever node should be set to the Cough attribute.
Now consider the left and right nodes of Cough and calculate the Gini impurity for each of them individually, along with the Gini impurity of Bodyache. The impurity of these nodes is much lower than the Bodyache impurity, so they can be kept as leaf nodes rather than split further.
Gini impurity of Bodyache = 0.29
Gini impurity of left node of Cough = 0.12
Gini impurity of right node of Cough = 0.19
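This stopping check (keep the current children as leaves when a further split would not reduce impurity) can be sketched with the values quoted above; the variable names are invented for the sketch:

```python
split_impurity = 0.29           # Gini impurity of splitting further on Bodyache
leaf_impurities = [0.12, 0.19]  # impurities of Cough's left and right nodes

# keep the nodes as leaves: they are already purer than the candidate split
keep_as_leaves = all(g < split_impurity for g in leaf_impurities)
print(keep_as_leaves)  # → True
```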
This is the entire process of constructing a decision tree from scratch based on impurity.
Thanks for reading!
The references for this article are:
https://www.youtube.com/watch?v=7VeUPuFGJHk
https://www.datacamp.com/community/tutorials/decision-tree-classification-python
Translated from: https://medium.com/analytics-vidhya/decision-tree-classifier-5378bff033c5