Decision Tree Algorithm in Machine Learning
A decision tree is a non-parametric supervised machine learning algorithm. It is extremely useful for classifying or labeling objects, and it works for both categorical and continuous data. It has a tree structure consisting of a root node and child nodes: each internal node tests a feature of the dataset, and predictions are made at the leaf (terminal) nodes.
Recursive Greedy Algorithm
- A recursive greedy algorithm is a very simple, intuitive algorithm that is used in optimization problems.
- At every step, a recursive greedy algorithm faces a choice. Instead of evaluating all choices recursively and then picking the best one, it picks the best choice available at that step, goes with it, recurses, and does the same thing again. So basically a recursive greedy algorithm picks the locally optimal choice, hoping to end up with the best globally optimal solution.
- Greedy algorithms are very powerful for some problems, such as Huffman encoding or Dijkstra's algorithm, which are staples of data structures and algorithms. We will be using this algorithm for the formation of the tree.
- Steps for learning the decision tree:
- Step 1: Start with an empty tree
- Step 2: Select a feature to split the data
- For each split of the tree:
- Step 3: If there is nothing more to do, predict with the leaf (terminal) node
- Step 4: Otherwise, go to Step 2 and continue (recurse) to split
For example, let's say we start with an empty tree and pick a feature to split on. In our case, we split on credit: we take the data and split it into the data points with excellent credit, those with fair credit, and those with poor credit, and then for each subset (excellent, fair, poor) we continue thinking about what to do next. In the case of excellent credit, there is nothing else to do, so we stop; but in the other two cases there is more to do, and what we do is called recursion. We go back to step 2, but only look at the subset of the data that has fair credit, and then only at the subset that has poor credit. So far this algorithm sounds a little abstract, and there are a few points we need to make more concrete. We have to decide how to pick the feature to split on: we split on credit in our example, but we could have split on something else, like the term of the loan or the borrower's income. And since we have recursion at the end, we have to figure out when to stop recursing and when not to go and expand another node of the tree.
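To make these steps concrete, here is a minimal sketch of the recursive greedy loop, assuming the data lives in a pandas DataFrame with categorical features. Node, build_tree, and best_split are hypothetical names rather than library code; best_split (choosing the feature with the lowest classification error) is sketched under Problem 1 below.

import pandas as pd

class Node:
    def __init__(self, prediction=None, feature=None):
        self.prediction = prediction   # set for leaf (terminal) nodes
        self.feature = feature         # feature this node splits on
        self.children = {}             # feature value -> child Node

def build_tree(data, target, remaining_features):
    # Stopping condition 1: all labels agree, so predict that label
    if data[target].nunique() == 1:
        return Node(prediction=data[target].iloc[0])
    # Stopping condition 2: no features left, so predict the majority class
    if not remaining_features:
        return Node(prediction=data[target].mode()[0])
    # Greedy step: pick the feature with the lowest classification error (Problem 1)
    feature = best_split(data, target, remaining_features)
    node = Node(feature=feature)
    rest = [f for f in remaining_features if f != feature]
    # Recurse on each subset of the data (e.g. excellent / fair / poor credit)
    for value, subset in data.groupby(feature):
        node.children[value] = build_tree(subset, target, rest)
    return node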
Problem 1: Feature split selection
- Given a subset of the dataset M (a node in the tree)
- For each feature h(x):
- Split the data of M according to feature h(x)
- Compute the classification error of the split
- Choose the feature h*(x) with the lowest classification error
Classification error is the number of mistakes in a node divided by the number of data points in that node. For example, in the diagram above, the root node misclassifies 18 risky loans, which count as mistakes, and the total number of data points in the root node is 40, so the classification error of the root node is 18/40 = 0.45. We then compute the classification error for each candidate split and pick the feature with the lowest error as our best split.
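Below is a small sketch of how this split selection could be computed with pandas; classification_error and best_split are the hypothetical helpers referenced in the build_tree sketch above, not library functions.

def classification_error(labels):
    # Mistakes in a node = data points that do not belong to the majority class
    mistakes = len(labels) - labels.value_counts().max()
    return mistakes / len(labels)   # e.g. 18 mistakes out of 40 points gives 0.45

def best_split(data, target, features):
    # Weighted classification error of the children produced by splitting on a feature
    def split_error(feature):
        return sum(classification_error(subset[target]) * len(subset) / len(data)
                   for _, subset in data.groupby(feature))
    # Choose the feature h*(x) with the lowest classification error
    return min(features, key=split_error)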
Problem 2: When do we stop splitting?
- The first stopping condition: stop splitting when all the data points in the node agree on the value of y.
- The second stopping condition: stop when we have already split on all the features, so there is nothing left in our dataset to split on (see the sketch below).
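As a small illustration, the two checks could be packaged into a helper like the one below (hypothetical code, matching the build_tree sketch above); the hyperparameters in the next section, such as max_depth, simply add extra early-stopping rules on top of these.

def should_stop(data, target, remaining_features, depth, max_depth=None):
    # Stopping condition 1: all data points agree on the value of y
    if data[target].nunique() == 1:
        return True
    # Stopping condition 2: we have already split on all the features
    if not remaining_features:
        return True
    # Optional early stopping (see the max_depth hyperparameter below)
    if max_depth is not None and depth >= max_depth:
        return True
    return False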
The Common Parameters of the Decision Tree
criterion: Gini or entropy, (default=gini)
The function used to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. It determines how candidate splits are scored when deciding where to split next.
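As a quick illustration of the two criteria: for class proportions p_i in a node, Gini impurity is 1 - sum(p_i^2) and entropy is -sum(p_i * log2(p_i)). Below is a small sketch with NumPy; the class counts are made up, loosely matching the 40-point node from the earlier example.

import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                      # avoid log2(0)
    return -np.sum(p * np.log2(p))

# A node with 22 safe loans and 18 risky loans:
print(gini([22, 18]))     # about 0.495
print(entropy([22, 18]))  # about 0.993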
max_depth: int or None, (default = None)
- The first hyperparameter to tune in a decision tree is max_depth.
- max_depth is, as the name suggests, the maximum depth you allow the tree to grow to.
- The deeper the tree, the more splits it has and the more information it captures about the data.
- However, in general, a decision tree overfits for large depth values: the tree predicts all of the training data perfectly, but it fails to capture the patterns in new data.
- So you have to find the right maximum depth through hyperparameter tuning, using either grid search or random search, to arrive at the best possible value of max_depth (a grid search sketch follows this list).
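For instance, a grid search over max_depth could look like the following sketch; the candidate depths are arbitrary, and X and y stand for a feature matrix and target vector prepared as in the code section later in this post.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': [2, 4, 6, 8, 10, None]}   # arbitrary candidate depths
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X, y)   # X, y: your feature matrix and target vector
print(search.best_params_, search.best_score_)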
min_samples_split: int or float, (default = 2)
- An internal node is a node that will be split further (it has children).
- min_samples_split specifies the minimum number of samples required to split an internal node.
- We can either specify an integer to denote the minimum number of samples, or a fraction to denote the minimum percentage of samples required in an internal node.
min_samples_leaf: int or float (default = 1)
- A leaf node is a node without any children (without any further splits).
- min_samples_leaf is the minimum number of samples required to be at a leaf node.
- This parameter is similar to min_samples_split; however, it describes the minimum number of samples at a leaf, the bottom of the tree.
- This hyperparameter also helps avoid overfitting.
max_features: int, float, string (default = None)
- max_features represents the number of features to consider when looking for the best split.
- We can either specify a number to denote the max_features at each split, or a fraction to denote the percentage of features to consider while making a split.
- We also have options such as "sqrt", "log2", and None.
- This parameter is used to control overfitting. In fact, it is similar to the technique used in a random forest, except that a random forest also samples from the data and builds multiple trees. (A combined example of all these parameters is sketched after this list.)
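Putting these parameters together, a classifier configured with all of them could look like the sketch below; the specific values are only for illustration, not recommendations.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    criterion='entropy',      # or 'gini' (the default)
    max_depth=6,              # limit the depth of the tree to reduce overfitting
    min_samples_split=20,     # an integer, or a fraction such as 0.01
    min_samples_leaf=10,      # minimum samples required at each leaf
    max_features='sqrt',      # consider sqrt(n_features) at each split
    random_state=0,
)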
Code For Decision Tree Algorithms:
We will be using a dataset from LendingClub. A parsed and cleaned form of the dataset is available here. Make sure you download the dataset before running the following commands.
The train and validation datasets can be found here.
#Identifying safe loans with decision trees
#The LendingClub is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors.
#In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.
#You will use data from the LendingClub to predict whether a loan will be paid off in full or the loan will be charged off and possibly go into default
import pandas as pd
import numpy as np
loans = pd.read_csv("lending-club-data.csv")
loans.head()
#Exploring some features
#Let's quickly explore what the dataset looks like. First, let's print out the column names to see what features we have in this dataset.
loans.columns
#Exploring the target column
#The target column (label column) of the dataset that we are interested in is called bad_loans.
#In this column 1 means a risky (bad) loan 0 means a safe loan.
#In order to make this more intuitive and consistent with the lectures, we reassign the target to be:
# +1 as a safe loan,
# -1 as a risky (bad) loan.
#We put this in a new column called safe_loans.
# safe_loans = 1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.drop('bad_loans',axis=1)
features = ['grade', # grade of the loan
'sub_grade', # sub-grade of the loan
'short_emp', # one year or less of employment
'emp_length_num', # number of years of employment
'home_ownership', # home_ownership status: own, mortgage or rent
'dti', # debt to income ratio
'purpose', # the purpose of the loan
'term', # the term of the loan
'last_delinq_none', # has borrower had a delinquincy
'last_major_derog_none', # has borrower had 90 day or worse rating
'revol_util', # percent of available credit being used
'total_rec_late_fee', # total late fees received to day
]
target = 'safe_loans' # prediction target (y) (+1 means safe, -1 is risky)
# Extract the feature columns and target column
loans = loans[features + [target]]
#Sample data to balance classes
import matplotlib.pyplot as plt
count_classes = loans['safe_loans'].value_counts(sort=True)
count_classes.plot(kind='bar', rot=0)
plt.title("loans class distribution ")
plt.xlabel("class")
plt.ylabel("Frequency")
from imblearn.under_sampling import NearMiss # NearMiss undersampling could balance the classes; it is not applied below
categorical_var = [m for m in loans.columns if loans[m].dtypes == object]
categorical_var
loans.head()
one_hot_df = pd.get_dummies(loans)
import json
with open('module-5-assignment-1-train-idx.json', 'r') as f: # Reads the list of training row indices
train_idx = json.load(f)
with open('module-5-assignment-1-validation-idx.json', 'r') as f1: # Reads the list of validation row indices
validation_idx = json.load(f1)
train_data = one_hot_df.iloc[train_idx]
validation_data = one_hot_df.iloc[validation_idx]
safe_loans_prob = round(float(sum(validation_data['safe_loans'] == 1))/len(validation_data),2)
safe_loans_prob
#Use decision tree to build a classifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
decision_tree_model = DecisionTreeClassifier(max_depth=6)
x = train_data.drop('safe_loans', axis=1)
decision_tree_model.fit(x,train_data[target])
decision_tree_model.predict(validation_data.drop('safe_loans', axis=1))
decision_tree_model.predict_proba(validation_data.drop('safe_loans', axis=1))
#Exploring the accuracy on the validation data
decision_tree_model.score(validation_data.drop('safe_loans', axis=1), validation_data[target])
Thanks for reading…..
Translated from: https://medium.com/ai-in-plain-english/decision-tree-algorithm-in-machine-learning-8aecef85ae6d