Tree- and forest-based algorithms are among the most widely used machine learning algorithms. The basic building block of tree-based learning is the decision tree, and one advantage of tree models is their interpretability.

Creating a decision tree
To train a decision tree classifier, use scikit-learn's DecisionTreeClassifier:
decision_tree_classifier.py
# Load libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create decision tree classifier object
decisiontree = DecisionTreeClassifier(random_state=0)
# Train model
model = decisiontree.fit(features, target)
# Make new observation
observation = [[5, 4, 3, 2]]
# Predict observation's class
print(model.predict(observation))
# Predict the probability of each class for the observation
print(model.predict_proba(observation))
# Create decision tree classifier object using entropy
decisiontree_entropy = DecisionTreeClassifier(
criterion='entropy', random_state=0)
# Train model
model_entropy = decisiontree_entropy.fit(features, target)
By default, DecisionTreeClassifier measures the impurity of a node with Gini impurity. Use model.predict(observation) to predict an observation's class, and model.predict_proba(observation) to view the predicted probability of each class. To use a different impurity measure, specify it with the criterion parameter:
# Create decision tree classifier object using entropy
decisiontree_entropy = DecisionTreeClassifier(
criterion='entropy', random_state=0)
# Train model
model_entropy = decisiontree_entropy.fit(features, target)
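To make the two impurity measures concrete, here is a minimal sketch (my own addition, not from the original text) that computes Gini impurity and entropy by hand for the class labels in a node:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p_i * log2(p_i)) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])  # toy node with 3 classes
print(gini(node))     # 0.66
print(entropy(node))  # ~1.571

The tree grows by greedily choosing, at each node, the split that most decreases the chosen impurity measure.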
Training a decision tree regressor: scikit-learn's DecisionTreeRegressor
# Load libraries
from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets
# Load data with only two features
# (note: load_boston is deprecated and was removed in newer scikit-learn releases)
boston = datasets.load_boston()
features = boston.data[:, 0:2]
target = boston.target
# Create decision tree regressor object
decisiontree = DecisionTreeRegressor(random_state=0)
# Train model
model = decisiontree.fit(features, target)
# Make new observation
observation = [[0.02, 16]]
# Predict observation's target value
model.predict(observation)
# Train a second tree that uses MAE as the split criterion
decisiontree_mae = DecisionTreeRegressor(random_state=0, criterion='mae')
model_mae = decisiontree_mae.fit(features, target)
A decision tree regressor works just like the classifier, except that Gini impurity (or entropy) is replaced by MSE as the split criterion. With DecisionTreeRegressor:

Predict: predict(observation)
Change the split criterion: criterion

# Predict
model.predict(observation)
# Use MAE instead of the default MSE
# (in newer scikit-learn versions 'mae' has been renamed 'absolute_error')
decisiontree_mae = DecisionTreeRegressor(random_state=0, criterion='mae')
model_mae = decisiontree_mae.fit(features, target)
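To illustrate what "MSE as the impurity measure" means, the following sketch (my own illustration, not from the original text) scores a candidate split by the weighted MSE of the two child nodes; the regression tree greedily picks the split that reduces this score the most:

import numpy as np

def split_mse(y_left, y_right):
    # Weighted MSE of the two children; each child's MSE is the variance
    # of its targets around the child's own mean prediction
    n = len(y_left) + len(y_right)
    mse_l = np.mean((y_left - y_left.mean()) ** 2)
    mse_r = np.mean((y_right - y_right.mean()) ** 2)
    return (len(y_left) * mse_l + len(y_right) * mse_r) / n

y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8])
print(split_mse(y[:3], y[3:]))  # good split: low weighted MSE (~0.016)
print(split_mse(y[:2], y[2:]))  # worse split: much higher weighted MSE (~2.08)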
Visualizing a decision tree in DOT format
# Load libraries
import pydotplus
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
from IPython.display import Image
from sklearn import tree
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create decision tree classifier object
decisiontree = DecisionTreeClassifier(random_state=0)
# Train model
model = decisiontree.fit(features, target)
# Create DOT data
dot_data = tree.export_graphviz(decisiontree,
out_file=None, # return the DOT data as a string instead of writing a file
feature_names=iris.feature_names,
class_names=iris.target_names)
# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)
# Show graph
Image(graph.create_png())
# Create PDF
graph.write_pdf("iris.pdf")
# Create PNG
graph.write_png("iris.png")
Being able to inspect the entire decision tree is exactly why decision trees are considered among the most interpretable models. The tree can be exported in DOT graph format and rendered as a PDF or PNG. This requires installing pydotplus; when I ran it I hit the following error:

pydotplus.graphviz.InvocationException: GraphViz's executables not found

If you are working in a conda environment, make sure to install the conda packages:

conda install pydotplus
conda install graphviz
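If GraphViz keeps causing trouble, a workaround (my own suggestion, not mentioned in the original text) is sklearn.tree.plot_tree, available since scikit-learn 0.21, which renders the tree with matplotlib and needs no external executables. It reuses the fitted decisiontree and iris objects from the block above:

import matplotlib.pyplot as plt
from sklearn import tree

# Render the fitted tree directly with matplotlib (no GraphViz needed)
plt.figure(figsize=(12, 8))
tree.plot_tree(decisiontree,
               feature_names=iris.feature_names,
               class_names=list(iris.target_names),
               filled=True)
plt.show()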
RandomForestClassifier
# Load libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create random forest classifier object
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1)
# Train model
model = randomforest.fit(features, target)
# Make new observation
observation = [[5, 4, 3, 2]]
# Predict observation's class
print(model.predict(observation))
# Train a random forest using entropy as the split criterion
random_forest_entropy = RandomForestClassifier(
criterion="entropy", random_state=0)
# Train model
model_entropy = random_forest_entropy.fit(features, target)
print(model_entropy.predict(observation))
Key parameters of RandomForestClassifier (a short sketch follows this list):

n_estimators: how many trees make up the forest (the default here is 10)
random_state: controls the randomness used to grow each tree
criterion: the split criterion; gini and entropy are the common choices
max_features: the maximum number of features considered at each node; by default max_features is set to sqrt, meaning that with $n$ features each node considers $\sqrt{n}$ of them
bootstrap: whether to use bootstrap sampling (if True, sampling is done with replacement, so the same observation can appear more than once in a tree's training set)
n_jobs: how many cores to use for training
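As a minimal sketch, here is a classifier with these parameters spelled out explicitly (the values are illustrative choices of mine, not recommendations from the book); it reuses features and target from above:

from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier(
    n_estimators=100,     # grow 100 trees instead of the default
    criterion="gini",     # split quality measure ("gini" or "entropy")
    max_features="sqrt",  # consider sqrt(n) features at each split
    bootstrap=True,       # sample with replacement for each tree
    n_jobs=-1,            # use all available cores
    random_state=0)
model = randomforest.fit(features, target)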
Training a random forest regressor
RandomForestRegressor
random_forest_regressor.py
# Load libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
# Load data with only two features
boston = datasets.load_boston()
features = boston.data[:,0:2]
target = boston.target
# Create random forest regressor object
randomforest = RandomForestRegressor(random_state=0, n_jobs=-1)
# Train model
model = randomforest.fit(features, target)
RandomForestRegressor is to DecisionTreeRegressor what RandomForestClassifier is to DecisionTreeClassifier: the same ensemble idea, applied to regression trees instead of classification trees.

Identifying important features
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
# Load the iris data set
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create random forest classifier object
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1)
# Train model
model = randomforest.fit(features, target)
# Calculate feature importances
importances = model.feature_importances_
# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]
# Rearrange feature names so they match the sorted importances
names = [iris.feature_names[i] for i in indices]
plt.figure()
plt.title("Feature Importance")
plt.bar(range(features.shape[1]), importances[indices])
# Add feature names as x-axis labels
plt.xticks(range(features.shape[1]), names, rotation=0)
# Show plot
plt.show()
My results differ slightly from the book's; presumably there is still some randomness involved. Random forests are interpretable in the sense that we can compute which features matter most to the model: scikit-learn exposes each feature's importance through the model's feature_importances_ attribute.

Selecting important features with SelectFromModel
# Load libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.feature_selection import SelectFromModel
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create random forest classifier object
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1)
# Create object that selects features with importance above the threshold
selector = SelectFromModel(randomforest, threshold=0.3)
# Create new feature matrix containing only the selected features
features_important = selector.fit_transform(features, target)
# Train random forest using the most important features
model = randomforest.fit(features_important, target)
SelectFromModel builds a new feature matrix containing only the features whose importance exceeds the value given by the threshold parameter.
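To see which features survived the threshold, a small follow-up sketch (my addition, not part of the original recipe) uses the selector's get_support() mask together with the iris feature names from the block above:

import numpy as np

# Boolean mask of the features kept by the selector
mask = selector.get_support()
print(np.array(iris.feature_names)[mask])
# e.g. ['petal length (cm)' 'petal width (cm)'] with threshold=0.3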
Handling imbalanced classes with class_weight="balanced"
# 库
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Make the class distribution highly imbalanced by deleting the first 40 observations
features = features[40:, :]
target = target[40:]
# Relabel: class 0 stays 0, every other class becomes 1
target = np.where((target == 0), 0, 1)
# Create random forest classifier object with balanced class weights
randomforest = RandomForestClassifier(
random_state=0, n_jobs=-1, class_weight="balanced")
# Train model
model = randomforest.fit(features, target)
The class_weight parameter assigns a weight to each class. With class_weight="balanced", class $j$ receives the weight

$\omega_j = \frac{n}{k n_j}$

where $n$ is the total number of observations, $k$ is the number of classes, and $n_j$ is the number of observations in class $j$. The weight is inversely proportional to the class's size, so smaller classes receive larger weights.
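As a quick check of this formula on the data above (my own verification, not from the original text): n = 110, k = 2, n_0 = 10 and n_1 = 100, so ω_0 = 110/(2·10) = 5.5 and ω_1 = 110/(2·100) = 0.55. scikit-learn's compute_class_weight utility yields the same numbers:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Same relabeled target as above: 10 samples of class 0, 100 of class 1
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=target)
print(weights)  # [5.5  0.55]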
Controlling the size of a tree
Tree size can be controlled through structural parameters:
tree_size.py
# Load libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create decision tree classifier object
decisiontree = DecisionTreeClassifier(random_state=0,
max_depth=None, # maximum depth of the tree
min_samples_split=2, # minimum number of samples required to split an internal node
min_samples_leaf=1, # minimum number of samples required at a leaf node
min_weight_fraction_leaf=0,
max_leaf_nodes=None, # maximum number of leaf nodes
min_impurity_decrease=0) # minimum impurity decrease required for a split
# Train model
model = decisiontree.fit(features, target)
This recipe covers the main size-limiting parameters of scikit-learn's DecisionTreeClassifier (and DecisionTreeRegressor); see the sketch after this list:

max_depth: the maximum depth of the tree. None means the tree is grown until all leaves are pure.
min_samples_split: the minimum number of samples an internal node must contain before it can be split.
min_samples_leaf: the minimum number of samples required at a leaf node.
max_leaf_nodes: the maximum number of leaf nodes.
min_impurity_split: the minimum impurity a node must have before it can be split (impurity here meaning measures such as entropy or Gini).
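As a small illustration of how these parameters shrink a tree (my own sketch, not from the original text), compare the node counts of an unconstrained tree and a depth-limited one on the iris data:

from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

iris = datasets.load_iris()

# Unconstrained tree: grown until all leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
# Constrained tree: at most 2 levels of splits
small_tree = DecisionTreeClassifier(random_state=0, max_depth=2).fit(iris.data, iris.target)

print(full_tree.tree_.node_count)   # e.g. 17 nodes
print(small_tree.tree_.node_count)  # e.g. 5 nodes with max_depth=2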
Training a better-performing model with boosting
scikit-learn provides AdaBoostClassifier for classification and AdaBoostRegressor for regression.
boosting.py
# Load libraries
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create adaboost tree classifier object
adaboost = AdaBoostClassifier(random_state=0)
# Train model
model = adaboost.fit(features, target)
Compared with random forests, boosting often delivers better performance. The most common boosting method is AdaBoost.

The original text doesn't explain this very clearly; in fact there are two kinds of weights involved: both the weak learners and the training samples carry weights.

Sample weight update: samples that were misclassified get their weights increased, while correctly classified samples get their weights decreased. This makes the misclassified samples stand out, yielding a new sample distribution for the next round.

Weak learner weight update: weak learners with higher accuracy receive larger weights; those with lower accuracy receive smaller weights.
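To make the two weight updates concrete, here is a minimal sketch of the classic binary AdaBoost loop (a simplified illustration of mine, not scikit-learn's actual implementation); it assumes labels encoded as -1/+1:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Toy AdaBoost for binary labels y in {-1, +1}, using decision stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)             # sample weights start uniform
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()        # weighted training error
        if err >= 0.5:                  # weak learner no better than chance
            break
        # Weak-learner weight: the more accurate the stump, the larger alpha
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        stumps.append(stump)
        alphas.append(alpha)
        # Sample weight update: increase weights of misclassified samples,
        # decrease weights of correct ones, then renormalize to a distribution
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # Final prediction: sign of the alpha-weighted vote of all stumps
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)

scikit-learn's AdaBoostClassifier handles arbitrary class labels and multi-class problems via the SAMME generalization, so you rarely need to code this loop yourself.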
AdaBoostClassifier and AdaBoostRegressor are scikit-learn's implementations of this algorithm.

Evaluating a random forest with out-of-bag (OOB) scores
# Load libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create random forest classifier object
randomforest = RandomForestClassifier(
random_state=0, n_estimators=1000, oob_score=True, n_jobs=-1)
# Train model
model = randomforest.fit(features, target)
# View the out-of-bag score
randomforest.oob_score_
Setting oob_score=True makes the random forest perform out-of-bag evaluation; after training, the resulting OOB score is available through randomforest.oob_score_.
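Because each tree is trained on a bootstrap sample, roughly one third of the observations are left out of any given tree, and OOB scoring uses those held-out observations as a built-in validation set. As a sanity check (my addition, not from the original text), the OOB score should be in the same ballpark as a cross-validated score; this reuses features and target from the block above:

from sklearn.model_selection import cross_val_score

# Compare the OOB estimate with 5-fold cross-validation
print(randomforest.oob_score_)
print(cross_val_score(
    RandomForestClassifier(random_state=0, n_estimators=1000, n_jobs=-1),
    features, target, cv=5).mean())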