The 【阿旭机器学习实战】 series introduces common machine learning algorithms and works through hands-on case studies.
In this article we build a decision tree model on a leaf classification dataset, evaluate its predictions, and then tune the model to reduce overfitting.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv('train.csv')
data.head()
 | id | species | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Acer_Opalus | 0.007812 | 0.023438 | 0.023438 | 0.003906 | 0.011719 | 0.009766 | 0.027344 | 0.0 | ... | 0.007812 | 0.000000 | 0.002930 | 0.002930 | 0.035156 | 0.0 | 0.0 | 0.004883 | 0.000000 | 0.025391 |
1 | 2 | Pterocarya_Stenoptera | 0.005859 | 0.000000 | 0.031250 | 0.015625 | 0.025391 | 0.001953 | 0.019531 | 0.0 | ... | 0.000977 | 0.000000 | 0.000000 | 0.000977 | 0.023438 | 0.0 | 0.0 | 0.000977 | 0.039062 | 0.022461 |
2 | 3 | Quercus_Hartwissiana | 0.005859 | 0.009766 | 0.019531 | 0.007812 | 0.003906 | 0.005859 | 0.068359 | 0.0 | ... | 0.154300 | 0.000000 | 0.005859 | 0.000977 | 0.007812 | 0.0 | 0.0 | 0.000000 | 0.020508 | 0.002930 |
3 | 5 | Tilia_Tomentosa | 0.000000 | 0.003906 | 0.023438 | 0.005859 | 0.021484 | 0.019531 | 0.023438 | 0.0 | ... | 0.000000 | 0.000977 | 0.000000 | 0.000000 | 0.020508 | 0.0 | 0.0 | 0.017578 | 0.000000 | 0.047852 |
4 | 6 | Quercus_Variabilis | 0.005859 | 0.003906 | 0.048828 | 0.009766 | 0.013672 | 0.015625 | 0.005859 | 0.0 | ... | 0.096680 | 0.000000 | 0.021484 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.031250 |
5 rows × 194 columns
Data description:
The species column is the class label. Each sample has 64 margin (edge) features, 64 shape features, and 64 texture features, i.e. 192 feature columns in total.
There are 99 leaf classes altogether.
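As a quick sanity check on this layout, you can count the feature columns by prefix; a minimal sketch, assuming data is the DataFrame loaded above:
# Sanity check: each feature group should contain 64 columns
for prefix in ['margin', 'shape', 'texture']:
    n = sum(col.startswith(prefix) for col in data.columns)
    print(prefix, n)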
data.shape
(990, 194)
# Number of distinct leaf classes
len(data.species.unique())
99
# Encode the string class names as integer labels
lb = LabelEncoder().fit(data.species)
labels = lb.transform(data.species)
# Drop the 'species' and 'id' columns, which are not useful as model inputs
data = data.drop(['species', 'id'], axis=1)
data.head()
 | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | margin9 | margin10 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.007812 | 0.023438 | 0.023438 | 0.003906 | 0.011719 | 0.009766 | 0.027344 | 0.0 | 0.001953 | 0.033203 | ... | 0.007812 | 0.000000 | 0.002930 | 0.002930 | 0.035156 | 0.0 | 0.0 | 0.004883 | 0.000000 | 0.025391 |
1 | 0.005859 | 0.000000 | 0.031250 | 0.015625 | 0.025391 | 0.001953 | 0.019531 | 0.0 | 0.000000 | 0.007812 | ... | 0.000977 | 0.000000 | 0.000000 | 0.000977 | 0.023438 | 0.0 | 0.0 | 0.000977 | 0.039062 | 0.022461 |
2 | 0.005859 | 0.009766 | 0.019531 | 0.007812 | 0.003906 | 0.005859 | 0.068359 | 0.0 | 0.000000 | 0.044922 | ... | 0.154300 | 0.000000 | 0.005859 | 0.000977 | 0.007812 | 0.0 | 0.0 | 0.000000 | 0.020508 | 0.002930 |
3 | 0.000000 | 0.003906 | 0.023438 | 0.005859 | 0.021484 | 0.019531 | 0.023438 | 0.0 | 0.013672 | 0.017578 | ... | 0.000000 | 0.000977 | 0.000000 | 0.000000 | 0.020508 | 0.0 | 0.0 | 0.017578 | 0.000000 | 0.047852 |
4 | 0.005859 | 0.003906 | 0.048828 | 0.009766 | 0.013672 | 0.015625 | 0.005859 | 0.0 | 0.000000 | 0.005859 | ... | 0.096680 | 0.000000 | 0.021484 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.031250 |
5 rows × 192 columns
labels[:5]
array([ 3, 49, 65, 94, 84], dtype=int64)
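If you later need to turn integer predictions back into species names, the fitted LabelEncoder can invert the mapping; for example:
# Map encoded labels back to the original species names
print(lb.inverse_transform(labels[:5]))
# expected: ['Acer_Opalus' 'Pterocarya_Stenoptera' 'Quercus_Hartwissiana' 'Tilia_Tomentosa' 'Quercus_Variabilis']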
# Split into training and test sets, stratified by class
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, stratify=labels)
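Here stratify=labels keeps the 99 classes equally represented in both splits, which matters because each class has only about 10 samples (990 rows / 99 classes). For reproducible results you can additionally fix the random seed; a hedged variant of the call above (the seed value 42 is arbitrary):
# Same stratified split, with a fixed seed so results can be reproduced
x_train, x_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, stratify=labels, random_state=42)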
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
tree.score(x_test, y_test)
0.6767676767676768
tree.score(x_train, y_train)
1.0
The model reaches 100% accuracy on the training set but only about 67.7% on the test set, a clear sign of overfitting; the model needs further tuning.
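One way to confirm that the single test score is not a fluke of the split is to cross-validate the unconstrained tree on the training data; a quick sketch (exact numbers will vary with the folds):
from sklearn.model_selection import cross_val_score
# 3-fold cross-validated accuracy of the default, fully grown tree
cv_scores = cross_val_score(DecisionTreeClassifier(), x_train, y_train, cv=3)
print(cv_scores.mean(), cv_scores.std())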
# max_depth: maximum depth of the tree
# min_samples_split: minimum number of samples required to split an internal node
# min_samples_leaf: minimum number of samples required at a leaf node
param_grid = {'max_depth': [10, 15, 20, 25, 30],
              'min_samples_split': [2, 3, 4, 5, 6, 7, 8],
              'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7]}
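This grid contains 5 × 7 × 7 = 245 parameter combinations, so the 3-fold search below trains 735 trees and then refits the best one on the full training set (GridSearchCV refits by default).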
# Grid search with 3-fold cross-validation
model = GridSearchCV(tree, param_grid, cv=3)
model.fit(x_train, y_train)
print(model.best_estimator_)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=30,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=4, min_samples_split=5,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
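GridSearchCV also exposes the chosen parameters and their mean cross-validated accuracy directly, which is easier to read than the full estimator repr:
# Best hyper-parameter combination and its mean cross-validation score
print(model.best_params_)
print(model.best_score_)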
model.score(x_train, y_train)
0.9444444444444444
model.score(x_test, y_test)
0.6868686868686869
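Compared with the unconstrained tree, the tuned model narrows the gap between training accuracy (94.4%) and test accuracy (68.7%), so overfitting is reduced, although the test accuracy itself improves only slightly over the 67.7% baseline.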