Decision Tree
Use the decision tree model from the sklearn library to perform multiclass classification on the iris dataset and evaluate the results.
Imports:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.datasets import load_breast_cancer
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import KFold
Load the data:
# Load the iris dataset
iris = load_iris()
# Assign the feature matrix to x and the class labels to y
x, y = iris.data, iris.target
print('x:', x.shape)
print(x[:10, :])
print('y:', y.shape)
print(y)
Output:
x: (150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
y: (150,)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
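The labels printed above are integer codes, and they are stored sorted by class. As a small sketch that is not part of the original walkthrough, the object returned by `load_iris` also exposes `feature_names` and `target_names`, which make the four columns and three label codes easier to read. The variable name `class_counts` is chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the four measurement columns and of the three classes (codes 0, 1, 2)
print(iris.feature_names)
print(list(iris.target_names))

# Number of samples carrying each label code
class_counts = np.bincount(iris.target)
print({str(name): int(n) for name, n in zip(iris.target_names, class_counts)})
```

Because the samples are ordered by class, any train/test split or fold assignment on this dataset should either shuffle or stratify.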
Use sklearn to shuffle the iris dataset and split it into training and test sets (7:3 ratio), classify the test set with a decision tree model, and then evaluate the precision and recall of the predictions:
from sklearn.model_selection import train_test_split
# Shuffle and split: 70% training, 30% test
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)
# The report covers only iris classes 1 (versicolor) and 0 (setosa); names follow the order of labels
target_names = ['versicolor', 'setosa']
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
# print(confusion_matrix(y_test, y_pred, labels=[1, 0]))  # confusion matrix
# Text report of the main classification metrics
print(classification_report(y_test, y_pred, target_names=target_names, labels=[1, 0]))
Output:
              precision    recall  f1-score   support
  versicolor       1.00      0.94      0.97        18
      setosa       1.00      1.00      1.00        16
 avg / total       1.00      0.97      0.98        34
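The report above is restricted to labels 1 and 0, so the third iris class (label 2, virginica) does not appear in it, and the confusion matrix call is commented out. A minimal sketch of the full three-class evaluation on the same split might look like the following. It repeats the split and model settings from the text; no expected output is shown because the report layout differs between sklearn versions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Rows are true classes and columns are predicted classes, in label order 0, 1, 2
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Per-class precision, recall and F1 for all three species
report = classification_report(y_test, y_pred, target_names=list(iris.target_names))
print(report)
```

Off-diagonal entries in `cm` show which species are confused with each other, which a two-class report cannot reveal.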
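Run 10-fold cross-validation on the iris dataset with sklearn's decision tree model, report the accuracy of each fold, and compute the mean accuracy:
from sklearn.model_selection import cross_val_score
import numpy as np
# No random_state is set on the tree, so the fold scores can vary slightly between runs
d_scores = cross_val_score(DecisionTreeClassifier(), iris.data, iris.target, cv=10)
for i, score in enumerate(d_scores, start=1):
    print(f"Fold {i} accuracy: {score:.4f}")
print("Mean accuracy:", np.average(d_scores))
Output:
Fold 1 accuracy: 1.0000
Fold 2 accuracy: 0.9333
Fold 3 accuracy: 1.0000
Fold 4 accuracy: 0.9333
Fold 5 accuracy: 0.9333
Fold 6 accuracy: 0.8667
Fold 7 accuracy: 0.9333
Fold 8 accuracy: 1.0000
Fold 9 accuracy: 1.0000
Fold 10 accuracy: 1.0000
Mean accuracy: 0.96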
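When `cross_val_score` receives an integer `cv` and a classifier, sklearn splits the data with an unshuffled `StratifiedKFold`, so every fold keeps the class proportions of the full dataset. The sketch below makes that splitter explicit and checks the fold sizes. The `random_state=0` on the tree is an assumption added here so that repeated runs give the same scores; it is not in the original call.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Explicit form of what cv=10 means for a classifier
cv = StratifiedKFold(n_splits=10)

# Each test fold holds 15 samples: 5 from each of the three classes
fold_sizes = [len(test_idx) for _, test_idx in cv.split(iris.data, iris.target)]
print(fold_sizes)

# random_state=0 is an assumption for repeatability, not part of the original code
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), iris.data, iris.target, cv=cv
)
print(np.round(scores, 4), scores.mean())
```

Passing a splitter object instead of an integer is also the way to request shuffling or a fixed seed for the folds.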
Change the decision tree parameters (such as criterion, max_depth and splitter) and evaluate the mean accuracy under 10-fold cross-validation for at least four different parameter combinations:
# Shared 10-fold splitter; random_state=1 is equivalent to the original random_state=True
kf = KFold(n_splits=10, shuffle=True, random_state=1)
x, y = iris.data, iris.target

def mean_cv_accuracy(model):
    # Fit on each training fold and average the accuracy over the 10 test folds
    total = 0
    for train_index, test_index in kf.split(x):
        model.fit(x[train_index], y[train_index])
        pred = model.predict(x[test_index])
        total += accuracy_score(y[test_index], pred)
    return total / 10

# Four parameter combinations to compare
param_sets = [
    dict(criterion='entropy', max_depth=30, splitter='best'),
    dict(criterion='gini', max_depth=20, splitter='random'),
    dict(criterion='entropy', max_depth=20, splitter='random'),
    dict(criterion='entropy', max_depth=30, splitter='random'),
]
# The trees have no random_state, so the scores can differ between runs
for run, params in enumerate(param_sets, start=1):
    print(f"Run {run} mean accuracy:", mean_cv_accuracy(DecisionTreeClassifier(**params)))
Output:
Run 1 mean accuracy: 0.9600000000000002
Run 2 mean accuracy: 0.9200000000000002
Run 3 mean accuracy: 0.9466666666666669
Run 4 mean accuracy: 0.9466666666666669
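The four runs above compare hand-picked settings one at a time. As an alternative sketch, and not the approach used in this walkthrough, sklearn's `GridSearchCV` can evaluate every combination in a parameter grid with the same kind of shuffled 10-fold split. The grid values and both `random_state` settings below are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hypothetical grid over the same three parameters varied in the text
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [20, 30],
    "splitter": ["best", "random"],
}

# random_state values are assumptions added so the search is repeatable
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=KFold(n_splits=10, shuffle=True, random_state=1),
    scoring="accuracy",
)
search.fit(iris.data, iris.target)

# Best combination found and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```

On a dataset as small as iris, differences of one or two percentage points between settings correspond to only a handful of samples, so the ranking of combinations should be read with caution.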