* Import the decision tree model and the iris dataset from sklearn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris # dataset
from sklearn import tree # decision tree
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,test_size = 0.20,random_state = 20)
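X_train: training-set features produced by the split (return value)
X_test: test-set features produced by the split (return value)
y_train: training-set labels produced by the split (return value)
y_test: test-set labels produced by the split (return value)
test_size: a float between 0 and 1 is the fraction of the original samples assigned to the test set; an integer is the absolute number of test samples.
test_size = 0.20 means the test set takes 20% of the samples.
random_state fixes the shuffling seed so that every run of the program produces the same training and test sets.
Otherwise the split changes from run to run, and the same model scores differently on different training and test sets.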
https://www.jianshu.com/p/4deb2cb2502f
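The effect of random_state and the two forms of test_size can be checked directly. The sketch below is only an illustration: the names split_a, split_b, same_split, and X_test_int are introduced here and do not appear in the original post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Split twice with the same random_state; the four returned arrays should match.
split_a = train_test_split(iris.data, iris.target, test_size=0.20, random_state=20)
split_b = train_test_split(iris.data, iris.target, test_size=0.20, random_state=20)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)

# test_size as a float: 20% of 150 samples go to the test set.
X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# test_size as an integer: exactly 30 samples go to the test set.
_, X_test_int, _, _ = train_test_split(iris.data, iris.target, test_size=30, random_state=20)
print(X_test_int.shape)
```

Dropping random_state (or changing its value) generally yields a different split, which is why scores can vary between runs.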
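Tip:
iris is a dictionary-like Bunch object holding the data, labels, label names, dataset description, and more. Each value can be looked up by its key.
# list all keys of the iris object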
dir(iris)
['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']
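Because the object behaves like a dictionary, each entry can be read either by key or as an attribute. A small sketch, assuming a default load_iris() call; the variable names keys, by_key, by_attr, and same_values are illustrative, and the exact set of keys can differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# keys() lists the available entries, much like dir(iris) does.
keys = list(iris.keys())
print(keys)

# Key access and attribute access return the same feature matrix.
by_key = iris['data']
by_attr = iris.data
same_values = np.array_equal(by_key, by_attr)
print(same_values)
print(by_key.shape)
```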
1. DESCR
# description of the iris dataset
print(iris.DESCR)
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
2. data
150 samples, each with four features; every feature is a continuous numeric value
iris.data
array([[5.1, 3.5, 1.4, 0.2],
[6.5, 3. , 5.2, 2. ],
.......
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])
3. feature_names
Column names of the four features: sepal length, sepal width, petal length, and petal width
iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
4. filename
iris.filename
'D:\\anaconda\\lib\\site-packages\\sklearn\\datasets\\data\\iris.csv'
5. target
Labels 0, 1, and 2 correspond to the three iris species
iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
6. target_names
'setosa', 'versicolor', and 'virginica' are the names of the three iris species
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='
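The pieces above (data, feature_names, target, target_names) can be combined into one table to make the label-to-name mapping visible. This is only a sketch: the DataFrame df, the column name 'species', and the variable counts are illustrative choices, not part of the original post.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# One row per flower, one column per feature.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Index target_names with the integer labels to turn 0/1/2 into species names.
df['species'] = iris.target_names[iris.target]
print(df.head())

# Each species should appear 50 times, as stated in DESCR.
counts = df['species'].value_counts()
print(counts)
```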
clf = tree.DecisionTreeClassifier() # decision tree classifier
# criterion defaults to 'gini'
clf = clf.fit(X_train,y_train)
plt.figure(dpi=200)
# feature_names=iris.feature_names sets the feature names shown in the plotted tree
tree.plot_tree(clf,feature_names=iris.feature_names,class_names=iris.target_names)
In the plotted tree: petal length <= 2.6 --> class = setosa
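The same rules can also be read as plain text, which is handy when the figure is hard to read. A minimal sketch using sklearn.tree.export_text; random_state=0 on the classifier is an assumption added here for repeatability (the original classifier is unseeded), so the thresholds printed may differ from the ones in the plot.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=20)

# random_state=0 is an assumption made for this sketch only.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Each indented line is one split; leaf lines show the predicted class.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```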
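Suppose there is a new iris flower whose four features are 6, 5, 5, and 2. Which species does the trained decision tree assign it to?
1.
print('Predicted class for sample [6 5 5 2]:',clf.predict([[6,5,5,2]]))
Predicted class for sample [6 5 5 2]: [2]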
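The predicted label is an integer, so it helps to map it back to a species name and to look at the class probabilities. The sketch below retrains the tree so that it runs on its own; random_state=0 and the names sample, pred_name, and proba are assumptions introduced for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=20)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

sample = [[6, 5, 5, 2]]  # one sample, four features

# predict returns an array of integer labels, one per sample.
pred = clf.predict(sample)
pred_name = iris.target_names[pred[0]]
print("label:", pred[0], "species:", pred_name)

# predict_proba returns one row per sample and one column per class.
proba = clf.predict_proba(sample)
print(proba)
```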
2.
print('Test-set labels:\n',y_test)
[0 1 1 2 1 1 2 0 2 0 2 1 2 0 0 2 0 1 2 1 1 2 2 0 1 1 1 0 2 2]
3.
print('Model accuracy:','{0:.3f}'.format(clf.score(X_test,y_test)))
Model accuracy: 0.933
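For a classifier, score is the fraction of test samples predicted correctly. The sketch below recomputes it by hand and adds a confusion matrix to show which classes get mixed up. The classifier seed random_state=0 and the names y_pred, acc_manual, acc_metric, acc_score, and cm are assumptions made here, so the printed accuracy need not equal 0.933.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=20)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

y_pred = clf.predict(X_test)

# Three equivalent ways to get the accuracy.
acc_manual = (y_pred == y_test).mean()
acc_metric = accuracy_score(y_test, y_pred)
acc_score = clf.score(X_test, y_test)
print('{0:.3f} {1:.3f} {2:.3f}'.format(acc_manual, acc_metric, acc_score))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
```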
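Takeaway: predict expects a 2D array of shape (n_samples, n_features), so a single sample has to be reshaped from (4,) to (1, 4).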
import numpy as np
a1 = np.array([[6,5,5,2],[1,2,3,4]]) # (2, 4)
print(a1)
a2 = np.array([6,5,5,2])
print(a2) # [6 5 5 2]
a2.shape # (4,)
a3 = a2.reshape(1,4)
print(a3.shape) # (1, 4)
print(a3) # [[6 5 5 2]]
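To see why the shape matters, the sketch below passes the 1D array to predict, which scikit-learn rejects with a ValueError, and then the reshaped 2D array, which works. The tree is trained on the full dataset with random_state=0 purely to keep the example short; both choices, along with the names raised and pred, are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

a2 = np.array([6, 5, 5, 2])  # shape (4,): a single 1D sample

# A 1D array is ambiguous (4 samples or 4 features?), so predict refuses it.
raised = False
try:
    clf.predict(a2)
except ValueError as err:
    raised = True
    print("1D input rejected:", str(err).splitlines()[0])

# reshape(1, -1) means one row, with the column count inferred.
a3 = a2.reshape(1, -1)
pred = clf.predict(a3)
print(a3.shape, pred)
```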
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris # dataset
from sklearn import tree # decision tree
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,test_size = 0.20,random_state = 20)
clf = tree.DecisionTreeClassifier() # decision tree classifier
clf = clf.fit(X_train,y_train)
plt.figure(dpi=200)
tree.plot_tree(clf,feature_names=iris.feature_names,class_names=iris.target_names)
print('Predicted class for sample [6 5 5 2]:',clf.predict([[6,5,5,2]]))