Key points
sklearn's bundled datasets live in the `datasets` module; the loaders for the built-in datasets all start with `load_`.
The loaded iris dataset can be used like a dictionary.
Its keys are ['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'].
from sklearn.datasets import load_iris
# 1. Load the iris dataset with load_iris
iris = load_iris()
# View the available keys
iris.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
iris['filename']
'D:\\Anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\iris.csv'
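Besides dictionary-style access on the returned Bunch, `load_iris` also accepts the convenience flag `return_X_y=True`, which returns the data and target arrays directly. A minimal sketch:

```python
from sklearn.datasets import load_iris

# return_X_y=True skips the Bunch object and returns (data, target) directly
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```

This is handy when you only need the arrays and not the metadata (DESCR, feature_names, and so on).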
"feature_names": the name of each feature column in data
"target_names": the names of the classification targets
# View the column names (feature names) of the data
iris['feature_names']
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
# DataFrame
import pandas as pd
pd.DataFrame(data=iris['data'],columns=iris['feature_names'])
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |
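The DataFrame above holds only the features. For readability it can help to attach the target as a column of class names; a sketch (the `species` column name is our own choice, not part of the dataset):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
# map each integer label in target to its class name via target_names
df['species'] = [iris['target_names'][i] for i in iris['target']]
print(df['species'].value_counts())  # 50 samples per class
```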
The data for each key can be retrieved directly through dictionary access.
Both data and target are numpy ndarray objects, so their size can be read from the shape attribute.
# Names of the classification targets
iris['target_names']
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
# Get data and target, and print the shape of each
data = iris['data']
print(type(data),data.shape)
target = iris['target']
print(type(target),target.shape)
<class 'numpy.ndarray'> (150, 4)
<class 'numpy.ndarray'> (150,)
Use train_test_split from the model_selection module to split the dataset into a training set and a test set.
from sklearn.model_selection import train_test_split
'''
First argument: the dataset (features)
Second argument: the targets
Third argument (test_size): the proportion held out as the test set
'''
data_train,data_test,target_train,target_test = \
train_test_split(data,target,test_size=0.3)
data_train.shape
(105, 4)
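Note that the split above is random, so the scores below will vary from run to run. A sketch using two optional keyword arguments (not used in the original notes): `random_state` makes the split reproducible, and `stratify` preserves the class proportions in both halves.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# random_state fixes the shuffle; stratify=y keeps the 50/50/50 class
# balance, so each class contributes exactly 15 of the 45 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```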
Model: LogisticRegression from linear_model
Steps:
- import linear_model.LogisticRegression
- instantiate the model: LogisticRegression()
- train: fit()
- evaluate the score: score()
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000) # set the maximum number of iterations
model.fit(data_train,target_train) # train the model
LogisticRegression(max_iter=1000)
# Score on the training set
model.score(data_train,target_train)
0.9619047619047619
# Score on the test set
model.score(data_test,target_test)
0.9555555555555556
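A single train/test score depends on which rows happened to land in the test set. A less split-dependent estimate can be sketched with `cross_val_score` (a technique not used in the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
# 5-fold cross-validation: fit and score on 5 different train/test splits
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```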
Prediction: use the model's predict method to make predictions.
target_predict = model.predict(data_test)
import pandas as pd
df = pd.DataFrame(target_predict,columns=["预测结果"])
df['实际结果'] = target_test
df.shape #(45, 2)
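`predict` returns only the winning class. To see how confident the model is, `predict_proba` returns one probability per class, and `predict` is simply its argmax. A self-contained sketch (the `random_state=0` split is arbitrary, chosen just to make the example reproducible):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# one row per sample, one column per class; each row sums to 1
proba = model.predict_proba(X_test[:3])
print(proba.shape)  # (3, 3)
```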
Measuring the quality of the predictions
Use metrics.confusion_matrix
from sklearn.metrics import confusion_matrix
# Print the confusion matrix
confusion_matrix(target_test,target_predict)
array([[13, 0, 0],
[ 0, 14, 1],
[ 0, 1, 16]], dtype=int64)
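In a confusion matrix, the diagonal counts correct predictions and everything off the diagonal is an error, so the total number of misclassifications is the sum minus the trace. A sketch with toy labels (hypothetical, not the actual split from these notes):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 2, 1, 2, 0])
y_pred = np.array([0, 2, 2, 1, 1, 0])
cm = confusion_matrix(y_true, y_pred)
# off-diagonal entries count misclassified samples
errors = cm.sum() - np.trace(cm)
print(errors)  # 2
```

Applied to the matrix above, the same arithmetic gives 45 - 43 = 2 misclassified test samples, matching the test score of about 0.956.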
# Inspect rows by actual class (here class 0, setosa; all were predicted correctly)
df.loc[df['实际结果']==0]
| | 预测结果 | 实际结果 |
|---|---|---|
| 0 | 0 | 0 |
| 2 | 0 | 0 |
| 11 | 0 | 0 |
| 13 | 0 | 0 |
| 18 | 0 | 0 |
| 20 | 0 | 0 |
| 22 | 0 | 0 |
| 25 | 0 | 0 |
| 31 | 0 | 0 |
| 33 | 0 | 0 |
| 39 | 0 | 0 |
| 40 | 0 | 0 |
| 44 | 0 | 0 |
from sklearn.metrics import classification_report
# Print the classification report
print(classification_report(target_test,target_predict,
target_names=iris['target_names']))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        13
  versicolor       0.93      0.93      0.93        15
   virginica       0.94      0.94      0.94        17

    accuracy                           0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45
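The steps covered in these notes can be condensed into one end-to-end script. A sketch (`random_state=42` and `stratify` are additions for reproducibility, not part of the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# load -> split -> fit -> report, in one pass
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris['data'], iris['target'], test_size=0.3,
    random_state=42, stratify=iris['target'])
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te),
                            target_names=iris['target_names']))
```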