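Introduction:
Articles in the machine learning category take "Machine Learning" (the "watermelon book") by Professor Zhou Zhihua of Nanjing University as their theoretical guide, so they do not repeat the theory. "Introduction to Machine Learning with Python" serves as the practical guide. Both books are classic machine learning texts.
These articles focus on learning well-known machine learning tools by implementing them in code. Not everyone can become a theory expert, so keep a realistic view of your own abilities as you study machine learning.
scikit-learn is a very popular tool and the best-known Python machine learning library.
Besides NumPy and SciPy, we will also use pandas and matplotlib.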
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
"""
Step 1: load the data
"""
iris = datasets.load_iris()
# Print the keys of the iris dataset
print("key for iris:\n", iris.keys())
# Print the first five rows of the data
print("data[:5] for iris:\n", iris['data'][:5])
# Print the feature names
print("feature name:\n", iris['feature_names'])
# Print the shape of the target array
print("target shape:\n", iris['target'].shape)
# Print the names of the target classes
print("target names:\n", iris['target_names'])
"""
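Step 2: training data and test data
One part of the data is used to build the machine learning model and is called the training set
The rest of the data is used to evaluate model performance and is called the test set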
"""
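# train_test_split shuffles the data and, by default, holds out 25% of the samples as the test set;
# random_state=0 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=0)
# Print the shapes of X_train and X_test
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
# Visualize the data (uncomment the lines below to draw a scatter matrix of the training set)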
# iris_dataframe = pd.DataFrame(X_train, columns=iris.feature_names)
# grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
# hist_kwds={'bins': 20}, s=60, alpha=.8)
# plt.show()
"""
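Step 3: k-nearest neighbors algorithm
To make a prediction for a new data point, the algorithm finds the point in the training set
that is closest to the new point, then assigns the label of that training point to the new data point
"""
# k-nearest neighbors classifier with the number of neighbors set to 1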
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
# Predict; the input must be a 2-D array of shape (n_samples, n_features)
X_new = np.array([[5, 2.9, 1, 0.2]])
prediction = knn.predict(X_new)
print("Prediction :", prediction)
print("Prediction target name: {}".format(iris['target_names'][prediction]))
"""
Step 4: evaluate the model
Model quality is measured by computing the accuracy, which is the fraction of flowers
whose species was predicted correctly
"""
y_pred = knn.predict(X_test)
print("Test set prediction:", y_pred)
print("Test set score:{:.2f}%".format(np.mean(y_pred == y_test) * 100))
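Step 4 computes accuracy by hand with np.mean(y_pred == y_test). As a cross-check, the sketch below repeats the same split and model, then compares that value with two scikit-learn helpers, the estimator's score method and sklearn.metrics.accuracy_score. The variable names manual_acc, score_acc and metric_acc are made up for this illustration and are not part of the original script.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data, split and model as in the script above
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Three ways to get the fraction of correctly classified test samples
manual_acc = np.mean(y_pred == y_test)       # hand-written, as in step 4
score_acc = knn.score(X_test, y_test)        # classifier's built-in mean accuracy
metric_acc = accuracy_score(y_test, y_pred)  # metrics helper (true labels first)

print("manual: {:.4f}, score: {:.4f}, accuracy_score: {:.4f}".format(
    manual_acc, score_acc, metric_acc))
```

All three numbers should agree, so in everyday use knn.score(X_test, y_test) is probably the shortest way to get the same figure.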
Data visualization (scatter matrix of the training data):
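The plotting lines in step 2 are commented out in the script. Here is a runnable sketch of the same scatter matrix. Two things are my own assumptions and not from the original: the non-interactive "Agg" backend (so it runs without a display) and the output file name iris_scatter_matrix.png in the system temp directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0
)

# Put the training features in a DataFrame so the columns carry the feature names
iris_dataframe = pd.DataFrame(X_train, columns=iris.feature_names)

# One scatter plot per feature pair, colored by class label; histograms on the diagonal
axes = pd.plotting.scatter_matrix(
    iris_dataframe, c=y_train, figsize=(15, 15), marker="o",
    hist_kwds={"bins": 20}, s=60, alpha=0.8,
)

# Hypothetical output location, chosen only for this example
out_path = os.path.join(tempfile.gettempdir(), "iris_scatter_matrix.png")
plt.savefig(out_path)
plt.close("all")
print("saved scatter matrix to", out_path)
```

To view the figure interactively, drop the matplotlib.use("Agg") line and call plt.show() in place of plt.savefig, as the commented-out code in the script does.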
Run output:
D:\python\python.exe "D:/PythonProject/MLstudy/sklearn/supervised learning/chapter_1/iris_test.py"
key for iris:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
data[:5] for iris:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
feature name:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
target shape:
(150,)
target names:
['setosa' 'versicolor' 'virginica']
X_train: (112, 4)
X_test: (38, 4)
Prediction : [0]
Prediction target name: ['setosa']
Test set prediction: [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
2]
Test set score:97.37%
Process finished with exit code 0
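Step 3 describes the nearest-neighbor rule in words: find the closest training point and copy its label. To make that concrete, here is a minimal NumPy sketch of the rule for one neighbor, assuming plain Euclidean distance. The helper predict_1nn is a hypothetical name written for this illustration. It is not how scikit-learn implements KNeighborsClassifier internally, and tie-breaking between equally distant points may differ.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def predict_1nn(X_train, y_train, X_query):
    """Hypothetical helper: label each query row with the label of its closest training row."""
    # Pairwise differences, shape (n_query, n_train, n_features)
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Squared Euclidean distances, shape (n_query, n_train)
    dists = np.sum(diffs ** 2, axis=2)
    # Index of the closest training point for every query point
    nearest = np.argmin(dists, axis=1)
    return y_train[nearest]


iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0
)

# Same new sample as in the script; the input is 2-D: one row, four features
X_new = np.array([[5, 2.9, 1, 0.2]])
new_label = predict_1nn(X_train, y_train, X_new)
print("Hand-written 1-NN prediction:", iris["target_names"][new_label])

# Compare the hand-written rule with scikit-learn on the test set
manual_pred = predict_1nn(X_train, y_train, X_test)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
agreement = np.mean(manual_pred == knn.predict(X_test))
manual_accuracy = np.mean(manual_pred == y_test)
print("Agreement with KNeighborsClassifier: {:.2f}".format(agreement))
print("Hand-written 1-NN test accuracy: {:.2f}".format(manual_accuracy))
```

Squared distances are enough here because taking the square root does not change which point is closest. The broadcasting approach builds an array of shape (n_query, n_train, n_features), which is fine for a dataset as small as iris but would use too much memory for large ones.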