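Machine learning has taken off, and suddenly a lot of people are piling in. I am a first-year master's student and this is not my field, but for various reasons I have ended up here too.
I am a complete beginner teaching myself from scratch. After two months of stumbling around and picking things up piecemeal, I have settled on an initial study path and started this blog so we can learn together.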
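This blog series draws mainly on three books: Python for Data Analysis (《利用Python进行数据分析》), Learning Data Mining with Python (《Python数据挖掘入门与实践》), and Machine Learning (《机器学习》) by Zhou Zhihua. The latter two form the main thread of study.
The first book is a reference for filling in background on Python, Pandas, and related tools;
the second is the hands-on book, mainly for practicing how to use and tune algorithms with scikit-learn;
the third is the theory book, read alongside the second to deepen understanding of the algorithms.
Of course, the fundamentals still require linear algebra, probability theory, and so on. I tried going through the math first, but theory that has not been put into practice is forgotten the moment you turn around, so I will weave in the math basics while reading these three books.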
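We start with the simple k-nearest neighbors algorithm to get familiar with the most basic scikit-learn workflow. The explanations are in the code comments.
To make the code easier to follow, I have placed each import statement directly above the first line that uses the library.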
import numpy as np
import os
# path to the dataset; the raw string keeps the backslashes literal ("\U" is an invalid escape in Python 3)
data_filename = os.path.join(r"C:\Users\Han Chunhui", "Ionosphere", "ionosphere.data")
x = np.zeros((351, 34), dtype='float')  # feature matrix: 351 rows (samples) and 34 columns (features)
y = np.zeros((351,), dtype='bool')  # label vector: one boolean per sample
import csv
with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        data = [float(datum) for datum in row[:-1]]  # every column except the last is a feature
        x[i] = data
        y[i] = row[-1] == 'g'  # the last column is the label; 'g' (good) becomes True
# the original imported from sklearn.cross_validation, which was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=14)  # split into a training set and a test set
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()  # estimator object with default parameters
estimator.fit(x_train, y_train)  # fit() trains the estimator on the training set
y_predicted = estimator.predict(x_test)  # predict() returns predicted labels for the test set
accuracy = np.mean(y_test == y_predicted) * 100  # percentage of predictions that match the true labels
print("The accuracy is {0:.1f}%".format(accuracy))  # the original run printed "The accuracy is 86.4%"
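For intuition about what fit and predict are doing in the k-nearest neighbors case, here is a minimal NumPy sketch of the voting idea. It assumes Euclidean distance and an unweighted majority vote among k=5 neighbors, which matches the scikit-learn defaults as far as I know. The function name knn_predict and the toy data are made up for illustration; this is not how the library is implemented internally.

```python
import numpy as np
from collections import Counter


def knn_predict(x_train, y_train, x_query, k=5):
    """Label each row of x_query by majority vote of its k nearest training rows."""
    predictions = []
    for point in x_query:
        # Euclidean distance from this point to every training sample
        distances = np.sqrt(((x_train - point) ** 2).sum(axis=1))
        # indices of the k closest training samples
        nearest = np.argsort(distances)[:k]
        # most common label among those neighbors
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# toy data (hypothetical): two well separated clusters
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([False, False, False, True, True, True])
toy_pred = knn_predict(toy_x, toy_y, np.array([[0.05, 0.1], [5.0, 5.1]]), k=3)
print(toy_pred)
```

"Training" here is nothing more than keeping the training data around; all the work happens at prediction time, which is why k-nearest neighbors is a convenient first algorithm for learning the fit/predict pattern.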
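Next, we use cross-validation to test the algorithm's performance. Put simply, cross-validation splits the same dataset several times into different training and test sets.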
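# the original imported from sklearn.cross_validation, which was removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator, x, y, scoring='accuracy')  # cross-validation; the default number of folds depends on the scikit-learn version
average_accuracy = np.mean(scores) * 100  # mean accuracy over all folds, as a percentage
print("The average accuracy is {0:.1f}%".format(average_accuracy))  # the original run printed "The average accuracy is 82.3%"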
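To make "splitting the same dataset several times" concrete, here is a small sketch of k-fold cross-validation written by hand in NumPy. It is an illustration under assumptions, not a reimplementation of cross_val_score: the helpers k_fold_indices and one_nn_predict, the seed, and the two-blob data are all hypothetical, and this sketch shuffles the samples, whereas cross_val_score by default uses stratified, unshuffled folds for classifiers.

```python
import numpy as np


def k_fold_indices(n_samples, n_folds=3, seed=14):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx


def one_nn_predict(x_train, y_train, x_test):
    """Give each test row the label of its single nearest training row."""
    # pairwise squared distances, shape (n_test, n_train)
    d = ((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]


# made-up data: two Gaussian blobs with 30 samples each
rng = np.random.RandomState(0)
demo_x = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
demo_y = np.array([False] * 30 + [True] * 30)

fold_scores = []
for train_idx, test_idx in k_fold_indices(len(demo_x)):
    pred = one_nn_predict(demo_x[train_idx], demo_y[train_idx], demo_x[test_idx])
    fold_scores.append(np.mean(pred == demo_y[test_idx]))
print("fold accuracies: {0}".format(fold_scores))
print("mean accuracy: {0:.3f}".format(np.mean(fold_scores)))
```

Averaging over folds gives a steadier estimate than a single train/test split, because every sample is used for testing exactly once.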
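The code above used the default parameters (that is, the default number of neighbors in k-nearest neighbors). Next, we adjust the parameter by hand to see how different values perform.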
avg_scores = []  # mean cross-validation accuracy for each parameter value
all_scores = []  # per-fold accuracies for each parameter value
parameter_values = list(range(1, 21))  # try n_neighbors from 1 to 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)  # new estimator with this number of neighbors
    scores = cross_val_score(estimator, x, y, scoring='accuracy')  # cross-validation
    avg_scores.append(np.mean(scores))  # store the mean accuracy
    all_scores.append(scores)  # store the per-fold accuracies
from matplotlib import pyplot as plt
plt.plot(parameter_values, avg_scores, '-o')  # plot mean accuracy against n_neighbors
plt.show()
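The manual loop over n_neighbors can also be handed to scikit-learn's GridSearchCV, which runs the same kind of search and keeps the best setting. The sketch below is only an illustration: it uses synthetic data from make_classification as a stand-in (an assumption, so it runs without the Ionosphere file), and cv=3 is chosen arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data (hypothetical; not the Ionosphere dataset)
demo_x, demo_y = make_classification(n_samples=200, n_features=10, random_state=14)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 21))},  # same range as the manual loop
    scoring='accuracy',
    cv=3,  # number of folds, picked for the example
)
search.fit(demo_x, demo_y)  # cross-validates every candidate value
print("best parameters: {0}".format(search.best_params_))
print("best mean accuracy: {0:.3f}".format(search.best_score_))
```

The per-candidate mean scores are available in search.cv_results_['mean_test_score'], which plays the same role as avg_scores in the loop version.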
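Real datasets are usually not tidy and need a series of preprocessing steps, the most basic being normalization. Preprocessing steps like these often form a fixed sequence. For convenience, and to avoid getting the order wrong, we can wrap the steps in a "pipeline", just as a function wraps (stands for) a series of operations. A slightly upgraded scikit-learn workflow looks like this: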
X_broken = np.array(x)  # copy the data so that x itself stays unchanged
X_broken[:, ::2] /= 10  # divide every second feature (column) by 10 to simulate badly scaled data
from sklearn.preprocessing import MinMaxScaler  # rescales each feature to the range [0, 1]
from sklearn.pipeline import Pipeline
scaling_pipeline = Pipeline([('scale', MinMaxScaler()), ('predict', KNeighborsClassifier())])  # pipeline: normalization step followed by the classifier
scores = cross_val_score(scaling_pipeline, X_broken, y, scoring='accuracy')  # cross-validate the whole pipeline
print("The pipeline scored an average accuracy of {0:.1f}%".format(np.mean(scores) * 100))  # the original run reported 82.3%
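To show what the pipeline wraps, the sketch below performs the two steps by hand and checks that the predictions match the pipeline's. It assumes synthetic stand-in data from make_classification rather than the Ionosphere file, and the variable names are hypothetical. The point to notice is that the scaler is fitted on the training data only and then reused on the test data, which is the ordering a pipeline takes care of for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in data (hypothetical), with every second feature shrunk by 10
demo_x, demo_y = make_classification(n_samples=200, n_features=10, random_state=14)
demo_x[:, ::2] /= 10
x_tr, x_te, y_tr, y_te = train_test_split(demo_x, demo_y, random_state=14)

# manual version: fit the scaler on the training set, then reuse it on the test set
scaler = MinMaxScaler()
x_tr_scaled = scaler.fit_transform(x_tr)
x_te_scaled = scaler.transform(x_te)
manual_pred = KNeighborsClassifier().fit(x_tr_scaled, y_tr).predict(x_te_scaled)

# pipeline version: the same two steps behind a single fit/predict interface
pipe = Pipeline([('scale', MinMaxScaler()), ('predict', KNeighborsClassifier())])
pipe_pred = pipe.fit(x_tr, y_tr).predict(x_te)

print("same predictions: {0}".format(np.array_equal(manual_pred, pipe_pred)))
```

Because the pipeline behaves like a single estimator, it can be passed straight to cross_val_score, and the scaler is then refitted inside each fold instead of seeing the test portion in advance.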
The code comes from Learning Data Mining with Python (《Python数据挖掘入门与实践》) and has been verified. Environment: PyCharm, Python 2.7.