I took a machine learning course, and for the hands-on work the instructor recommended Python and the scikit-learn library. scikit-learn comes with thorough documentation and a rich collection of machine learning algorithms; the official docs explain every algorithm and include worked examples (practically on par with the lecture slides).
So I looked into the library to learn how to use it.
The first step is, naturally, loading data. Datasets can be downloaded from the UCI Machine Learning Repository, a public collection of machine learning datasets contributed by various universities, organizations, labs, and databases. The datasets are all small, which makes them handy for practicing ML algorithms.
Python is powerful here: we can fetch the data straight from the website with urllib and then parse it with NumPy functions.
(The download below is the classic iris dataset: 150 samples in 3 classes, 50 per class, each with 4 attributes and 1 class label.)
import numpy as np
from urllib import request
# UCI dataset url
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
raw_data = request.urlopen(url)
x = np.loadtxt(raw_data, delimiter=",", usecols=(0,1,2,3))
raw_data = request.urlopen(url)
y = np.loadtxt(raw_data, delimiter=",", usecols=(4,), dtype=str)
Note: raw_data is the body of the HTTP response and can only be read once, which is why loading y requires a second request. To download only once, save the data to a local file first and load from there:
# UCI dataset url
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
raw_data = request.urlopen(url)
page = raw_data.read().decode('utf-8')
with open('iris.csv', 'w') as local_file:
    local_file.write(page)
x = np.loadtxt('iris.csv', delimiter=',', usecols=(0,1,2,3))
y = np.loadtxt('iris.csv', delimiter=',', usecols=(4,), dtype=str)
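Alternatively, here is a minimal sketch that avoids the temporary file by buffering the downloaded bytes in memory (io.BytesIO from the standard library) and rewinding the buffer between the two loadtxt calls:
import io
import numpy as np
from urllib import request
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# read the response once and keep the bytes in a seekable buffer
buf = io.BytesIO(request.urlopen(url).read())
x = np.loadtxt(buf, delimiter=",", usecols=(0,1,2,3))
buf.seek(0)  # rewind so the same bytes can be parsed a second time
y = np.loadtxt(buf, delimiter=",", usecols=(4,), dtype=str)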
scikit-learn also bundles a loader for the classic iris dataset. It is worth a look, though it is wrapped so conveniently that it has no generality beyond the built-in datasets:
from sklearn.datasets import load_iris
iris = load_iris()
x, y = iris.data, iris.target
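Note that iris.target comes back already integer-encoded (0, 1, 2), while the loadtxt approach above leaves y as strings. As a sketch, the string labels can be converted with LabelEncoder (assuming y was loaded as above):
from sklearn.preprocessing import LabelEncoder
# map string labels such as 'Iris-setosa' to integers 0..2
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print(encoder.classes_)  # the original label behind each integer code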
Most gradient-based algorithms are sensitive to the scale of the data, so before running an algorithm it pays to normalize the dataset (normalization: rescale each individual sample to unit norm) or standardize it (standardization: shift and scale every feature to zero mean and unit variance).
from sklearn import preprocessing
# normalize the data attributes
normalized_x = preprocessing.normalize(x)
# standardize the data attributes
standardized_x = preprocessing.scale(x)
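A quick sanity check (just a sketch) confirms what each transform did: every row of normalized_x now has unit L2 norm, and every column of standardized_x has roughly zero mean and unit variance:
# each sample (row) should have L2 norm 1
print(np.linalg.norm(normalized_x, axis=1)[:5])
# each feature (column) should have mean ~0 and std ~1
print(standardized_x.mean(axis=0))
print(standardized_x.std(axis=0))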
An important step in machine learning is feature selection, which has a large effect on how well the algorithms perform.
• L1-based feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
...
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(x, y)
model = SelectFromModel(lsvc, prefit=True)
x_new = model.transform(x)
x has shape (150, 4); x_new has shape (150, 3).
• Tree-based feature selection:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
...
clf = ExtraTreesClassifier()
clf = clf.fit(x, y)
print(clf.feature_importances_)
model = SelectFromModel(clf, prefit=True)
x_new = model.transform(x)
x has shape (150, 4); x_new has shape (150, 2).
• Recursive feature elimination (RFE), which searches over subsets of the feature set to find the best one:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
...
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(x, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
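rfe.support_ is exactly the boolean mask of the features that were kept (ranking_ assigns rank 1 to each of them), and the fitted selector can reduce the data just like the selectors above:
# keep only the 3 selected attributes
x_new = rfe.transform(x)
print(x_new.shape)  # (150, 3)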
Logistic regression, a famous machine learning algorithm, is mostly used for classification problems (binary classification), but it handles multiclass problems as well. An advantage of the algorithm is that it produces a probability for each possible class of every input object.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
...
model = LogisticRegression()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
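The per-class probabilities mentioned above come from predict_proba; a quick sketch:
# one row per sample, one column per class
print(model.predict_proba(x[:3]))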
Naive Bayes is also one of the best-known machine learning algorithms. Its main task is to recover the density of the training data's distribution, and it usually performs well on multiclass classification problems.
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
...
model = GaussianNB()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
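Since GaussianNB models each feature per class as a Gaussian, the fitted density is easy to inspect (a sketch; theta_ holds the per-class feature means and class_prior_ the learned class priors):
print(model.class_prior_)  # prior probability of each class
print(model.theta_)        # mean of each feature, per class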
The k-nearest neighbors (kNN) method is often used as a component of more complex classification algorithms, for instance by using its estimate as one of an object's features. Moreover, the algorithm often shows very good quality on regression problems.
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
...
model = KNeighborsClassifier()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
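For the regression case mentioned above, scikit-learn offers KNeighborsRegressor with the same interface. A sketch on a made-up toy task (predicting one iris feature from the other three, purely for illustration):
from sklearn.neighbors import KNeighborsRegressor
# toy regression: predict petal width (column 3) from the other columns
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(x[:, :3], x[:, 3])
print(reg.predict(x[:3, :3]))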
Classification and regression trees (CART) are often used for problems in which the objects have categorical features; they are applicable to both regression and classification. Decision trees are well suited to multiclass classification.
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
...
model = DecisionTreeClassifier()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
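A nice property of decision trees is that the learned rules can be printed. Assuming scikit-learn >= 0.21, export_text renders the fitted tree (a sketch):
from sklearn.tree import export_text
# print the learned decision rules as indented text
print(export_text(model))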
SVM (support vector machine) is one of the most popular machine learning algorithms. It is used mainly for classification problems and, like logistic regression, supports multiclass classification.
from sklearn import metrics
from sklearn.svm import SVC
...
model = SVC()
model.fit(x, y)
# make predictions
expected = y
predicted = model.predict(x)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
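One caveat about all the snippets above: they predict on the same data the model was fitted on, so the reports are optimistic. A minimal sketch of a fairer evaluation with a held-out test set:
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# hold out 30% of the data for evaluation
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
model = SVC()
model.fit(x_train, y_train)
print(metrics.classification_report(y_test, model.predict(x_test)))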
Beyond classification and regression, scikit-learn offers many more sophisticated techniques, including clustering and ensemble methods such as bagging and boosting.
If you lack the experience to pick hyperparameters, scikit-learn also provides functions for finding a model's optimal parameters. As an example of choosing the regularization parameter, we can sweep over values in a given range, though sometimes random sampling works even better:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
...
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
search = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
search.fit(x, y)
# summarize the results of the grid search
print(search.best_score_)
print(search.best_estimator_.alpha)
search.best_score_ reports the best cross-validated score, and search.best_estimator_.alpha the alpha that achieved it. Alternatively, sample alpha at random:
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
...
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
search.fit(x, y)
# summarize the results of the randomized search
print(search.best_score_)
print(search.best_estimator_.alpha)
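Both searchers cross-validate every candidate internally; the cv argument controls the number of folds, and best_params_ returns the winning setting directly (a sketch):
# 5-fold cross-validation and direct access to the winning parameters
search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100, cv=5)
search.fit(x, y)
print(search.best_params_)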
scikit-learn is a convenient and powerful machine learning library and an excellent platform for learning ML; no wonder so many courses recommend it.