以下内容转载自两篇博文,具体地址直接百度可以看到~
1.交叉验证可以用来确定模型的参数
一般来说准确率(accuracy)会用于判断分类(Classification)模型的好坏。
#coding:utf-8
from sklearn.datasets import load_iris # iris数据集
from sklearn.model_selection import train_test_split # 分割数据模块
from sklearn.neighbors import KNeighborsClassifier # K最近邻(kNN,k-NearestNeighbor)分类算法
from sklearn.model_selection import cross_val_score # K折交叉验证模块
import matplotlib.pyplot as plt #可视化模块
iris = load_iris()
X = iris.data
y = iris.target
#建立测试参数集
k_range = range(1, 31)
k_scores = []
#藉由迭代的方式来计算不同参数对模型的影响,并返回交叉验证后的平均准确率
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
k_scores.append(scores.mean())
#可视化数据
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()
一般来说平均方差(Mean squared error)会用于判断回归(Regression)模型的好坏。
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
loss = -cross_val_score(knn, X, y, cv=10, scoring='neg_mean_squared_error')
k_scores.append(loss.mean())
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated MSE')
plt.show()
2.检视过拟合
验证SVC中的一个参数 gamma 在什么范围内能使 model 产生好的结果. 以及过拟合和 gamma 取值的关系.
#coding:utf-8
from sklearn.model_selection import validation_curve
from sklearn.datasets import load_digits
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import numpy as np
#digits数据集
digits = load_digits()
X = digits.data
y = digits.target
#建立参数测试集
param_range = np.logspace(-6, -2.3, 5)
#使用validation_curve快速找出参数对模型的影响
train_loss, test_loss = validation_curve(
SVC(), X, y, param_name='gamma', param_range=param_range, cv=10, scoring='neg_mean_squared_error')
#平均每一轮的平均方差
train_loss_mean = -np.mean(train_loss, axis=1)
test_loss_mean = -np.mean(test_loss, axis=1)
#可视化图形
plt.plot(param_range, train_loss_mean, 'o-', color="r",
label="Training")
plt.plot(param_range, test_loss_mean, 'o-', color="g",
label="Cross-validation")
plt.xlabel("gamma")
plt.ylabel("Loss")
plt.legend(loc="best")
plt.show()
3.用于模型的选择
交叉验证也可以帮助我们进行模型选择,以下是一组例子,分别使用iris数据,KNN和logistic回归模型进行模型的比较和选择。
In [13]:
# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=20)
print cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean()
0.98
In [14]:
# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()
0.953333333333
4.用于特征选择
下面我们使用advertising数据,通过交叉验证来进行特征的选择,对比不同的特征组合对于模型的预测效果。
In [15]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
In [16]:
# read in the advertising dataset
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
In [17]:
# create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']
# use the list to select a subset of the DataFrame (X)
X = data[feature_cols]
# select the Sales column as the response (y)
y = data.Sales
In [18]:
# 10-fold cv with all features
lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring='mean_squared_error')
print scores
[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754 -1.74163618
-8.17338214 -2.11409746 -3.04273109 -2.45281793]
这里要注意的是,上面的scores都是负数,为什么均方误差会出现负数的情况呢?因为这里的mean_squared_error是一种损失函数,优化的目标的使其最小化,而分类准确率是一种奖励函数,优化的目标是使其最大化。
In [19]:
# fix the sign of MSE scores
mse_scores = -scores
print mse_scores
[ 3.56038438 3.29767522 2.08943356 2.82474283 1.3027754 1.74163618
8.17338214 2.11409746 3.04273109 2.45281793]
In [20]:
# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print rmse_scores
[ 1.88689808 1.81595022 1.44548731 1.68069713 1.14139187 1.31971064
2.85891276 1.45399362 1.7443426 1.56614748]
In [21]:
# calculate the average RMSE
print rmse_scores.mean()
1.69135317081
In [22]:
# 10-fold cross-validation with two features (excluding Newspaper)
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='mean_squared_error')).mean()
1.67967484191
由于不加入Newspaper这一个特征得到的分数较小(1.68 < 1.69),所以,使用所有特征得到的模型是一个更好的模型。