SVM advantages:
SVM disadvantages:
General steps for using an SVM prediction model
Table of SVM types, uses, and key parameters
SVMs fall into three main categories: 1. classification 2. regression 3. anomaly detection
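The three categories map onto scikit-learn estimator classes. A minimal sketch on toy 1-D data (the data and parameters here are purely illustrative):

```python
# The three SVM families as scikit-learn estimators (toy data for illustration)
from sklearn.svm import SVC, SVR, OneClassSVM

clf = SVC(kernel='linear')   # classification
reg = SVR(kernel='linear')   # regression
det = OneClassSVM(nu=0.1)    # anomaly detection

X = [[0.0], [1.0], [2.0], [3.0]]
clf.fit(X, [0, 0, 1, 1])             # labeled classes
reg.fit(X, [0.0, 1.0, 2.0, 3.0])     # continuous targets
det.fit(X)                           # unlabeled: learns the "normal" region
print(clf.predict([[2.5]]))          # a point nearer the class-1 examples
```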
TIPs: sklearn's SVM module wraps two libraries, libsvm and liblinear; when fitting a model, data flows between Python and these libraries, which consumes some memory.
If memory allows, it is best to set the SVM's cache_size parameter above its default of 200 (MB), for example 1000.
Parameter notes:
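For instance, cache_size is passed directly to the estimator's constructor (the value 1000 below is just an example, as in the tip above):

```python
# cache_size is expressed in MB (default 200); raising it can speed up
# training on larger datasets when memory allows
from sklearn.svm import SVC

model = SVC(kernel='rbf', cache_size=1000)
print(model.cache_size)
```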
1. SVC classification (handwritten digit recognition)
# Load the data
from sklearn import datasets
digits = datasets.load_digits()
X,y = digits.data, digits.target
# Plot the first ten digits
import matplotlib.pyplot as plt
# Note: the subplot index k must run from 1 to 10, hence enumerate() takes start=1
for k, img in enumerate(range(10), start=1):
    plt.subplot(2, 5, k)
    plt.imshow(digits.images[img], cmap='binary', interpolation='none')
plt.show()
print(X.shape)
#X[0]
# scipy.stats.itemfreq was removed in SciPy 1.3; count label frequencies with NumPy instead
import numpy as np
print(np.unique(y, return_counts=True))
TIPs: if you run into a class-imbalance problem, you can:
1. Keep the imbalance and let predictions lean toward the more frequent classes
2. Use weights to restore equality between classes, allowing observations to be counted multiple times
3. Drop some examples from the classes that have too many of them
Parameter-based approach to class imbalance:
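The parameter in question is class_weight, which takes either the string 'balanced' or an explicit per-class dictionary; a minimal sketch of both forms (the {0: 1, 1: 10} weights are illustrative, not a recommendation):

```python
# Two parameter-level ways to counter class imbalance with SVC
from sklearn.svm import SVC

# 1) 'balanced' derives weights automatically, as n_samples / (n_classes * count_per_class)
auto_weighted = SVC(class_weight='balanced')

# 2) explicit per-class weights: here a mistake on class 1 costs 10x more
manual_weighted = SVC(class_weight={0: 1, 1: 10})
print(auto_weighted.class_weight, manual_weighted.class_weight)
```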
# Split the data into a training set and a test set, and standardize it
# (sklearn.cross_validation was removed; these now live in sklearn.model_selection)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler
# We keep 30% random examples for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# We scale the data in the range [-1,1]
scaling = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train = scaling.transform(X_train)
X_test = scaling.transform(X_test)
import numpy as np   # needed for np.mean below
from sklearn.svm import SVC
# We balance the classes so you can see how it works
learning_algo = SVC(kernel='linear', class_weight='balanced')
cv_performance = cross_val_score(learning_algo, X_train, y_train, cv=5)
test_performance = learning_algo.fit(X_train, y_train).score(X_test, y_test)
print ('Cross-validation accuracy score: %0.3f, test accuracy score: %0.3f' % (np.mean(cv_performance),test_performance))
The results show good accuracy.
The C parameter defaults to 1; below we use a grid search to pick the optimal C.
# (GridSearchCV moved from sklearn.grid_search to sklearn.model_selection)
from sklearn.model_selection import GridSearchCV
learning_algo = SVC(class_weight='balanced', random_state=101)
# Define a list of 7 values for C, from 10^-3 to 10^3
search_space = {'C': np.logspace(-3, 3, 7)}
gridsearch = GridSearchCV(learning_algo, param_grid=search_space, scoring='accuracy', refit=True, cv=10, n_jobs=-1)
gridsearch.fit(X_train,y_train)
cv_performance = gridsearch.best_score_
test_performance = gridsearch.score(X_test, y_test)
print ('Cross-validation accuracy score: %0.3f, test accuracy score: %0.3f' % (cv_performance,test_performance))
print ('Best C parameter: %0.1f' % gridsearch.best_params_['C'])
The model performs best when C is 100.
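Because refit=True, GridSearchCV refits the winning model on the whole training set and exposes it as best_estimator_; a sketch of reusing it for prediction (the C grid and cv are reduced here just to keep the example quick):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# a smaller grid and fewer folds than above, for speed
gridsearch = GridSearchCV(SVC(class_weight='balanced', random_state=101),
                          param_grid={'C': np.logspace(-1, 2, 4)},
                          scoring='accuracy', refit=True, cv=3)
gridsearch.fit(X_train, y_train)

# best_estimator_ is already refit on all of X_train;
# gridsearch.predict(...) delegates to it
predictions = gridsearch.best_estimator_.predict(X_test)
print(gridsearch.best_params_['C'], predictions.shape)
```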