1. Feature Extraction
Raw data comes in many kinds: besides digitized signal data, there is a great deal of symbolic text. We cannot use symbolic text directly in a computational task; instead, we must first quantize the text into feature vectors through some preprocessing. Some symbolically represented features are already fairly structured and stored as dictionaries, in which case we use DictVectorizer to extract and vectorize them, for example:
import pandas as pd
measurements = [
    {'city': 'Dubai', 'temperature': 33},
    {'city': 'London', 'temperature': 12},
    {'city': 'San Francisco', 'temperature': 18}
]
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
print(vec.fit_transform(measurements).toarray())
print(vec.get_feature_names_out())
# Output
[[ 1.  0.  0. 33.]
 [ 0.  1.  0. 12.]
 [ 0.  0.  1. 18.]]
['city=Dubai' 'city=London' 'city=San Francisco' 'temperature']
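Note that transform reuses the vocabulary learned during fit: a categorical value never seen during fit simply maps to all-zero one-hot columns, while numeric features pass through unchanged. A small sketch, continuing with the vec fitted above (the city 'Tokyo' is made up for illustration):
# 'Tokyo' was not in the fit data, so all three city columns stay 0
print(vec.transform([{'city': 'Tokyo', 'temperature': 21}]).toarray())
# [[ 0.  0.  0. 21.]]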
Other text data is more raw: it uses almost no special data structure for storage and is just a sequence of strings. The most common way to represent such text as features is the bag-of-words model. As the name suggests, word order is ignored, and each word that appears in the training text is treated as an individual feature. The collection of these unique words is called the vocabulary, so every training document can be mapped to a vector over this high-dimensional vocabulary. Two common ways to compute the feature values are CountVectorizer and TfidfVectorizer:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(news.data,news.target,test_size=0.25,random_state=33)
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X_count_train = vec.fit_transform(X_train)
X_count_test = vec.transform(X_test)
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_count_train,y_train)
y_count_predict = mnb.predict(X_count_test)
from sklearn.metrics import classification_report
print('The accuracy of Naive Bayes (CountVectorizer):', mnb.score(X_count_test, y_test))
print(classification_report(y_test,y_count_predict,target_names=news.target_names))
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
X_tfidf_train = tfidf_vec.fit_transform(X_train)
X_tfidf_test = tfidf_vec.transform(X_test)
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_tfidf_train,y_train)
y_tfidf_predict = mnb.predict(X_tfidf_test)
from sklearn.metrics import classification_report
print('The accuracy of Naive Bayes (TfidfVectorizer):', mnb.score(X_tfidf_test, y_test))
print(classification_report(y_test,y_tfidf_predict,target_names=news.target_names))
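CountVectorizer uses raw term counts as feature values, while TfidfVectorizer additionally down-weights terms that appear in many documents. A minimal sketch of the default weighting sklearn applies (smoothed idf followed by l2 normalization; the toy counts below are made up for illustration):
import numpy as np
n_docs = 4                   # hypothetical corpus size
df = np.array([4., 2., 1.])  # hypothetical document frequencies of three terms
tf = np.array([2., 1., 1.])  # counts of those terms within one document
# sklearn's default: idf = ln((1 + n) / (1 + df)) + 1, then l2-normalize tf * idf
idf = np.log((1 + n_docs) / (1 + df)) + 1
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf)
print(tfidf)  # the rarest term gains weight relative to its raw count
This is why TF-IDF often improves over raw counts when frequent but uninformative words dominate the documents.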
2. Feature Selection
Feature selection is slightly different from methods such as PCA that reconstruct features from principal components: we often cannot interpret the reconstructed PCA features, whereas feature selection never modifies the feature values. It instead focuses on finding the small set of features that contributes most to model performance. Here we use the SelectPercentile method from sklearn's feature_selection module, which ranks features by a score and keeps the top n percent:
import pandas as pd
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
y = titanic['survived']
X = titanic.drop(['row.names','name','survived'],axis=1)
X['age'] = X['age'].fillna(X['age'].mean())
X = X.fillna('UNKNOWN')
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=33)
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
print(len(vec.feature_names_))  # number of columns after one-hot encoding
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train,y_train)
print(dt.score(X_test, y_test))  # accuracy using all features
# Select the top 20% of features
from sklearn import feature_selection
fs = feature_selection.SelectPercentile(feature_selection.chi2,percentile=20)
X_train_fs = fs.fit_transform(X_train,y_train)
dt.fit(X_train_fs,y_train)
X_test_fs = fs.transform(X_test)
print(dt.score(X_test_fs, y_test))  # accuracy using only the top 20% of features
# Search for the best percentage of features to keep
from sklearn.model_selection import cross_val_score
import numpy as np
percentiles = range(1,100,2)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_val_score(dt, X_train_fs, y_train, cv=5)
    results = np.append(results, scores.mean())
print(results)
# Find the best percentile
opt = np.where(results==results.max())[0][0]
print(opt)
print('Optimal percentile of features:', percentiles[opt])
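To see how the cross-validation score varies with the kept percentage, a quick plot of results against percentiles helps; a minimal sketch assuming matplotlib is available:
import matplotlib.pyplot as plt
plt.plot(percentiles, results)
plt.xlabel('percentile of features kept')
plt.ylabel('cross-validation accuracy')
plt.show()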
3. Cross-Validation
When you tackle machine-learning tasks on a real platform, you quickly find that you can only submit prediction results and never see the correct answers, which forces you to make full use of the data at hand. The usual approach is to split the available data by sampling: one part trains the model parameters, while another part, called the validation set, is used to tune the model configuration, select features, and estimate performance on unseen test data. Depending on the complexity of the validation procedure, validation schemes fall into leave-one-out validation and cross-validation:
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
y = titanic['survived']
X = titanic.drop(['row.names','name','survived'],axis=1)
X['age'] = X['age'].fillna(X['age'].mean())
X = X.fillna('UNKNOWN')
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=33)
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
from sklearn.model_selection import cross_val_score
from sklearn import feature_selection
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy')
# Keep the top 20% of features by chi-squared score, as in the section above
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
# 5-fold cross-validation: average the score over 5 train/validation splits
scores = cross_val_score(dt, X_train_fs, y_train, cv=5)
print(scores.mean())
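The paragraph above also mentions leave-one-out validation, the extreme case where each fold holds a single sample. A minimal sketch with sklearn's LeaveOneOut, reusing dt and the selected features from above (note it fits one model per training sample, so it is far more expensive than 5-fold):
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
# Each sample serves exactly once as a one-element test set
loo_scores = cross_val_score(dt, X_train_fs, y_train, cv=loo)
print(loo_scores.mean())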
4. Grid Search
A model usually has many parameters to configure that are not learned from the data; these are collectively called the model's hyperparameters, such as K in K nearest neighbors or the choice of kernel in a support vector machine. In most cases the space of hyperparameter settings is effectively unlimited, so within a finite time budget, besides manually validating a few preset combinations, we can also tune hyperparameters by searching systematically over a predefined grid of combinations; this search method is called grid search.
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(news.data[:3000],news.target[:3000],test_size=0.25,random_state=33)
from sklearn.feature_extraction.text import TfidfVectorizer
# These standalone matrices are not reused below: the Pipeline vectorizes the raw text itself
vec = TfidfVectorizer()
X_tfidf_train = vec.fit_transform(X_train)
X_tfidf_test = vec.transform(X_test)
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
# Use a Pipeline to simplify the setup by chaining the text vectorizer and the classifier
clf = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', analyzer='word')),
    ('svc', SVC())
])
import numpy as np
parameters = {
    'svc__gamma': np.logspace(-2, 1, 4),
    'svc__C': np.logspace(-1, 1, 3)
}
# n_jobs=-1 means use all of the machine's CPU cores
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(clf,parameters,verbose=2,refit=True,cv=3,n_jobs=-1)
%time _ = gs.fit(X_train, y_train)  # %time is IPython magic; in a plain script, call gs.fit directly
print(gs.best_params_, gs.best_score_)
print(gs.score(X_test, y_test))
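Because refit=True, the winning hyperparameter combination is automatically refit on the whole training set, so gs itself can predict on raw text; the refit pipeline is also exposed as an attribute. A short usage sketch:
best_model = gs.best_estimator_  # the pipeline refit with the best svc__gamma and svc__C
print(best_model.predict(X_test[:2]))  # predicted labels for two raw documents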