scikit-learn official documentation: http://scikit-learn.org/stable/tutorial/
Chinese translation: https://muxuezi.github.io/posts/4-from-linear-regression-to-logistic-regression.html
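Contents:
1. Binary classification:
>> Logistic regression
>> Grid search
2. Multi-class classification
3. Multi-label classification
1. Binary classification:
>> Logistic regression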
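Logistic regression is used for classification tasks. The goal of classification is to find a function that maps an observation to its associated class or label. A learning algorithm must use pairs of feature vectors and their corresponding labels to estimate the parameters of this mapping function, so that it classifies as accurately as possible.
In binary classification, the classifier must assign each instance to one of two classes. Examples include predicting whether a patient has a particular disease, whether an audio sample contains human speech, and whether the Duke men's basketball team will win or lose its first game of the NCAA tournament.
In multi-class classification, the classifier must assign each instance one label from a set of several.
In logistic regression, the response variable describes a probability, like the probability that a coin toss comes up heads. If the response variable equals or exceeds a chosen threshold, the prediction is the positive class (heads); otherwise it is the negative class (tails). The response variable is modeled as a function of a linear combination of the explanatory variables, as in linear regression; that function is the logistic function. The logistic function, whose values always lie between 0 and 1, is:
F(t) = 1 / (1 + e^(-t))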
import numpy as np
import matplotlib.pyplot as plt
#Plot the logistic function F(t) = 1 / (1 + e^(-t)) for t from -6 to 6
plt.figure()
plt.axis([-6, 6, 0, 1])
plt.grid(True)
X = np.arange(-6, 6, 0.1)
y = 1 / (1 + np.exp(-X))
plt.plot(X, y, 'b-')
plt.show()
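In logistic regression, t is a linear combination of the explanatory variables, t = β0 + βx, so that:
F(t) = 1 / (1 + e^(-(β0 + βx)))
The logit function is the inverse of the logistic function:
g(x) = ln(F(t) / (1 - F(t))) = β0 + βx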
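The relationship between the logistic function, its inverse, and the decision threshold can be sketched in a few lines. This is only an illustration: the helper names and the coefficient values beta0 and beta are made up, not taken from any fitted model.

```python
import math

def logistic(t):
    """Map any real number t to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Inverse of the logistic function: map a probability back to t."""
    return math.log(p / (1.0 - p))

def predict(x, beta0, beta, threshold=0.5):
    """Return 1 if the modelled probability reaches the threshold, else 0."""
    t = beta0 + beta * x  # linear combination of the explanatory variable
    return 1 if logistic(t) >= threshold else 0

# Hypothetical coefficients, chosen only for illustration
beta0, beta = -3.0, 2.0
for x in (0.0, 1.5, 3.0):
    p = logistic(beta0 + beta * x)
    # logit(p) recovers the linear combination beta0 + beta * x
    print(x, round(p, 3), predict(x, beta0, beta), round(logit(p), 3))
```

Raising the threshold above 0.5 makes the positive prediction rarer; lowering it makes it more frequent.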
With the logistic regression model defined, we can use it for a classification task.
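#Spam classification
A classic binary classification problem is spam filtering. Here we classify spam SMS messages: first extract a feature vector from each message with TF-IDF, then classify the vectors with logistic regression.
Data source: the SMS Spam Collection Data Set from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).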
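To make the TF-IDF step less of a black box, here is a rough pure-Python sketch of one common TF-IDF variant: smoothed IDF followed by L2 normalisation, which to my understanding is what TfidfVectorizer does by default. The whitespace tokeniser and the three messages are simplifying assumptions made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                   for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Made-up messages, for illustration only
docs = ["free entry win cash", "ok see you later", "win free cash now"]
vectors = tfidf(docs)
for vector in vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

Terms that occur in fewer messages (such as "entry") receive a larger weight than terms shared by several messages (such as "free").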
#First, some descriptive statistics with pandas:
import pandas as pd
#The file is tab-separated with no header row: column 0 is the label, column 1 the message
df = pd.read_csv(r'D:\每日工作\学习笔记\test\mlslpic\SMSSpamCollection', delimiter='\t', header=None)
print(df.head())
print('Number of spam messages:', df[df[0] == 'spam'][0].count())
print('Number of ham messages:', df[df[0] == 'ham'][0].count())
out:
      0  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham  Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
Number of spam messages: 747
Number of ham messages: 4825
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
#Load the data file with pandas, then use train_test_split to split it into a training set (75%) and a test set (25%):
df = pd.read_csv(r'D:\每日工作\学习笔记\test\mlslpic\SMSSpamCollection', delimiter='\t', header=None)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
#Extract feature vectors from the messages with TF-IDF
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
#Classify with logistic regression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
#Show the last five test messages next to their predicted labels
for message, prediction in zip(X_test_raw.iloc[-5:], predictions[-5:]):
    print('Predicted type: %s. Message: %s' % (prediction, message))
out (train_test_split shuffles at random, so the messages and labels differ from run to run):
Predicted type: ham. Message: MOON has come to color your dreams, STARS to make them musical and my SMS to give you warm and Peaceful Sleep. Good Night
Predicted type: ham. Message: Your B4U voucher w/c 27/03 is MARSMS. Log onto www.B4Utele.com for discount credit. To opt out reply stop. Customer care call 08717168528
Predicted type: ham. Message: Adult 18 Content Your video will be with you shortly
Predicted type: ham. Message: Had your mobile 11mths ? Update for FREE to Oranges latest colour camera mobiles & unlimited weekend calls. Call Mobile Upd8 on freefone 08000839402 or 2StopTxt
Predicted type: ham. Message: Well, I have to leave for my class babe ... You never came back to me ... :-( ... Hope you have a nice sleep, my love
#Evaluation:
#Accuracy: scikit-learn provides accuracy_score to compute it; LogisticRegression.score() returns the same measure
#Accuracy is the fraction of predictions that are correct, but it cannot tell false positives from false negatives
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracy:', np.mean(scores), scores)
out:
Accuracy: 0.957646620634 [ 0.96052632 0.95933014 0.95454545 0.95095694 0.96287425]
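#Precision and recall:
#Precision is the fraction of messages predicted as spam that really are spam, P=TP/(TP+FP)
#Recall, called sensitivity in medicine, is here the fraction of all real spam messages that the classifier correctly identifies, R=TP/(TP+FN)
#precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
#print('Precision:', np.mean(precisions), precisions)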
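The definitions above can be checked by hand on a tiny example. The sketch below counts TP, FP, FN and TN for made-up labels (1 = spam, 0 = ham) and shows how accuracy can look good while recall is poor; the helper name confusion_counts is my own.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# Made-up labels: 3 spam messages, of which the classifier finds only 1
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of all predictions that are right
precision = tp / (tp + fp)           # P = TP / (TP + FP)
recall = tp / (tp + fn)              # R = TP / (TP + FN)
print(accuracy, precision, recall)
```

Here accuracy is 0.8 and precision is 1.0, yet two of the three spam messages are missed, which only recall reveals.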
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
#scoring='precision' and scoring='recall' treat label 1 as the positive class,
#so the 'label' column is assumed to hold 1 for spam and 0 for ham
df = pd.read_csv('mlslpic/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracy:', np.mean(scores), scores)
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision:', np.mean(precisions), precisions)
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall:', np.mean(recalls), recalls)
out:
Accuracy: 0.958373205742 [ 0.96291866 0.95334928 0.95813397 0.96172249 0.95574163]
Precision: 0.99217372134 [ 0.9875 0.98571429 1. 1. 0.98765432]
Recall: 0.672121212121 [ 0.71171171 0.62162162 0.66363636 0.63636364 0.72727273]
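#The classifier's precision is 99.2%: of the messages it flags as spam, 99.2% really are spam. Recall is lower at 67.2%: 32.8% of the real spam messages are treated as ham and go undetected.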
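#F1 measure
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1 measure:', np.mean(f1s), f1s)
#The F1 measure is about 80%. It is the harmonic mean of precision and recall, F1=2PR/(P+R), so it falls below their arithmetic mean (83.2% here) whenever the two are unbalanced; the larger the gap between precision and recall, the larger the penalty. F0.5 and F2 are sometimes used as well: they weight precision more heavily than recall, and recall more heavily than precision, respectively.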
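As a quick check of the figures above, the F-measures can be computed directly from the mean precision and recall that were printed earlier (0.992 and 0.672, rounded). This is a sketch with a helper name of my own; the cross-validated F1 is averaged per fold, so it may differ slightly from a value computed from the means.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta < 1 favours precision, beta > 1 favours recall, beta = 1 gives F1.
    """
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.992, 0.672  # rounded means from the output above

f1 = f_beta(precision, recall)
f05 = f_beta(precision, recall, beta=0.5)
f2 = f_beta(precision, recall, beta=2.0)
arithmetic_mean = (precision + recall) / 2

print(round(f1, 3), round(f05, 3), round(f2, 3), round(arithmetic_mean, 3))
```

Because precision is much higher than recall here, F0.5 comes out above F1 and F2 below it.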
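#ROC AUC
#The ROC (Receiver Operating Characteristic) curve visualizes a classifier's performance. Unlike accuracy, the ROC curve is insensitive to class imbalance in the data set, and it shows the classifier's performance at every value of the decision threshold rather than at a single one. The ROC curve plots the classifier's recall against its fall-out. Fall-out, also called the false positive rate, is the fraction of all negative samples that the classifier labels as positive:
#F=FP/(TN+FP)
#AUC is the area under the ROC curve; it reduces the curve to a single number that summarizes the classifier's expected performance. A classifier that predicts at random has an AUC of 0.5.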
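import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
#predict_proba returns one column per class; column 1 holds the probability of the positive class
predictions = classifier.predict_proba(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)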
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
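What roc_curve and auc compute can be sketched by hand: lower the threshold step by step, record (fall-out, recall) at each step, and integrate with the trapezoidal rule. The labels, scores and function names below are made up for illustration, and the sketch does not try to reproduce every detail of the library functions.

```python
def roc_points(y_true, scores):
    """Return (fall_out, recall) pairs for every distinct threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Everything scoring at or above the threshold is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule over consecutive (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up true labels (1 = spam) and predicted spam probabilities
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(y_true, scores)
auc_value = area_under_curve(points)
print(points)
print(auc_value)
```

The resulting area equals the share of (spam, ham) pairs in which the spam message receives the higher score, which is 8 out of 9 pairs in this toy data.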
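>> Grid search
Grid search is a method for finding the best hyperparameter values. It works by running the model with every candidate combination of parameter values and keeping the combination that performs best. Grid search is exhaustive, so its drawback is that the amount of computation is large even when each hyperparameter has only a few candidate values. It is, however, an embarrassingly parallel problem: the parameter combinations are independent of one another and the runs need no synchronization, so there are many ways to speed it up. scikit-learn provides GridSearchCV() for this: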
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    #the liblinear solver supports both the 'l1' and the 'l2' penalty searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
#Keys are '<pipeline step>__<parameter name>'; values are the candidates to try
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
df = pd.read_csv('mlslpic/sms.csv')
X, y = df['message'], df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameter set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
#Evaluate the best model found on the held-out test set
predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print('Precision:', precision_score(y_test, predictions))
print('Recall:', recall_score(y_test, predictions))
out:
[Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 1.8s
[Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 10.1s
[Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 27.4s
[Parallel(n_jobs=-1)]: Done 450 jobs | elapsed: 54.2s
[Parallel(n_jobs=-1)]: Done 800 jobs | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 1250 jobs | elapsed: 2.4min
[Parallel(n_jobs=-1)]: Done 1800 jobs | elapsed: 3.4min
[Parallel(n_jobs=-1)]: Done 2450 jobs | elapsed: 4.6min
[Parallel(n_jobs=-1)]: Done 3200 jobs | elapsed: 6.0min
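GridSearchCV() takes the estimator to evaluate (pipeline), the dictionary of hyperparameters (parameters), and the performance metric (scoring). n_jobs is the maximum number of concurrent jobs; setting it to -1 uses all CPU cores. The hyperparameters selected by the grid search performed well on the training set.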
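To see why the search is expensive, the candidate combinations can be enumerated with itertools.product. The sketch below reuses the two parameter grids that appear in these notes (the spam grid above and the smaller movie-review grid used later); the helper name and the assumption of 3 cross-validation folds for both are mine.

```python
from itertools import product

def enumerate_grid(parameters):
    """Return every combination of candidate values as a list of dicts."""
    names = sorted(parameters)
    return [dict(zip(names, values))
            for values in product(*(parameters[name] for name in names))]

# Grid used for the spam classifier
spam_grid = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
# Smaller grid used for the movie-review classifier
review_grid = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

cv_folds = 3  # each candidate is fitted once per fold
spam_candidates = enumerate_grid(spam_grid)
review_candidates = enumerate_grid(review_grid)
print(len(spam_candidates), len(spam_candidates) * cv_folds)
print(len(review_candidates), len(review_candidates) * cv_folds)
```

The smaller grid gives 24 candidates and 72 fits, matching the "Fitting 3 folds for each of 24 candidates, totalling 72 fits" line in the movie-review log; the spam grid needs 1536 candidates and 4608 fits.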
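2. Multi-class classification:
scikit-learn implements multi-class classification with the one-vs.-all (also called one-vs.-the-rest) strategy, which treats each class as its own binary classification problem. One classifier per class predicts whether a sample belongs to that class, and the class predicted with the greatest confidence is assigned to the sample. LogisticRegression() supported multi-class problems through the one-vs.-all strategy in the scikit-learn version used here; recent releases default to a multinomial formulation instead.
The data set can be downloaded from Kaggle (https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews).
We use phrases from Rotten Tomatoes movie reviews to rate films. Each phrase falls into one of five classes: negative, somewhat negative, neutral, somewhat positive, and positive. The explanatory variables are not always plain language: reviews vary widely and include sarcasm, negation, and other expressions whose meaning is not literal.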
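The one-vs.-all idea can be sketched with NumPy alone: relabel the data once per class (that class versus everything else), fit one binary logistic model each time, and predict the class whose model is most confident. This is an illustrative toy, not scikit-learn's implementation; the gradient-descent settings and the three made-up clusters are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(X):
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_binary(X, y, lr=0.1, steps=10000):
    """Plain gradient descent on the mean logistic loss."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
    return w

def fit_one_vs_rest(X, y):
    """Train one binary model per class: that class (1) versus all others (0)."""
    return {int(k): fit_binary(X, (y == k).astype(float)) for k in np.unique(y)}

def predict_one_vs_rest(models, X):
    """For every sample, pick the class whose binary model is most confident."""
    classes = sorted(models)
    confidences = np.column_stack([sigmoid(add_bias(X) @ models[k]) for k in classes])
    return np.array(classes)[confidences.argmax(axis=1)]

# Toy data: three well-separated clusters, made up for illustration
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 0], [5, 1], [6, 0],
              [0, 5], [1, 5], [0, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

models = fit_one_vs_rest(X, y)
predicted = predict_one_vs_rest(models, X)
print(predicted)
```

With five sentiment classes, the same scheme would train five binary models, one per class.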
import zipfile
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}
#Read the first file inside the zip archive as a tab-separated table with a header row
z = zipfile.ZipFile(r'D:\每日工作\学习笔记\test\mlslpic\train.tsv.zip')
df = pd.read_csv(z.open(z.namelist()[0]), header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
#cv=3 matches the 3 folds reported in the log below
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy', cv=3)
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameter set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
out:
Fitting 3 folds for each of 24 candidates, totalling 72 fits
[Parallel(n_jobs=3)]: Done 44 tasks | elapsed: 1.2min
[Parallel(n_jobs=3)]: Done 72 out of 72 | elapsed: 3.1min finished
Best score: 0.618
Best parameter set:
    clf__C: 10
    vect__max_df: 0.25
    vect__ngram_range: (1, 2)
    vect__use_idf: False
Evaluating multi-class classification
predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion matrix:', confusion_matrix(y_test, predictions))
print('Classification report:', classification_report(y_test, predictions))
out:
Accuracy: 0.63526848648
Confusion matrix: [[ 1144  1747   597    74    11]
 [  909  6011  6093   561    32]
 [  228  3185 32607  3667   162]
 [   23   399  6476  8203  1267]
 [    2    40   479  2508  1605]]
Classification report:
             precision    recall  f1-score   support
          0       0.50      0.32      0.39      3573
          1       0.53      0.44      0.48     13606
          2       0.70      0.82      0.76     39849
          3       0.55      0.50      0.52     16368
          4       0.52      0.35      0.42      4634
avg / total       0.62      0.64      0.62     78030
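The accuracy and the per-class precision and recall in the report can be recovered from the confusion matrix alone. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes (the row totals do match the support column).

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[1144, 1747,   597,   74,   11],
               [ 909, 6011,  6093,  561,   32],
               [ 228, 3185, 32607, 3667,  162],
               [  23,  399,  6476, 8203, 1267],
               [   2,   40,   479, 2508, 1605]])

correct = np.diag(cm)                  # correctly classified samples per class
support = cm.sum(axis=1)               # number of true samples per class
accuracy = correct.sum() / cm.sum()
recall = correct / support             # share of each true class that was found
precision = correct / cm.sum(axis=0)   # share of each predicted class that is right

print(round(float(accuracy), 4))
print(np.round(precision, 2))
print(np.round(recall, 2))
```

Most errors sit next to the diagonal: phrases are usually confused with a neighbouring sentiment class, and the large neutral class (2) absorbs many of them.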
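3. Multi-label classification:
In multi-label classification, each sample can be assigned any subset of the full set of labels. There are generally two ways to approach it, both of which are problem transformation methods: they turn the multi-label problem into single-label problems.
Method 1: convert each distinct set of labels that occurs in the training set into a single label (a power set transformation). This is intuitive but often impractical, because it multiplies the number of classes and most of the new classes are backed by only a few training samples. Moreover, the classifier can only predict label combinations that it has seen in the training set.
Method 2: treat every label as a binary classification problem. The classifier for each label predicts whether a sample belongs to that label. This transformation ensures that the single-label problems have the same training set as the multi-label problem, but it ignores relationships between the labels.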
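The two transformations can be sketched on a handful of samples. The label names and samples below are made up for illustration; only the target construction is shown, not the classifiers trained on them.

```python
# Made-up multi-label targets: each sample carries a set of labels
samples = [{"local"}, {"local", "us"}, {"business"}, {"local", "us"}, {"us", "business"}]
all_labels = sorted(set().union(*samples))

# Method 1 (power set): every distinct label set seen in training becomes one class
powerset_classes = sorted({tuple(sorted(s)) for s in samples})
powerset_targets = [powerset_classes.index(tuple(sorted(s))) for s in samples]

# Method 2 (one binary problem per label): 1 if the sample carries the label, else 0
binary_targets = {label: [int(label in s) for s in samples] for label in all_labels}

print(powerset_classes)
print(powerset_targets)
for label, targets in binary_targets.items():
    print(label, targets)
```

With 3 labels there are 8 possible label sets, but only the 4 that occur in the samples become classes in method 1; method 2 always yields exactly 3 binary problems over the same 5 samples.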
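Evaluating multi-label classification:
The most common measures are Hamming loss and Jaccard similarity.
Hamming loss is the average fraction of labels that are predicted incorrectly. It is a loss function: when every prediction is correct, that is, there are no wrong labels, its value is 0.
Jaccard similarity, or the Jaccard index, is the size of the intersection of the predicted labels and the true labels divided by the size of their union. Its value lies between 0 and 1: J(Predicted,True)=|Predicted ∩ True|/|Predicted ∪ True|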
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score
y_true = np.array([[0, 1], [1, 1]])
print(hamming_loss(y_true, np.array([[0, 1], [1, 1]])))
print(hamming_loss(y_true, np.array([[1, 1], [1, 1]])))
print(hamming_loss(y_true, np.array([[1, 1], [0, 1]])))
#jaccard_score replaces the removed jaccard_similarity_score; average='samples' averages the per-sample Jaccard index
print(jaccard_score(y_true, np.array([[0, 1], [1, 1]]), average='samples'))
print(jaccard_score(y_true, np.array([[1, 1], [1, 1]]), average='samples'))
print(jaccard_score(y_true, np.array([[1, 1], [0, 1]]), average='samples'))
out:
0.0
0.25
0.5
1.0
0.75
0.5
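The six numbers above can be reproduced by hand, which makes the two definitions concrete. This is a minimal sketch with function names of my own; returning 1.0 when both label sets are empty is an assumption made here, not something the notes specify.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    total = sum(len(row) for row in y_true)
    return wrong / total

def jaccard_manual(y_true, y_pred):
    """Mean over samples of |predicted ∩ true| / |predicted ∪ true|."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        intersection = sum(1 for t, p in zip(row_t, row_p) if t and p)
        union = sum(1 for t, p in zip(row_t, row_p) if t or p)
        # Assumption: two empty label sets count as a perfect match
        scores.append(intersection / union if union else 1.0)
    return sum(scores) / len(scores)

y_true = [[0, 1], [1, 1]]
candidates = ([[0, 1], [1, 1]], [[1, 1], [1, 1]], [[1, 1], [0, 1]])
results = [(hamming_loss_manual(y_true, y_pred), jaccard_manual(y_true, y_pred))
           for y_pred in candidates]
for hamming, jaccard in results:
    print(hamming, jaccard)
```

Hamming loss counts wrong label slots across all samples, while Jaccard similarity is computed per sample and then averaged.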