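In binary classification, the classifier must assign each instance to one of two classes. Examples include predicting whether a patient has a particular disease, whether an audio sample contains human speech, and whether the Duke men's basketball team will win or lose its first game in the NCAA tournament.
In logistic regression, the response variable describes a probability, much like the probability that a coin flip lands heads. If the response variable equals or exceeds a chosen threshold, the prediction is heads (the positive class); otherwise it is tails (the negative class). As in linear regression, the response is modeled from a linear combination of the explanatory variables, but that combination is passed through the logistic function. The logistic function, whose value always lies between 0 and 1, is $F(t) = \frac{1}{1 + e^{-t}}$ and is plotted below: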
- import matplotlib.pyplot as plt
- import numpy as np
- # Plot the logistic function on the interval [-6, 6)
- plt.figure()
- plt.axis([-6, 6, 0, 1])
- plt.grid(True)
- X = np.arange(-6, 6, 0.1)
- y = 1 / (1 + np.exp(-X))
- plt.plot(X, y, 'b-')
- plt.show()

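In logistic regression, $t$ is a linear combination of the explanatory variables:

$$t = \beta_0 + \beta x, \qquad F(t) = \frac{1}{1 + e^{-(\beta_0 + \beta x)}}$$

The logit function is the inverse of the logistic function; it maps the probability $F(t)$ back to the linear combination of the explanatory variables:

$$\operatorname{logit}(F(t)) = \ln\frac{F(t)}{1 - F(t)} = \beta_0 + \beta x$$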
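The relationship between the logistic function, the logit function, and the decision threshold can be checked numerically. The following is a minimal sketch; the coefficients `beta_0` and `beta_1`, the sample values in `x`, and the threshold of 0.5 are assumptions chosen for illustration, not values from a fitted model.

```python
import numpy as np

def logistic(t):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Inverse of the logistic function: map a probability back to a real number."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients for a model with one explanatory variable
beta_0, beta_1 = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

t = beta_0 + beta_1 * x               # linear combination: [-5, -1, 0, 5]
p = logistic(t)                       # probability of the positive class
predicted = (p >= 0.5).astype(int)    # positive when p reaches the threshold

print(p)
print(predicted)                      # [0 0 1 1]
print(np.allclose(logit(p), t))       # logit undoes the logistic function
```

Note that `x = 0.5` gives `t = 0` and a probability of exactly 0.5, which counts as positive because the rule is "equal to or greater than" the threshold.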

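A classic binary classification problem is spam classification. Here we classify SMS spam: we first extract a TF-IDF feature vector from each message and then classify the messages with logistic regression.
Data source: the SMS Spam Collection Data Set from the UCI Machine Learning Repository.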
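Before working with the real data, the two steps (TF-IDF feature extraction, then logistic regression) can be tried on a tiny corpus. This is only a sketch: the six messages and their labels are invented for illustration and are not taken from the SMS data set, and a model trained on so little text says nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus, invented for illustration
messages = [
    "Win a free prize now, call this number",
    "Free entry to win cash, text now",
    "Are we still meeting for lunch today",
    "I will call you after class",
    "Claim your free voucher, reply now",
    "See you at home tonight",
]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

# Step 1: turn each message into a TF-IDF weighted feature vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)

# Step 2: fit a logistic regression classifier on those vectors
classifier = LogisticRegression()
classifier.fit(X, labels)

# New messages must be transformed with the same fitted vocabulary
new_messages = ["free cash prize, call now", "lunch after class today"]
X_new = vectorizer.transform(new_messages)
predicted = classifier.predict(X_new)
probabilities = classifier.predict_proba(X_new)  # columns follow classifier.classes_

for message, label in zip(new_messages, predicted):
    print("%s -> %s" % (message, label))
```

The vectorizer is fitted once and then reused with `transform`, so that new messages are mapped onto the same feature columns the classifier was trained on.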
- import pandas as pd
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.linear_model import LogisticRegression
- from sklearn.model_selection import train_test_split
- # The file is tab-separated with no header row: column 0 holds the label, column 1 the message
- df = pd.read_csv(r'D:\每日工作\学习笔记\test\mlslpic\SMSSpamCollection', delimiter='\t', header=None)
- print(df.head())
- print('Number of spam messages:', df[df[0] == 'spam'][0].count())
- print('Number of ham messages:', df[df[0] == 'ham'][0].count())
- out:
- 0 1
- 0 ham Go until jurong point, crazy.. Available only ...
- 1 ham Ok lar... Joking wif u oni...
- 2 spam Free entry in 2 a wkly comp to win FA Cup fina...
- 3 ham U dun say so early hor... U c already then say...
- 4 ham Nah I don't think he goes to usf, he lives aro...
- Number of spam messages: 747
- Number of ham messages: 4825
- import numpy as np
- import pandas as pd
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.linear_model import LogisticRegression
- from sklearn.model_selection import train_test_split, cross_val_score
- df = pd.read_csv(r'D:\每日工作\学习笔记\test\mlslpic\SMSSpamCollection', delimiter='\t', header=None)
- # Split messages (column 1) and labels (column 0) into training and test sets
- X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
- # Learn the TF-IDF vocabulary from the training messages only, then reuse it for the test set
- vectorizer = TfidfVectorizer()
- X_train = vectorizer.fit_transform(X_train_raw)
- X_test = vectorizer.transform(X_test_raw)
- classifier = LogisticRegression()
- classifier.fit(X_train, y_train)  # the model must be fitted before it can predict
- predictions = classifier.predict(X_test)
- # Show the first five test messages next to their predicted labels
- for prediction, message in zip(predictions[:5], X_test_raw.iloc[:5]):
-     print('Prediction: %s. Message: %s' % (prediction, message))
- out:
- Prediction: ham. Message: MOON has come to color your dreams, STARS to make them musical and my SMS to give you warm and Peaceful Sleep. Good Night
- Prediction: ham. Message: Your B4U voucher w/c 27/03 is MARSMS. Log onto for discount credit. To opt out reply stop. Customer care call 08717168528
- Prediction: ham. Message: Adult 18 Content Your video will be with you shortly
- Prediction: ham. Message: Had your mobile 11mths ? Update for FREE to Oranges latest colour camera mobiles & unlimited weekend calls. Call Mobile Upd8 on freefone 08000839402 or 2StopTxt
- Prediction: ham. Message: Well, I have to leave for my class babe ... You never came back to me ... :-( ... Hope you have a nice sleep, my love
- # 5-fold cross-validated accuracy on the training set
- scores = cross_val_score(classifier, X_train, y_train, cv=5)
- print('Accuracy:', np.mean(scores), scores)
- out:
- Accuracy: 0.957646620634 [ 0.96052632 0.95933014 0.95454545 0.95095694 0.96287425]
- import numpy as np
- import pandas as pd
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.linear_model import LogisticRegression
- from sklearn.model_selection import train_test_split, cross_val_score
- df = pd.read_csv('mlslpic/sms.csv')
- X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
- vectorizer = TfidfVectorizer()
- X_train = vectorizer.fit_transform(X_train_raw)
- X_test = vectorizer.transform(X_test_raw)
- classifier = LogisticRegression()
- classifier.fit(X_train, y_train)
- scores = cross_val_score(classifier, X_train, y_train, cv=5)
- print('Accuracy:', np.mean(scores), scores)
- # scoring='precision' and scoring='recall' treat label 1 as the positive class
- precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
- print('Precision:', np.mean(precisions), precisions)
- recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
- print('Recall:', np.mean(recalls), recalls)
- out:
- Accuracy: 0.958373205742 [ 0.96291866 0.95334928 0.95813397 0.96172249 0.95574163]
- Precision: 0.99217372134 [ 0.9875 0.98571429 1. 1. 0.98765432]
- Recall: 0.672121212121 [ 0.71171171 0.62162162 0.66363636 0.63636364 0.72727273]
- # F1 score: the harmonic mean of precision and recall
- f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
- print('F1 score:', np.mean(f1s), f1s)
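The precision, recall, and F1 scores reported by `cross_val_score` can be reproduced by hand from counts of true positives, false positives, and false negatives. The sketch below uses ten invented labels (1 = spam, 0 = ham), not the SMS data, and compares the manual formulas against the functions in `sklearn.metrics`.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels: 1 = spam (positive class), 0 = ham
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that was missed

precision = tp / (tp + fp)   # share of flagged messages that really are spam
recall = tp / (tp + fn)      # share of spam messages that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

A classifier with high precision and low recall, like the one in the output above, rarely mislabels ham as spam but lets a sizeable share of spam through.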
- import matplotlib.pyplot as plt
- from sklearn.metrics import roc_curve, auc
- # Use the predicted probability of the positive class (column 1) as the score
- predictions = classifier.predict_proba(X_test)
- false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
- roc_auc = auc(false_positive_rate, recall)
- plt.title('Receiver Operating Characteristic')
- plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
- plt.legend(loc='lower right')
- plt.plot([0, 1], [0, 1], 'r--')  # diagonal: the performance of random guessing
- plt.xlim([0.0, 1.0])
- plt.ylim([0.0, 1.0])
- plt.ylabel('Recall')
- plt.xlabel('Fall-out')
- plt.show()
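To see what `roc_curve` and `auc` compute without plotting, the same calls can be run on a handful of scores. The four labels and scores below are made up for illustration; each point on the curve is the (fall-out, recall) pair obtained at one threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

false_positive_rate, recall, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(false_positive_rate, recall)

for fpr, tpr in zip(false_positive_rate, recall):
    print('fall-out = %.2f, recall = %.2f' % (fpr, tpr))
print('AUC = %.2f' % roc_auc)  # AUC = 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal of random guessing, and 1.0 to a classifier that ranks every positive instance above every negative one.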
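Grid search is a method for finding the best hyperparameter values. It takes a set of candidate values for each hyperparameter, then trains and evaluates a model for every combination and keeps the best one. Because the search is exhaustive, its drawback is cost: the number of combinations is the product of the number of candidates per hyperparameter, so the amount of computation is large even when each hyperparameter has only a few candidate values. However, the problem is easy to parallelize, because the combinations are independent of one another and the computations need no synchronization, so there are many ways to speed it up. scikit-learn provides the GridSearchCV class for this: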
- import pandas as pd
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.linear_model import LogisticRegression
- from sklearn.model_selection import GridSearchCV, train_test_split
- from sklearn.pipeline import Pipeline
- from sklearn.metrics import precision_score, recall_score, accuracy_score
- pipeline = Pipeline([
-     ('vect', TfidfVectorizer(stop_words='english')),
-     # the liblinear solver supports both the 'l1' and 'l2' penalties searched below
-     ('clf', LogisticRegression(solver='liblinear'))
- ])
- # Keys are '<pipeline step>__<parameter name>'; values are the candidates to try
- parameters = {
-     'vect__max_df': (0.25, 0.5, 0.75),
-     'vect__stop_words': ('english', None),
-     'vect__max_features': (2500, 5000, 10000, None),
-     'vect__ngram_range': ((1, 1), (1, 2)),
-     'vect__use_idf': (True, False),
-     'vect__norm': ('l1', 'l2'),
-     'clf__penalty': ('l1', 'l2'),
-     'clf__C': (0.01, 0.1, 1, 10),
- }
- grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
- df = pd.read_csv('mlslpic/sms.csv')
- X, y = df['message'], df['label']
- X_train, X_test, y_train, y_test = train_test_split(X, y)
- grid_search.fit(X_train, y_train)
- print('Best score: %0.3f' % grid_search.best_score_)
- print('Best parameter set:')
- best_parameters = grid_search.best_estimator_.get_params()
- for param_name in sorted(parameters.keys()):
-     print('\t%s: %r' % (param_name, best_parameters[param_name]))
- # Evaluate the best model found on the held-out test set
- predictions = grid_search.predict(X_test)
- print('Accuracy:', accuracy_score(y_test, predictions))
- print('Precision:', precision_score(y_test, predictions))
- print('Recall:', recall_score(y_test, predictions))
- out:
- [Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 1.8s
- [Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 10.1s
- [Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 27.4s
- [Parallel(n_jobs=-1)]: Done 450 jobs | elapsed: 54.2s
- [Parallel(n_jobs=-1)]: Done 800 jobs | elapsed: 1.6min
- [Parallel(n_jobs=-1)]: Done 1250 jobs | elapsed: 2.4min
- [Parallel(n_jobs=-1)]: Done 1800 jobs | elapsed: 3.4min
- [Parallel(n_jobs=-1)]: Done 2450 jobs | elapsed: 4.6min
- [Parallel(n_jobs=-1)]: Done 3200 jobs | elapsed: 6.0min
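To get a feel for how quickly an exhaustive search grows, the candidates can be counted before any model is trained. The sketch below copies the parameter grid from the listing above and counts the combinations with scikit-learn's `ParameterGrid`; the multiplication by 3 assumes the 3-fold cross-validation (`cv=3`) used in that listing.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the spam classification search above
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

# Number of combinations = product of the number of candidates per parameter
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * 3  # each candidate is trained once per fold

print(n_candidates)  # 1536
print(n_fits)        # 4608
```

Eight hyperparameters with at most four candidates each already require several thousand model fits, which is why the search is usually run in parallel with `n_jobs`.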
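We use phrases from movie reviews on the Rotten Tomatoes website to rate movies. Each phrase can be assigned to one of five classes: negative, somewhat negative, neutral, somewhat positive, and positive. The explanatory variables are not always straightforward language, because reviews vary widely and contain sarcasm, negation, and other constructions whose meaning is not literal.
The data set can be downloaded from Kaggle.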
- import zipfile
- import pandas as pd
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.linear_model import LogisticRegression
- from sklearn.model_selection import GridSearchCV, train_test_split
- from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
- from sklearn.pipeline import Pipeline
- pipeline = Pipeline([
-     ('vect', TfidfVectorizer(stop_words='english')),
-     ('clf', LogisticRegression())
- ])
- parameters = {
-     'vect__max_df': (0.25, 0.5),
-     'vect__ngram_range': ((1, 1), (1, 2)),
-     'vect__use_idf': (True, False),
-     'clf__C': (0.1, 1, 10),
- }
- # NOTE: the archive file name is truncated in the original notes, so the path is left elided
- z = zipfile.ZipFile(r'D:\每日工作\学习笔记\test\mlslpic\...')
- # Read the first file in the archive as a tab-separated table (call reconstructed from a garbled line)
- df = pd.read_csv(z.open(z.namelist()[0]), header=0, delimiter='\t')
- X, y = df['Phrase'], df['Sentiment'].to_numpy()
- X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
- # cv=3 matches the "3 folds" reported in the output below
- grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy', cv=3)
- grid_search.fit(X_train, y_train)
- print('Best score: %0.3f' % grid_search.best_score_)
- print('Best parameter set:')
- best_parameters = grid_search.best_estimator_.get_params()
- for param_name in sorted(parameters.keys()):
-     print('\t%s: %r' % (param_name, best_parameters[param_name]))
- out:
- Fitting 3 folds for each of 24 candidates, totalling 72 fits
- [Parallel(n_jobs=3)]: Done 44 tasks | elapsed: 1.2min
- [Parallel(n_jobs=3)]: Done 72 out of 72 | elapsed: 3.1min finished
- Best score: 0.618
- Best parameter set:
- clf__C: 10
- vect__max_df: 0.25
- vect__ngram_range: (1, 2)
- vect__use_idf: False
Evaluating multi-class classification performance
- predictions = grid_search.predict(X_test)
- print('Accuracy:', accuracy_score(y_test, predictions))
- # Rows are true classes, columns are predicted classes
- print('Confusion matrix:', confusion_matrix(y_test, predictions))
- print('Classification report:', classification_report(y_test, predictions))
- out:
- Accuracy: 0.63526848648
- Confusion matrix: [[ 1144  1747   597    74    11]
-  [  909  6011  6093   561    32]
-  [  228  3185 32607  3667   162]
-  [   23   399  6476  8203  1267]
-  [    2    40   479  2508  1605]]
- Classification report:
-              precision    recall  f1-score   support
-           0       0.50      0.32      0.39      3573
-           1       0.53      0.44      0.48     13606
-           2       0.70      0.82      0.76     39849
-           3       0.55      0.50      0.52     16368
-           4       0.52      0.35      0.42      4634
- avg / total       0.62      0.64      0.62     78030
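The accuracy and the per-class precision and recall in the report follow directly from the confusion matrix. As a check, the sketch below copies the matrix printed above into a NumPy array and recomputes them; it assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [1144, 1747,   597,   74,   11],
    [ 909, 6011,  6093,  561,   32],
    [ 228, 3185, 32607, 3667,  162],
    [  23,  399,  6476, 8203, 1267],
    [   2,   40,   479, 2508, 1605],
])

correct = np.diag(cm)                  # correctly classified phrases per class
accuracy = correct.sum() / cm.sum()    # correct predictions over all predictions
recall = correct / cm.sum(axis=1)      # divide by row sums (true class counts)
precision = correct / cm.sum(axis=0)   # divide by column sums (predicted class counts)

print('Accuracy: %.4f' % accuracy)     # Accuracy: 0.6353
print('Recall:', np.round(recall, 2))
print('Precision:', np.round(precision, 2))
```

The row sums reproduce the support column of the report, and the large off-diagonal entries in column 2 show that many phrases from the neighbouring classes are predicted as neutral.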
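In multi-label classification, each sample can be assigned a subset of the full set of labels. There are generally two approaches:
Problem transformation methods convert the multi-label problem into single-label problems.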
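As an illustration of problem transformation, the sketch below shows two transformations that are often used; the names "binary relevance" and "label powerset", as well as the toy label matrix, are assumptions for illustration and do not come from the text above. Binary relevance creates one binary target per label, while label powerset treats each distinct combination of labels as a single class.

```python
# Hypothetical label indicator matrix: 4 samples, 3 possible labels
# (1 means the sample carries that label)
Y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
n_labels = len(Y[0])

# Binary relevance: one single-label (binary) target vector per label,
# so one binary classifier could be trained for each column
binary_targets = [[row[j] for row in Y] for j in range(n_labels)]

# Label powerset: every distinct label combination becomes one class id,
# so a single multi-class classifier could be trained
combination_to_class = {}
powerset_targets = []
for row in Y:
    key = tuple(row)
    if key not in combination_to_class:
        combination_to_class[key] = len(combination_to_class)
    powerset_targets.append(combination_to_class[key])

print(binary_targets)    # [[1, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
print(powerset_targets)  # [0, 1, 0, 2]
```

Binary relevance ignores relationships between labels, whereas label powerset preserves them at the cost of a number of classes that can grow quickly with the number of labels.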
Two metrics for evaluating multi-label classifiers are Hamming loss and Jaccard similarity.
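Jaccard similarity, also called the Jaccard index, is the size of the intersection of the predicted labels and the true labels divided by the size of their union. Its value lies between 0 and 1: $J(\text{Predicted}, \text{True}) = \frac{|\text{Predicted} \cap \text{True}|}{|\text{Predicted} \cup \text{True}|}$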
- import numpy as np
- from sklearn.metrics import hamming_loss, jaccard_score
- # Hamming loss: the fraction of individual labels that are predicted incorrectly
- print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]])))
- print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]])))
- print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]])))
- # jaccard_score(..., average='samples') replaces jaccard_similarity_score, which newer scikit-learn releases no longer provide
- print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]]), average='samples'))
- print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]]), average='samples'))
- print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]]), average='samples'))
- out:
- 0.0
- 0.25
- 0.5
- 1.0
- 0.75
- 0.5
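The second pair of results above (0.25 and 0.75) can be reproduced by hand with NumPy. This is a minimal sketch of the two definitions, assuming that every sample has at least one true or predicted label so that the union is never empty.

```python
import numpy as np

y_true = np.array([[0, 1], [1, 1]])
y_pred = np.array([[1, 1], [1, 1]])

# Hamming loss: the fraction of label slots that differ (1 of 4 here)
hamming = np.mean(y_true != y_pred)

def jaccard(true_row, pred_row):
    """Size of the intersection over size of the union of two label sets."""
    intersection = np.sum((true_row == 1) & (pred_row == 1))
    union = np.sum((true_row == 1) | (pred_row == 1))
    return intersection / union

# Jaccard similarity is computed per sample and then averaged
per_sample = [jaccard(t, p) for t, p in zip(y_true, y_pred)]
mean_jaccard = np.mean(per_sample)

print(hamming)       # 0.25
print(per_sample)    # first sample 0.5, second sample 1.0
print(mean_jaccard)  # 0.75
```

For Hamming loss lower is better, with 0 meaning every label is correct; for Jaccard similarity higher is better, with 1 meaning the predicted and true label sets are identical.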