[scikit-learn Machine Learning] 6. Logistic Regression

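Table of contents

    • 1. Binary classification with logistic regression
    • 2. Spam filtering
      • 2.1 Performance metrics
      • 2.2 Accuracy
      • 2.3 Precision and recall
      • 2.4 F1 score
      • 2.5 ROC and AUC
    • 3. Hyperparameter tuning with grid search
    • 4. Multi-class classification
    • 5. Multi-label classification
      • 5.1 Performance metrics for multi-label classification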

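This post is a set of study notes on the book "scikit-learn 机器学习" (scikit-learn Machine Learning, 2nd edition).

Logistic regression is commonly used for classification tasks.

1. Binary classification with logistic regression

The logistic regression (LR) model, as defined in 《统计学习方法》 (Statistical Learning Methods):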

Definition: let $X$ be a continuous random variable. $X$ follows a logistic distribution if it has the following distribution function and density function:

$$F(x) = P(X \leq x) = \frac{1}{1+e^{-(x-\mu)/\gamma}}$$

$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1+e^{-(x-\mu)/\gamma}\right)^2}$$

Here $\mu$ is the location parameter and $\gamma > 0$ is the shape parameter.


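In logistic regression, an instance is predicted as the positive class when its predicted probability is >= the threshold, and as the negative class otherwise.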
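
The thresholding rule can be sketched in a few lines. This is only an illustration, not code from the book: the weights `w`, intercept `b`, the sample matrix `X_demo`, and the 0.5 threshold are all made-up assumptions. It applies the standard logistic function (the distribution function above with μ = 0 and γ = 1) to a linear score and compares the resulting probability with the threshold.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function: F(x) with mu = 0 and gamma = 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model parameters and inputs, chosen only for illustration
w = np.array([1.5, -2.0])
b = 0.25
X_demo = np.array([[2.0, 0.5],
                   [0.0, 1.0],
                   [1.0, 1.0]])

threshold = 0.5
proba = sigmoid(X_demo @ w + b)             # predicted probability of the positive class
labels = (proba >= threshold).astype(int)   # 1 = positive class, 0 = negative class
print(proba)
print(labels)
```

Raising the threshold makes the classifier more reluctant to predict the positive class, which typically trades recall for precision.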

2. Spam filtering

Extract TF-IDF features from the messages and classify them with logistic regression.

import pandas as pd
data = pd.read_csv("SMSSpamCollection", delimiter='\t',header=None)
data


data[data[0]=='ham'][0].count() # 4825 ham (legitimate) messages
data[data[0]=='spam'][0].count() # 747 spam messages
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X = data[1].values
y = data[0].values
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
y = lb.fit_transform(y)

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=520)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

pred = classifier.predict(X_test)
for i, pred_i in enumerate(pred[:5]):
    print("Predicted: %s, message: %s, actual: %s" % (pred_i, X_test_raw[i], y_test[i]))
Predicted: 0, message: Aww that's the first time u said u missed me without asking if I missed u first. You DO love me! :), actual: [0]
Predicted: 0, message: Poor girl can't go one day lmao, actual: [0]
Predicted: 0, message: Also remember the beads don't come off. Ever., actual: [0]
Predicted: 0, message: I see the letter B on my car, actual: [0]
Predicted: 0, message: My love ! How come it took you so long to leave for Zaher's? I got your words on ym and was happy to see them but was sad you had left. I miss you, actual: [0]

2.1 Performance metrics

Confusion matrix

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Store the result under a separate name so the imported function is not shadowed
cm = confusion_matrix(y_test, pred)
plt.matshow(cm)
plt.title("Confusion matrix")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.colorbar()

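
For a binary problem, the four cells of the confusion matrix can be read off directly. The sketch below uses made-up toy labels (`y_true_demo` and `y_pred_demo` are hypothetical, not the SMS data) to show the layout scikit-learn uses: rows are true classes, columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 1 = spam, 0 = ham
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
```

The metrics in the following sections are all ratios of these four counts.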

2.2 Accuracy

scores = cross_val_score(classifier, X_train, y_train, cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))
Accuracies: [0.94976077 0.95933014 0.96650718 0.95215311 0.95688623]
Mean accuracy: 0.9569274847434318

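Accuracy is not a very suitable metric here: it does not distinguish between the two kinds of error, positives predicted as negative (false negatives) and negatives predicted as positive (false positives).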
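
A quick way to see the problem is to score a classifier that never predicts spam. The sketch below rebuilds a label vector from the class counts reported earlier (4825 ham, 747 spam); the always-ham predictor `y_all_ham` is a hypothetical baseline, not something from the book.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the class counts above: 0 = ham, 1 = spam
y_counts = np.array([0] * 4825 + [1] * 747)
y_all_ham = np.zeros_like(y_counts)   # hypothetical classifier that always predicts ham

acc = accuracy_score(y_counts, y_all_ham)
rec = recall_score(y_counts, y_all_ham)
print(acc)   # roughly 0.866
print(rec)   # no spam message is caught
```

Accuracy stays high simply because ham dominates the data, even though this baseline is useless as a spam filter.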

2.3 Precision and recall

For background, see [Hands On ML] 3. Classification (MNIST handwritten digit prediction).

Looking at precision or recall alone is not meaningful.

from sklearn.metrics import precision_score, recall_score, f1_score
precisions = precision_score(y_test, pred)
print('Precision: %s' % precisions)
recalls = recall_score(y_test, pred)
print('Recall: %s' % recalls)
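Precision: 0.9852941176470589
Recall: 0.6979166666666666

Almost every message predicted as spam really is spam (high precision), but about 30% of the spam messages were predicted as non-spam (low recall).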

2.4 F1 score

The F1 score is the harmonic mean of precision and recall, balancing the two.

f1s = f1_score(y_test, pred)
print('F1 score: %s' % f1s)
# F1 score: 0.8170731707317074
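
As a check on the definition, F1 can be recomputed by hand from the precision and recall reported above. This is a small worked example; the variable names `p`, `r`, and `f1_manual` are chosen here for illustration.

```python
# Precision and recall as reported in section 2.3
p = 0.9852941176470589
r = 0.6979166666666666

# Harmonic mean of precision and recall
f1_manual = 2 * p * r / (p + r)
print(f1_manual)   # roughly 0.817
```

Because the harmonic mean is pulled toward the smaller value, the low recall keeps F1 well below the precision.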

2.5 ROC and AUC

  • The closer a classifier's AUC is to 1, the better; a random classifier has an AUC of 0.5
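from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Use the predicted probability of the positive class rather than hard 0/1 labels,
# so the curve is traced over many thresholds instead of a single operating point
y_score = classifier.predict_proba(X_test)[:, 1]
false_positive_rate, recall, thresholds = roc_curve(y_test, y_score)
roc_auc = roc_auc_score(y_test, y_score)  # separate name avoids shadowing the function

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal = random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()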

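
AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on four made-up examples (`y_true_demo` and `y_score_demo` are hypothetical). Note that it passes continuous scores to `roc_curve`; hard 0/1 predictions would give a curve with only one interior point.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)
auc_demo = roc_auc_score(y_true_demo, y_score_demo)

# Same quantity by brute force: fraction of (positive, negative) pairs ranked correctly
pos = [s for s, t in zip(y_score_demo, y_true_demo) if t == 1]
neg = [s for s, t in zip(y_score_demo, y_true_demo) if t == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(auc_demo, pairwise)
```

Three of the four positive/negative pairs are ordered correctly, so both numbers come out to 0.75.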

3. Hyperparameter tuning with grid search

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score


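pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear supports both the 'l1' and 'l2' penalties searched below;
    # the default lbfgs solver in recent scikit-learn versions rejects 'l1'
    ('clf', LogisticRegression(solver='liblinear'))
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75), # keys take the form <step name>__<parameter name>
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}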

if __name__ == "__main__":
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    X = df[1].values
    y = df[0].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)
    
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
        
    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    print('Precision: %s' % precision_score(y_test, predictions))
    print('Recall: %s' % recall_score(y_test, predictions))
Best score: 0.985
Best parameters set:
	clf__C: 10
	clf__penalty: 'l2'
	vect__max_df: 0.5
	vect__max_features: 5000
	vect__ngram_range: (1, 2)
	vect__stop_words: None
	vect__use_idf: True
Accuracy: 0.9791816223977028
Precision: 1.0
Recall: 0.8605769230769231

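After tuning the parameters, recall improved (about 0.86 here versus about 0.70 before). The two runs use different train/test splits, so the comparison is only approximate.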
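
Grid search tries every combination of the listed values, so the cost grows multiplicatively. As a rough sanity check, `ParameterGrid` can count the candidates before anything is fitted. The dictionary below is a copy of the grid above under the hypothetical name `parameters_demo`.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the search grid from this section
parameters_demo = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

n_candidates = len(ParameterGrid(parameters_demo))   # 3 * 2 * 3 * 2 * 2 * 2 * 4
n_fits = n_candidates * 3                            # one fit per candidate per fold (cv=3)
print(n_candidates, n_fits)
```

With 576 candidates and 3 folds, the search performs 1728 fits, which is why `n_jobs=-1` is worth setting.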

4. Multi-class classification

Predicting sentiment ratings for movie review phrases

data = pd.read_csv("./chapter5_movie_train.csv",header=0,delimiter='\t')
data


data['Sentiment'].describe()
count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64

On average, the sentiment is fairly neutral (the mean is close to 2).

data["Sentiment"].value_counts()/data["Sentiment"].count()
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64

About 50% of the examples carry the neutral label (2).
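
This imbalance sets a floor for any model: always predicting the neutral class already scores about 51% accuracy. The sketch below illustrates that baseline with `DummyClassifier` on a hypothetical sample of 100 labels whose proportions roughly follow the distribution above; it is not run on the actual dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical sample of 100 labels approximating the reported class proportions
y_demo = np.array([2] * 51 + [3] * 21 + [1] * 17 + [4] * 6 + [0] * 5)
X_dummy = np.zeros((len(y_demo), 1))   # DummyClassifier ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_demo)
baseline_acc = baseline.score(X_dummy, y_demo)
print(baseline_acc)
```

Any accuracy reported for a real model on this data should be judged against that floor rather than against 20% (uniform guessing over five classes).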

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))
Best score: 0.619
Best parameters set:
	clf__C: 10
	vect__max_df: 0.25
	vect__ngram_range: (1, 2)
	vect__use_idf: False
  • Performance metrics
predictions = grid_search.predict(X_test)

print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))
Accuracy: 0.6292323465333846
Confusion Matrix:
[[ 1013  1742   682   106    11]
 [  794  5914  6275   637    49]
 [  196  3207 32397  3686   222]
 [   28   488  6513  8131  1299]
 [    1    59   548  2388  1644]]
Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.29      0.36      3554
           1       0.52      0.43      0.47     13669
           2       0.70      0.82      0.75     39708
           3       0.54      0.49      0.52     16459
           4       0.51      0.35      0.42      4640

    accuracy                           0.63     78030
   macro avg       0.55      0.48      0.50     78030
weighted avg       0.61      0.63      0.62     78030
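
The per-class numbers in the classification report can be recovered from the confusion matrix: recall divides each diagonal entry by its row sum, precision divides it by its column sum. The sketch below copies the matrix from the output above into a hypothetical array `cm_movie` and recomputes them.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true class, columns = predicted class)
cm_movie = np.array([
    [1013,  1742,   682,   106,    11],
    [ 794,  5914,  6275,   637,    49],
    [ 196,  3207, 32397,  3686,   222],
    [  28,   488,  6513,  8131,  1299],
    [   1,    59,   548,  2388,  1644],
])

recall_per_class = np.diag(cm_movie) / cm_movie.sum(axis=1)      # diagonal / row sums
precision_per_class = np.diag(cm_movie) / cm_movie.sum(axis=0)   # diagonal / column sums
accuracy = np.trace(cm_movie) / cm_movie.sum()

print(np.round(precision_per_class, 2))
print(np.round(recall_per_class, 2))
print(accuracy)
```

The largest off-diagonal counts sit next to the diagonal (for example, 6275 class-1 phrases predicted as class 2), so most errors confuse adjacent sentiment levels.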

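5. Multi-label classification

  • An instance can be assigned multiple labels

Problem transformation:

  • Convert the set of labels an instance carries (say L1 and L2) into a single combined class (L1 and L2), and so on. Drawbacks: this produces a large number of classes, and the model can only predict label combinations that appear in the training data, so many possible combinations may not be covered
  • Train one binary classifier per label (is this instance L1? is it L2?). Drawback: this ignores relationships between labels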
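
Both transformations can be sketched with standard scikit-learn pieces. The example below is an assumption-laden illustration rather than the book's code: the documents, the label names, and the choice of `OneVsRestClassifier` for the one-classifier-per-label approach are all picked here for demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy corpus; each document may carry several labels
docs = [
    "the team won the match",
    "parliament passed the budget",
    "minister attends the cup final",
    "striker scores twice",
    "election results announced",
    "government funds new stadium",
]
labels = [["sports"], ["politics"], ["politics", "sports"],
          ["sports"], ["politics"], ["politics", "sports"]]

# Transformation 1: each distinct label combination becomes one class
powerset_y = ["+".join(sorted(l)) for l in labels]
print(sorted(set(powerset_y)))

# Transformation 2: one binary classifier per label, trained on an indicator matrix
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)          # columns follow mlb.classes_
model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression()))
model.fit(docs, Y)
pred = model.predict(["the minister watched the match"])
print(mlb.classes_, pred)
```

With only two labels the combined-class approach already needs three classes; with more labels the number of possible combinations grows exponentially, which is the drawback noted in the first bullet.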

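5.1 Performance metrics for multi-label classification

  • Hamming loss: the average fraction of incorrect labels; 0 is best
  • Jaccard similarity coefficient: the size of the intersection of the predicted and true labels divided by the size of their union; 1 is best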
from sklearn.metrics import hamming_loss, jaccard_score
# help(jaccard_score)

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]])))

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]])))

print(hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]])))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[0.0, 1.0], [1.0, 1.0]]),average=None))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [1.0, 1.0]]),average=None))

print(jaccard_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.array([[1.0, 1.0], [0.0, 1.0]]),average=None))
0.0
0.25
0.5
[1. 1.]
[0.5 1. ]
[0. 1.]
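
The last pair of outputs (0.5 and [0. 1.]) can be reproduced by hand. In this sketch the names `y_true_demo` and `y_pred_demo` are chosen for illustration; with `average=None`, the Jaccard score is computed separately for each label column.

```python
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score

y_true_demo = np.array([[0, 1], [1, 1]])
y_pred_demo = np.array([[1, 1], [0, 1]])

# Hamming loss: fraction of label cells that differ (2 of 4 here)
manual_hamming = np.mean(y_true_demo != y_pred_demo)

# Jaccard per label column: |intersection| / |union| of the positive entries
intersection = np.logical_and(y_true_demo, y_pred_demo).sum(axis=0)
union = np.logical_or(y_true_demo, y_pred_demo).sum(axis=0)
manual_jaccard = intersection / union

print(manual_hamming, hamming_loss(y_true_demo, y_pred_demo))
print(manual_jaccard, jaccard_score(y_true_demo, y_pred_demo, average=None))
```

The first label is wrong for both instances, so its Jaccard score is 0, while the second label is predicted perfectly and scores 1.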
