Apply logistic regression and a linear support vector machine to the forge dataset:
import numpy as np
import pandas as pd
import mglearn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = mglearn.datasets.make_forge()
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
for model, ax in zip([LinearSVC(), LogisticRegression()], axes):
    clf = model.fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5,
                                    ax=ax, alpha=.7)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title('{}'.format(clf.__class__.__name__))
    ax.set_xlabel('Feature 0')
    ax.set_ylabel('Feature 1')
axes[0].legend()
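Both models use L2 regularization by default. In both, the parameter that controls regularization strength is C: the larger C is, the weaker the regularization. Below, LinearSVC is used to show how the value of C affects the classification result: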
mglearn.plots.plot_linear_svc_regularization()
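The same effect can be checked numerically without the mglearn plotting helper. The following is a minimal sketch, assuming a small synthetic two-class dataset from make_blobs as a stand-in for forge; the names X_demo, y_demo, and coef_norms are illustrative and not part of the original code. It prints the norm of the learned weight vector for several values of C, which should grow as regularization weakens.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Synthetic, overlapping two-class data (an assumption, not the forge dataset)
X_demo, y_demo = make_blobs(n_samples=60, centers=2, cluster_std=2.5,
                            random_state=0)

coef_norms = {}
for C in [0.01, 1, 100]:
    svc = LinearSVC(C=C, max_iter=100000).fit(X_demo, y_demo)
    # The L2 norm of the weight vector measures how strongly it was shrunk
    coef_norms[C] = float(np.linalg.norm(svc.coef_))
    print('C={:<6} ||w||={:.3f} train accuracy={:.3f}'.format(
        C, coef_norms[C], svc.score(X_demo, y_demo)))
```

A small C keeps the weights close to zero, so the boundary reacts to the bulk of the data rather than to individual points; a large C lets the weights grow so the model can try to classify every training point correctly.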
Next, LogisticRegression is analyzed in more detail on the breast cancer dataset:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
logreg = LogisticRegression(max_iter=100000).fit(X_train, y_train)
print('Training set accuracy: {:.3f}'.format(logreg.score(X_train, y_train)))
print('Test set accuracy: {:.3f}'.format(logreg.score(X_test, y_test)))
Training set accuracy: 0.958
Test set accuracy: 0.958
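With the default C=1, accuracy on both the training set and the test set is about 96%, which is quite good. However, because the two scores are so close, the model may be underfitting. Let's increase C to get a more flexible model: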
logreg100 = LogisticRegression(C=100, max_iter=100000).fit(X_train, y_train)
print('Training set accuracy: {:.3f}'.format(logreg100.score(X_train, y_train)))
print('Test set accuracy: {:.3f}'.format(logreg100.score(X_test, y_test)))
Training set accuracy: 0.984
Test set accuracy: 0.965
Using C=100 raises the training set accuracy and also improves the test set accuracy slightly. What happens if C is made smaller instead?
logreg001 = LogisticRegression(C=0.01, max_iter=100000).fit(X_train, y_train)
print('Training set accuracy: {:.3f}'.format(logreg001.score(X_train, y_train)))
print('Test set accuracy: {:.3f}'.format(logreg001.score(X_test, y_test)))
Training set accuracy: 0.953
Test set accuracy: 0.951
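Reducing C makes an already underfit model underfit even more: accuracy drops on both the training set and the test set. Next, compare the coefficients learned by the model for the different values of C: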
plt.plot(logreg.coef_.T,'s',label='C=1')
plt.plot(logreg100.coef_.T,'>',label='C=100')
plt.plot(logreg001.coef_.T,'<',label='C=0.01')
plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation=90)
plt.hlines(0,0,cancer.data.shape[1])
plt.ylabel('Coefficient magnitude')
plt.xlabel('Coefficient index')
plt.ylim(-5,5)
plt.legend()
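As a numeric companion to the plot, the three fits can be summarized in one loop. This is a minimal sketch that assumes the same stratified split (random_state=42) as above; the results dictionary and its keys are illustrative names, not part of the original code, and the exact numbers may vary with the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

results = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=100000).fit(X_train, y_train)
    results[C] = {
        'train': model.score(X_train, y_train),
        'test': model.score(X_test, y_test),
        # Largest absolute coefficient: a rough indicator of how far the
        # weights were allowed to move away from zero
        'max_abs_coef': float(np.abs(model.coef_).max()),
    }
    print('C={:<6} train={:.3f} test={:.3f} max|coef|={:.3f}'.format(
        C, results[C]['train'], results[C]['test'],
        results[C]['max_abs_coef']))
```

Stronger regularization (smaller C) should show up as smaller coefficients, matching the points that sit closer to the zero line in the plot.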
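Because LogisticRegression uses L2 regularization by default, the relationship between a feature's coefficient and the predicted class is not absolute: the coefficient values depend on the choice of C. For a more interpretable model, L1 regularization is a better choice, because it constrains the model to use only a few features. The following fits the model with L1 regularization: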
for C, marker in zip([0.001, 1, 100], ['o', '^', 'v']):
    lr_l1 = LogisticRegression(penalty='l1', C=C, solver="liblinear",
                               max_iter=100000).fit(X_train, y_train)
    print('Training accuracy of l1 logreg with C={:.3f}:{:.2f}'.format(
        C, lr_l1.score(X_train, y_train)))
    print('Test accuracy of l1 logreg with C={:.3f}:{:.2f}'.format(
        C, lr_l1.score(X_test, y_test)))
    plt.plot(lr_l1.coef_.T, marker, label='C={:.3f}'.format(C))
plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0, 0, cancer.data.shape[1])
plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')
plt.ylim(-5, 5)
plt.legend()
Training accuracy of l1 logreg with C=0.001:0.91
Test accuracy of l1 logreg with C=0.001:0.92
Training accuracy of l1 logreg with C=1.000:0.96
Test accuracy of l1 logreg with C=1.000:0.96
Training accuracy of l1 logreg with C=100.000:0.99
Test accuracy of l1 logreg with C=100.000:0.98
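To make the "only a few features" point concrete, the nonzero coefficients can be counted directly. This is a minimal sketch assuming the same split as above; n_used is an illustrative name not found in the original code, and the exact counts may differ across scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

n_used = {}
for C in [0.001, 1, 100]:
    lr_l1 = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                               max_iter=100000).fit(X_train, y_train)
    # L1 regularization drives some coefficients to exactly zero
    mask = lr_l1.coef_.ravel() != 0
    n_used[C] = int(mask.sum())
    print('C={:.3f}: {} of {} features used'.format(C, n_used[C], mask.size))
    print('  ', list(cancer.feature_names[mask]))
```

With strong regularization the model should keep only a handful of features, which is what makes it easier to interpret than the L2 fits.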
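Linear models for multiclass classification
A common way to extend a binary classification algorithm to multiclass problems is the "one-vs.-rest" approach: one binary classifier is trained per class to separate that class from all the others. To make a prediction, all the binary classifiers are run on the test point; the class whose classifier gives the highest score wins, and its label is returned as the prediction. Below, the one-vs.-rest method is applied to a three-class dataset. The dataset is two-dimensional, and the points of each class are sampled from a Gaussian distribution: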
from sklearn.datasets import make_blobs
X,y = make_blobs(random_state=42)
mglearn.discrete_scatter(X[:,0],X[:,1],y)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0","Class 1","Class 2"])
linear_svm = LinearSVC().fit(X,y)
print('Coefficient shape:',linear_svm.coef_.shape)
print('Intercept shape:',linear_svm.intercept_.shape)
Coefficient shape: (3, 2)
Intercept shape: (3,)
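Each row of coef_ holds the coefficient vector for one of the three classes, and each column holds the coefficient value for one feature. intercept_ holds the intercept for each class. Let's visualize the lines given by the three binary classifiers: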
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
line = np.linspace(-15, 15)
for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_,
                                  ['b', 'r', 'g']):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)
plt.ylim(-10, 15)
plt.xlim(-10, 8)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0", "Class 1", "Class 2", 'Line class 0', 'Line class 1',
            'Line class 2'], loc=(1.01, 0.3))
mglearn.plots.plot_2d_classification(linear_svm, X, fill=True, alpha=.7)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
line = np.linspace(-15, 15)
for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_,
                                  ['b', 'r', 'g']):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0", "Class 1", "Class 2", 'Line class 0', 'Line class 1',
            'Line class 2'], loc=(1.01, 0.3))
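The "highest score wins" rule can be reproduced by hand from coef_ and intercept_. This is a minimal sketch assuming the same make_blobs data and a default LinearSVC (which uses one-vs.-rest); scores and manual_pred are illustrative names not found in the original code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)
linear_svm = LinearSVC().fit(X, y)

# One score per class for every sample: w_k . x + b_k
scores = X @ linear_svm.coef_.T + linear_svm.intercept_
# The predicted class is the one whose binary classifier scores highest
manual_pred = linear_svm.classes_[np.argmax(scores, axis=1)]

print('Score matrix shape:', scores.shape)
print('Matches decision_function:',
      np.allclose(scores, linear_svm.decision_function(X)))
print('Matches predict:',
      np.array_equal(manual_pred, linear_svm.predict(X)))
```

This also explains the region in the middle of the plot, where all three binary classifiers say "rest": the point is still assigned to the class with the largest score, that is, the class whose line it is closest to.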
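Strengths and weaknesses of linear models
Linear models are fast to train and fast to predict. They scale to very large datasets and also work well with sparse data. They tend to perform particularly well when the number of features is larger than the number of samples. In lower-dimensional spaces, however, other models may generalize better.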
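As an illustration of the sparse, high-dimensional case, the sketch below fits a linear model directly on a SciPy sparse matrix with more features than samples. The data is entirely synthetic and hypothetical (random sparse features with labels generated from a random linear rule); X_sparse, y_sparse, and true_w are illustrative names.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Hypothetical data: 200 samples, 1000 features, about 1% nonzero entries
X_sparse = sparse.random(200, 1000, density=0.01, format='csr',
                         random_state=rng)
true_w = rng.normal(size=1000)
# Labels follow a linear rule, so a linear model is a reasonable fit
y_sparse = (X_sparse @ true_w > 0).astype(int)

# scikit-learn linear models accept sparse input without densifying it
clf = LogisticRegression(max_iter=1000).fit(X_sparse, y_sparse)
print('Stored nonzero entries:', X_sparse.nnz, 'of', 200 * 1000)
print('Training accuracy: {:.3f}'.format(clf.score(X_sparse, y_sparse)))
```

Only the nonzero entries are stored and processed, which is one reason linear models remain practical when the feature count is very large.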