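Machine Learning Basics: Linear Models

Linear Models for Regression

Linear Regression

What Is Linear Regression?

The general model for a regression problem is:
$$y = \sum w[i]*x[i]+b$$
As the figure below shows, for one-dimensional data, linear regression fits a straight line $$y=ax+b$$ to the given points $(x_i,y_i)$, that is, it solves for the coefficients a and b.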

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import mglearn
import warnings
warnings.filterwarnings('ignore')

mglearn.plots.plot_linear_regression_wave()

w[0]: 0.393906  b: -0.031804

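Extended to multidimensional data, training a linear regression model means finding the parameter vector w; the only difference is that the fitted object becomes a hyperplane. The two most common ways to fit a linear regression are ordinary least squares (OLS) and gradient descent. In Python, two packages are available: sklearn, the standard machine learning package, and statsmodels, which leans more toward statistics. First, we train a model with sklearn's LinearRegression: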

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X,y = mglearn.datasets.make_wave(n_samples=60) # load the data

# Split the dataset; the same random_state always produces the same split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)
# Standard sklearn-style API: Model().fit(X, y)
lr = LinearRegression().fit(X_train,y_train)

print('Coefficients: {}'.format(lr.coef_))
print('Intercept: {}'.format(lr.intercept_))
print('Training score: {}'.format(lr.score(X_train,y_train)))
print('Test score: {}'.format(lr.score(X_test,y_test)))

Coefficients: [0.39390555]
Intercept: -0.031804343026759746
Training score: 0.6700890315075756
Test score: 0.65933685968637
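sklearn fits the model with OLS, and score returns the coefficient of determination ($R^2$). The test-set $R^2$ is only about 0.66, which is not a good result. This is because the original data is one-dimensional; as the number of dimensions grows, linear models can become very powerful. Next, we train the same model with statsmodels: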

import statsmodels.api as sm

# Add a constant term; without it the fitted line is forced through the origin
x = sm.add_constant(X_train)
# Train the model
ols = sm.OLS(y_train,x).fit()
# Print the statistical report
print(ols.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.670
Model:                            OLS   Adj. R-squared:                  0.662
Method:                 Least Squares   F-statistic:                     87.34
Date:                Sun, 22 Sep 2019   Prob (F-statistic):           6.46e-12
Time:                        21:54:35   Log-Likelihood:                -33.187
No. Observations:                  45   AIC:                             70.37
Df Residuals:                      43   BIC:                             73.99
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0318      0.078     -0.407      0.686      -0.190       0.126
x1             0.3939      0.042      9.345      0.000       0.309       0.479
==============================================================================
Omnibus:                        0.703   Durbin-Watson:                   2.369
Prob(Omnibus):                  0.704   Jarque-Bera (JB):                0.740
Skew:                          -0.081   Prob(JB):                        0.691
Kurtosis:                       2.393   Cond. No.                         1.90
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

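summary prints a table in the style of statistical packages such as EViews or Minitab. The coefficient of determination R-squared is 0.67, the same as the sklearn result. The model's F-statistic and the t value of the slope x1 both indicate a significant result; the constant term, with P>|t| = 0.686, is not significantly different from zero.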

Ordinary Least Squares (OLS)

Least squares rests on one goal: make the deviation between the actual values $y_i$ and the predicted values $\hat{y}_i$ as small as possible, that is, minimize the loss function. OLS uses the mean squared error (MSE) as the loss function, so the optimization objective is $$\min\quad MSE=\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$For one-dimensional data, $$MSE=\frac 1n\sum_{i=1}^n(y_i-\theta_0-\theta_1x_i)^2$$To find the minimum, set the partial derivatives to zero: $$\frac {\partial MSE}{\partial\theta_0}=-\frac 2n\sum_{i=1}^n(y_i-\theta_0-\theta_1x_i)=0$$,$$\frac {\partial MSE}{\partial\theta_1}=-\frac 2n\sum_{i=1}^n(y_i-\theta_0-\theta_1x_i)x_i=0$$Solving the two equations together gives $$\theta_1=\frac {\sum(y_i-\bar y)(x_i-\bar x)}{\sum(x_i-\bar x)^2}$$,$$\theta_0=\bar y -\theta_1 \bar x$$
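The closed-form estimates for the one-dimensional case can be checked numerically. The following is a minimal sketch on synthetic data; the data, the seed, and the variable names are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Closed-form estimates obtained from the two partial-derivative equations
x_bar, y_bar = x.mean(), y.mean()
theta_1 = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
theta_0 = y_bar - theta_1 * x_bar

# Cross-check against NumPy's degree-1 least-squares polynomial fit
slope, intercept = np.polyfit(x, y, deg=1)
print(theta_1, theta_0)
print(slope, intercept)
```

Both lines should print the same slope and intercept up to floating-point error.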

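Multiple regression works the same way. In matrix form, the loss function is defined as $$J(\mathbf\theta) = \frac{1}{2}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y})$$(the constant factor differs from the $\frac 1n$ in the MSE but does not change the minimizer). Taking the derivative and setting it to zero, $$\frac{\partial}{\partial\mathbf\theta}J(\mathbf\theta) = \mathbf{X}^T(\mathbf{X\theta} - \mathbf{Y}) = 0$$which finally gives $$\mathbf{\theta} = (\mathbf{X^{T}X})^{-1}\mathbf{X^{T}Y}$$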
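The matrix solution translates almost directly into NumPy. This sketch assumes synthetic data and a hypothetical coefficient vector `true_theta`; it solves the linear system instead of forming the inverse explicitly, which is a numerical choice made here and not something the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 3
X_raw = rng.normal(size=(n_samples, n_features))
true_theta = np.array([0.5, 1.0, -2.0, 3.0])  # intercept first (hypothetical values)

# Prepend a column of ones so that the intercept becomes part of theta
X = np.column_stack([np.ones(n_samples), X_raw])
Y = X @ true_theta + rng.normal(scale=0.1, size=n_samples)

# Solve (X^T X) theta = X^T Y rather than computing the inverse
theta = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta)
print(theta_lstsq)
```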

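The least squares principle minimizes the MSE by differentiation to obtain the parameters $\theta$. Next we introduce another method, gradient descent.

Gradient Descent

Gradient descent is an iterative numerical method that is well suited to computer programming.

As the figure shows, the loss function is a function of the coefficients. Simple linear regression has two parameters, which together span the loss surface. We first pick a set of coefficients at random, that is, choose a random starting point on that surface, and then apply the following two updates simultaneously $$\theta_0 = \theta_0-\alpha\frac{\partial J(\theta)}{\partial \theta_0}$$$$\theta_1 = \theta_1-\alpha\frac{\partial J(\theta)}{\partial \theta_1}$$where "=" denotes assignment. Repeat this step until a stopping criterion is met.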

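Let us analyze what happens. First, the factor $\alpha$ in front of the partial derivative is positive. When the partial derivative is greater than zero, $J(\theta)$ increases as $\theta_i$ increases; the new $\theta_i$ is smaller than the old one, so $J(\theta)$ decreases. When the partial derivative is less than zero, $J(\theta)$ decreases as $\theta_i$ increases; the new $\theta_i$ is larger than the old one, so $J(\theta)$ again decreases. Hence, provided the step size $\alpha$ is small enough, the loss function decreases at every iteration and eventually reaches a local minimum, as shown in the figure above.

Our loss function is convex rather than a surface with several local minima; its true shape is shown below, and the local minimum is the global minimum.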

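Algorithm steps:

Choose the loss function
Initialize the coefficients and the step size
Update the coefficients
Repeat the update step until a stopping criterion is met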
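The steps above can be sketched for linear regression with the MSE loss. The function name `batch_gradient_descent`, the fixed iteration count used as the stopping rule, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def batch_gradient_descent(X, Y, alpha=0.1, n_iters=2000):
    """Minimize the MSE of a linear model with plain batch gradient descent."""
    n = X.shape[0]
    theta = np.zeros(X.shape[1])                  # initialize the coefficients
    for _ in range(n_iters):                      # stop after a fixed number of updates
        grad = (2.0 / n) * X.T @ (X @ theta - Y)  # gradient of the MSE
        theta = theta - alpha * grad              # update all coefficients at once
    return theta

# Hypothetical data: y = 1 + 2x plus noise, with a column of ones for the intercept
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
X = np.column_stack([np.ones_like(x), x])
Y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

theta_gd = batch_gradient_descent(X, Y)
theta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_gd)
print(theta_ols)
```

With a step size this small relative to the curvature of the loss, the iterates should settle on the same coefficients as the least-squares solution; a much larger `alpha` would make the updates diverge.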

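The Gradient Descent Family

Batch gradient descent, the method described above: every update uses all of the data to compute the loss function and its gradient.
Stochastic gradient descent (SGD): every update computes the gradient from a single randomly chosen sample.
Mini-batch gradient descent: every update computes the gradient from a subset of the data.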
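The three variants differ only in how many samples feed each gradient estimate, so one function with a `batch_size` argument can cover all of them. This is a minimal sketch: the function name, the per-epoch reshuffling, the step sizes, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def minibatch_gradient_descent(X, Y, batch_size, alpha=0.01, n_epochs=200, seed=0):
    """Gradient descent on the MSE using batch_size samples per update."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            grad = (2.0 / len(idx)) * Xb.T @ (Xb @ theta - Yb)
            theta = theta - alpha * grad
    return theta

# Hypothetical data: y = 1 + 2x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
X = np.column_stack([np.ones_like(x), x])
Y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=200)

theta_batch = minibatch_gradient_descent(X, Y, batch_size=len(Y), alpha=0.1)  # all data
theta_mini = minibatch_gradient_descent(X, Y, batch_size=16)                   # subset
theta_sgd = minibatch_gradient_descent(X, Y, batch_size=1)                     # one sample
print(theta_batch, theta_mini, theta_sgd)
```

The full-batch call uses a larger step size because it performs only one update per epoch; the stochastic variants make many noisy updates per epoch and hover near the optimum instead of converging exactly.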

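Generalizing Linear Regression: Polynomial Regression

In simple linear regression, when the dependent variable y is not linearly related to x, linear regression cannot be applied directly. By Taylor's theorem:
$$f(x)=f(a)+f'(a)(x-a)+\frac{f''(a)}{2!}(x-a)^2+\cdots+\frac{f^{(n)}(a)}{n!}(x-a)^n+R_n(x)$$
Setting a=0 shows that y can be approximated by a linear combination of $x,x^2,x^3...$. We can therefore treat each $x^n$ as an additional variable, turning the simple linear regression into a multiple linear regression and thereby improving the model's accuracy.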
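In sklearn, the powers of x can be generated with PolynomialFeatures and passed to LinearRegression. The sketch below assumes a synthetic sine-shaped target and degree 5; both are illustrative choices, not values from the original text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical nonlinear data: y = sin(x) plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

# Plain linear regression on x alone
linear = LinearRegression().fit(X, y)

# Linear regression on x, x^2, ..., x^5 treated as separate features
poly = make_pipeline(PolynomialFeatures(degree=5, include_bias=False),
                     LinearRegression()).fit(X, y)

linear_score = linear.score(X, y)
poly_score = poly.score(X, y)
print('Linear R^2: {}'.format(linear_score))
print('Polynomial R^2: {}'.format(poly_score))
```

The scores here are computed on the training data; a high polynomial degree can overfit, so a held-out test set is the fairer comparison in practice.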

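Generalized Linear Regression

Here a logarithm transforms variables that are not linearly related into an approximately linear relationship, so that linear regression can be applied: $$\ln\mathbf{Y} = \mathbf{X\theta}$$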
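As a minimal sketch of the log transform, assume a positive target that grows exponentially with x. The data-generating coefficients 0.5 and 1.5 are hypothetical and serve only to check that the fit recovers them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with an exponential trend: ln(y) = 0.5 + 1.5 x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = np.exp(0.5 + 1.5 * X[:, 0] + rng.normal(scale=0.05, size=200))

# Fit a linear model to ln(y) instead of y
log_model = LinearRegression().fit(X, np.log(y))

# Map predictions back to the original scale with the exponential
y_pred = np.exp(log_model.predict(X))
print(log_model.intercept_, log_model.coef_)
```

The transform requires strictly positive targets, and the back-transformed prediction estimates the conditional median of y, not its mean, under this noise model.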

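Ridge Regression (Ridge)

Ridge regression applies L2 regularization to the regression model: its penalty term is the squared L2 norm of the coefficients, and the penalty coefficient is positive, corresponding to the parameter alpha of sklearn.linear_model.Ridge. Increasing alpha pushes the coefficients toward 0, which lowers training-set performance and is one way to counter overfitting. Like LinearRegression, Ridge in sklearn solves a least-squares problem, here with the penalty term added.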
$$J(\mathbf\theta) = \frac{1}{2}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \frac{1}{2}\alpha||\theta||_2^2$$
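Setting the gradient of this loss to zero gives $(\mathbf{X^TX}+\alpha\mathbf{I})\mathbf\theta=\mathbf{X^TY}$, which can be compared with sklearn directly. The sketch assumes synthetic data and omits the intercept (`fit_intercept=False`) so that both sides solve the same problem.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data with five features and known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
Y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.3, size=80)

alpha = 1.0
# Closed-form ridge solution: (X^T X + alpha * I) theta = X^T Y
theta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

# The same problem solved by sklearn, without an intercept term
theta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, Y).coef_
print(theta_closed)
print(theta_sklearn)
```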

# Ridge regression in sklearn
from sklearn.linear_model import Ridge

X,y = mglearn.datasets.load_extended_boston()
print('Data shape: {}'.format(X.shape))
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state = 0)

ridge = Ridge(alpha=1).fit(X_train,y_train)
LR = LinearRegression().fit(X_train,y_train)

print('Linear regression score ([train, test]): {}'.format([LR.score(X_train,y_train),LR.score(X_test,y_test)]))
print('Ridge regression score ([train, test]): {}'.format([ridge.score(X_train,y_train),ridge.score(X_test,y_test)]))

Data shape: (506, 104)
Linear regression score ([train, test]): [0.9520519609032729, 0.6074721959665752]
Ridge regression score ([train, test]): [0.885796658517094, 0.7527683481744755]

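The extended Boston dataset has 104 features but only 506 samples, so linear regression overfits very noticeably. The ridge model has a lower training score than linear regression but a higher test score.

Overfitting can also be addressed by adding more data. As the figure below shows, when the amount of data grows, the test score of linear regression approaches that of ridge.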

mglearn.plots.plot_ridge_n_samples()

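Lasso

Lasso uses L1 regularization: its penalty term is the L1 norm. Unlike ridge, it can drive the coefficients of some features to exactly 0, which makes the model easier to interpret and reveals which features matter most. Because the absolute value is not differentiable at 0, neither the closed-form OLS solution nor plain gradient descent applies directly; sklearn's Lasso is fitted with coordinate descent instead. $$J(\mathbf\theta) = \frac{1}{2n}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \alpha||\theta||_1$$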
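The sparsity effect is easy to observe by counting nonzero coefficients for several values of alpha. This sketch uses a synthetic dataset from make_regression in which only 5 of 30 features are informative; the dataset and the alpha grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Hypothetical data: 30 features, only 5 of which influence the target
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

n_nonzero = {}
for alpha in [0.01, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=100000).fit(X, y)
    # Count the coefficients that the L1 penalty has not set to exactly zero
    n_nonzero[alpha] = int(np.sum(model.coef_ != 0))
    print('alpha={}: {} nonzero coefficients'.format(alpha, n_nonzero[alpha]))
```

A stronger penalty should leave fewer features in the model, with the informative ones surviving longest.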

# Lasso in sklearn

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1,max_iter=1000000).fit(X_train,y_train)
print('Training score: {}'.format(lasso.score(X_train,y_train)))
print('Test score: {}'.format(lasso.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso.coef_ !=0)))

Training score: 0.7709955157630054
Test score: 0.6302009976110041
Number of features used: 8

ElasticNet

ElasticNet regularizes with both the L1 and L2 norms: $$J(\mathbf\theta) = \frac{1}{2n}(\mathbf{X\theta} - \mathbf{Y})^T(\mathbf{X\theta} - \mathbf{Y}) + \alpha\rho||\theta||_1 + \frac{\alpha(1-\rho)}{2}||\theta||_2^2$$
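In sklearn's ElasticNet, $\alpha$ corresponds to the parameter `alpha` and $\rho$ to `l1_ratio`. The following sketch fits it on a synthetic dataset; the dataset and the parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Hypothetical data: 20 features, 5 of them informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# alpha sets the overall penalty strength, l1_ratio the share of the L1 part
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=100000).fit(X, y)

enet_score = enet.score(X, y)
print('Training score: {}'.format(enet_score))
print('Number of features used: {}'.format(np.sum(enet.coef_ != 0)))
```

Setting `l1_ratio=1` reduces the penalty to the Lasso form, while values close to 0 make it behave like a ridge penalty.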

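Linear Models for Binary Classification

A linear model for binary classification predicts with the following formula: $$y = \sum w[i]*x[i]+b>0$$

Commonly used binary classification models are logistic regression and the linear support vector machine (linear SVM).

Logistic Regression

Logistic regression applies a nonlinear transformation (the sigmoid function) to the output y of a linear model, mapping it into the interval (0,1), and interprets the result as the probability that a sample belongs to class 1 rather than class 0.

Understanding Logistic Regression

The sigmoid function is $$g(z) = \frac{1}{1+e^{-z}}$$Its graph is shown below: the function value is 0.5 at z=0, and it tends to 0 and 1 as z tends to negative and positive infinity, respectively. If $${z = x\theta}$$then the value of the linear model is mapped into the interval between 0 and 1, and $g(z)$ can be read as the probability of class 1: the closer it is to 1, the more likely the sample belongs to class 1. Samples near the threshold 0.5 are the most likely to be misclassified.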

X = np.linspace(-10,10)
y = []
for i in X:
    y.append(1/(1+np.exp(-i)))
plt.plot(X,y)
plt.xlabel('z')
plt.ylabel('g(z)')

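The Principle of Logistic Regression

For each sample $(x_i,y_i)$, with $h_\theta(x_i)=g(x_i\theta)$, the probabilities of $y_i=1$ and $y_i=0$ are $$P(y_i=1|x_i,\theta)=h_\theta(x_i)$$$$P(y_i=0|x_i,\theta)=1-h_\theta(x_i)$$Combining the two gives $$P(y_i|x_i,\theta)=h_\theta(x_i)^{y_i}(1-h_\theta(x_i))^{1-y_i}$$Assuming the n samples are independent and identically distributed, maximum likelihood estimation (MLE) builds the likelihood function $$L(\theta)=\prod _{i=1}^nP(y_i|x_i,\theta)$$The likelihood is the probability of observing the given samples and should be maximized, so we take the negative log-likelihood as the loss function $$J(\theta) = -\ln L(\theta) = -\sum\limits_{i=1}^{n}(y_i\ln(h_{\theta}(x_i))+ (1-y_i)\ln(1-h_{\theta}(x_i)))$$

Taking the partial derivatives gives $$\frac{\partial}{\partial\theta}J(\theta) = X^T(h_{\theta}(X) - Y )$$

Applying gradient descent, $$\theta = \theta - \alpha X^T(h_{\theta}(X) - Y )$$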
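The update rule can be sketched in a few lines of NumPy. The function names, the synthetic two-blob data, and the division of the gradient by the sample count n (which only rescales the step size $\alpha$) are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, Y, alpha=0.1, n_iters=5000):
    """Logistic regression fitted by batch gradient descent."""
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - Y)  # gradient of the negative log-likelihood
        theta = theta - (alpha / n) * grad     # assumption: step scaled by 1/n
    return theta

# Hypothetical data: two overlapping Gaussian blobs in two dimensions
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-1.0, size=(100, 2))
X1 = rng.normal(loc=1.0, size=(100, 2))
X = np.column_stack([np.ones(200), np.vstack([X0, X1])])  # column of ones for the intercept
Y = np.concatenate([np.zeros(100), np.ones(100)])

theta = fit_logistic(X, Y)
# Predict class 1 when the estimated probability exceeds 0.5
predictions = (sigmoid(X @ theta) > 0.5).astype(float)
accuracy = np.mean(predictions == Y)
print(theta, accuracy)
```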

Implementation in sklearn

# Logistic Regression on the breast cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()
X_train,X_test,y_train,y_test = train_test_split(cancer.data,cancer.target,stratify = cancer.target,random_state=42)

for C,marker in zip([0.001,1,100],['o','^','v']):
    logistic = LogisticRegression(C = C,penalty='l2').fit(X_train,y_train)
    print('Training accuracy (C={}): {}'.format(C,logistic.score(X_train,y_train)))
    print('Test accuracy (C={}): {}'.format(C,logistic.score(X_test,y_test)))
    plt.plot(logistic.coef_.T,marker,label = 'C={}'.format(C))
plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation = 90)
plt.xlabel('Coefficient Index')
plt.ylabel('Coefficient')
plt.legend()

Training accuracy (C=0.001): 0.9225352112676056
Test accuracy (C=0.001): 0.9370629370629371
Training accuracy (C=1): 0.9553990610328639
Test accuracy (C=1): 0.958041958041958
Training accuracy (C=100): 0.971830985915493
Test accuracy (C=100): 0.965034965034965

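Logistic regression can also be regularized, again by adding a regularization term to the loss function. LogisticRegression in sklearn uses L2 regularization by default, and the penalty parameter changes the type of regularization. The figure above shows the coefficients of models trained with different regularization parameters: the smaller C is, the stronger the regularization and the narrower the range of the coefficients. This is because C in sklearn is the inverse of the regularization strength, so $\frac 1C$ plays the role of $\alpha$ in lasso and grows as C shrinks. $$J(\mathbf\theta) = -\sum\limits_{i=1}^{n}(y_i\ln(h_{\theta}(x_i))+ (1-y_i)\ln(1-h_{\theta}(x_i))) + \frac{1}{2C}||\theta||_2^2$$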

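Linear SVC

See SVM.

Linear Models for Multiclass Classification

Many linear models do not extend directly to multiclass problems; in that case the one-vs-rest approach can be used. For example, if the data falls into three classes A, B and C, three classifiers are trained, one per class; the classifier for class A separates class A from everything that is not class A. A point claimed by several classifiers (say, both A and B) or by none of them is assigned to the class with the highest score.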
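The "highest score wins" rule can be reproduced by hand from the fitted coefficients. This self-contained sketch assumes LinearSVC on make_blobs data and recomputes the per-class scores as $Xw^T+b$; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Three well-separated clusters, one per class
X, y = make_blobs(random_state=42)
clf = LinearSVC().fit(X, y)

# One score per class: row i holds the three one-vs-rest scores of sample i
scores = X @ clf.coef_.T + clf.intercept_

# Each sample is assigned to the class whose classifier gives the highest score
manual_pred = clf.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(manual_pred, clf.predict(X)))
```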

from sklearn.datasets import make_blobs

X,y=make_blobs(random_state=42)
mglearn.discrete_scatter(X[:,0],X[:,1],y)
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.legend(['Class0','Class1','Class2'])

We classify the data above with a linear SVM:

from sklearn.svm import LinearSVC
LSVC=LinearSVC().fit(X,y)
LSVC.coef_,LSVC.intercept_,LSVC.score(X,y)

(array([[-0.17492558,  0.23141285],
        [ 0.4762191 , -0.06937294],
        [-0.18914557, -0.20399693]]),
 array([-1.0774515 ,  0.13140521, -0.08604887]),
 1.0)

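LinearSVC outputs three lines, each separating one class from the others. We visualize them below: the three lines divide the plane into 7 regions, and the overlapping regions are divided among the classes.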

mglearn.plots.plot_2d_classification(LSVC,X,fill=True,alpha=.7)
mglearn.discrete_scatter(X[:,0],X[:,1],y)
line = np.linspace(-10,10)
for coef,intercept,color in zip(LSVC.coef_,LSVC.intercept_,['b','r','g']):
    plt.plot(line,-(line*coef[0]+intercept)/coef[1],c=color)
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.legend(['Class0','Class1','Class2','Line class 0','Line class 1','Line class 2'],loc=(1.01,0))