Xiaowen | WeChat official account: 小文的数据之旅
In the previous post we finished the theory of linear regression, so this time it is the long-awaited hands-on part! Below we implement linear regression in several ways: ordinary least squares with the statsmodels package, least squares with sklearn, batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. First, let's recall a few important formulas:
Loss function: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(x^{(i)}\theta - y^{(i)}\big)^2$
Least squares closed-form solution: $\theta = (X^{T}X)^{-1}X^{T}y$
Gradient descent update: $\theta := \theta - \alpha\nabla J(\theta) = \theta - \frac{\alpha}{m}X^{T}(X\theta - y)$
All of the implementations of linear regression described below are built on these formulas!
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf
First, let's see what the data looks like. It is a dataset made up of 3 features and one label. The summary statistics also show that the three features sit on very different scales, which causes problems for the modeling later. First, it slows down the convergence of gradient descent; second, it distorts the weights: the TV feature, for example, has a maximum of 296 and a mean of 147, more than six times the mean of radio, so the raw coefficients of TV and radio are not on a comparable scale — taken to the extreme, one might wrongly conclude that sales depends only on TV. To avoid this, the data needs to be preprocessed before building the model; here we use z-score standardization, i.e. each feature is transformed as $(x-\mu)/\sigma$.
data = pd.read_csv('./Desktop/Advertising.csv',sep = ',')
print(data.describe())
TV radio newspaper sales
count 200.000000 200.000000 200.000000 200.000000
mean 147.042500 23.264000 30.554000 14.022500
std 85.854236 14.846809 21.778621 5.217457
min 0.700000 0.000000 0.300000 1.600000
25% 74.375000 9.975000 12.750000 10.375000
50% 149.750000 22.900000 25.750000 12.900000
75% 218.825000 36.525000 45.100000 17.400000
max 296.400000 49.600000 114.000000 27.000000
# Split the data into training and test sets, and standardize the training features
train,test = train_test_split(data,test_size = 0.2,shuffle = True,random_state = 0)
train.iloc[:,:-1] = (train.iloc[:,:-1]-train.iloc[:,:-1].mean())/train.iloc[:,:-1].std()
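One detail worth flagging (not part of the original code): the training features are standardized in place above, so if you later want to score a model on the held-out split, the test features should be scaled with the training set's mean and std rather than their own. A minimal sketch, recovering the raw training statistics through the row index:
# Scale the test features with the *training* mean/std (recovered from the raw data by index)
raw_train = data.loc[train.index, ['TV', 'radio', 'newspaper']]
mu, sigma = raw_train.mean(), raw_train.std()
test = test.copy()
test[['TV', 'radio', 'newspaper']] = (test[['TV', 'radio', 'newspaper']] - mu) / sigma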
One more point: when the model is built on the standardized features, the parameter attached to each feature — the "weight" mentioned above — reflects that feature's importance. We can therefore use the weights to screen features; Lasso regression performs feature selection on exactly this principle (features whose weights are shrunk to very small values are dropped). And if we know in advance that some training samples matter more than others, we can give them larger weights before fitting — this is locally weighted linear regression.
Optimal parameters for locally weighted linear regression: $\theta = (X^{T}WX)^{-1}X^{T}Wy$, where $W$ is the (diagonal) weight matrix holding the per-sample weights.
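(Lasso itself is available in sklearn as sklearn.linear_model.Lasso.) For intuition, here is a minimal NumPy sketch of that closed form — not code from the original post; the function name lwlr_predict, the Gaussian kernel, and the bandwidth tau are illustrative choices. Samples close to the query point get larger weights:
# Locally weighted linear regression (sketch): theta = (X^T W X)^{-1} X^T W y
def lwlr_predict(x_query, X, y, tau=1.0):
    # X: training design matrix (with bias column), y: targets, x_query: point to predict
    # Gaussian kernel: samples near x_query get weights close to 1, distant ones close to 0
    diff = X - x_query
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)                                     # diagonal per-sample weight matrix
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y  # weighted closed-form solution
    return x_query @ theta                             # prediction at the query point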
We won't go further down that road here; let's first build an ordinary least squares model with the statsmodels package.
# statsmodels: ordinary least squares
stats_model = smf.ols('sales~ TV + radio + newspaper',data = train).fit()
print(stats_model.summary())
OLS Regression Results :
==============================================================================
Dep. Variable: sales R-squared: 0.907
Model: OLS Adj. R-squared: 0.905
Method: Least Squares F-statistic: 505.4
Date: Wed, 19 Jun 2019 Prob (F-statistic): 4.23e-80
Time: 22:41:19 Log-Likelihood: -297.29
No. Observations: 160 AIC: 602.6
Df Residuals: 156 BIC: 614.9
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 14.2175 0.124 114.463 0.000 13.972 14.463
TV 3.7877 0.125 30.212 0.000 3.540 4.035
radio 2.8956 0.132 21.994 0.000 2.636 3.156
newspaper -0.0596 0.132 -0.451 0.653 -0.321 0.202
==============================================================================
Omnibus: 13.557 Durbin-Watson: 2.038
Prob(Omnibus): 0.001 Jarque-Bera (JB): 15.174
Skew: -0.754 Prob(JB): 0.000507
Kurtosis: 2.990 Cond. No. 1.42
==============================================================================
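(As a small aside, not in the original post: these diagnostics can also be read straight off the fitted results object instead of the printed summary.)
# Pull the key diagnostics from the fitted statsmodels results
print(stats_model.rsquared)    # R-squared
print(stats_model.f_pvalue)    # p-value of the overall F-test
print(stats_model.pvalues)     # per-parameter t-test p-values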
As mentioned in the previous post, the F-test tells us whether the model as a whole is significant, the t-tests tell us whether each parameter is significant, and R-squared measures how much of the variation in the dependent variable the features explain. From the output above, stats_model has an R-squared of 0.907, indicating a good fit, and the model passes the F-test; every parameter except newspaper passes its t-test, so we drop the newspaper feature and refit the model.
stats_model1 = smf.ols('sales~ TV + radio',data = train).fit()
print(stats_model1.summary())
OLS Regression Results :
==============================================================================
Dep. Variable: sales R-squared: 0.907
Model: OLS Adj. R-squared: 0.905
Method: Least Squares F-statistic: 761.9
Date: Wed, 19 Jun 2019 Prob (F-statistic): 1.50e-81
Time: 22:41:35 Log-Likelihood: -297.40
No. Observations: 160 AIC: 600.8
Df Residuals: 157 BIC: 610.0
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 14.2175 0.124 114.754 0.000 13.973 14.462
TV 3.7820 0.124 30.401 0.000 3.536 4.028
radio 2.8766 0.124 23.123 0.000 2.631 3.122
==============================================================================
Omnibus: 13.633 Durbin-Watson: 2.040
Prob(Omnibus): 0.001 Jarque-Bera (JB): 15.256
Skew: -0.756 Prob(JB): 0.000487
Kurtosis: 3.000 Cond. No. 1.05
==============================================================================
stats_model1 passes all the tests, so the model can be written as: sales = 3.7820·TV + 2.8766·radio + 14.2175. Next, let's build the model with sklearn's linear_model.
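(Before we do, a quick aside not in the original post: stats_model1 was only assessed in-sample above. A sketch of a held-out check, assuming the test features were scaled with the training statistics as in the earlier sketch:)
# Score the statsmodels model on the held-out split
pred = stats_model1.predict(test)    # the formula API accepts a DataFrame containing TV and radio
print('test r2_score:{}'.format(r2_score(test['sales'], pred)))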
# Data preprocessing: keep TV and radio as features (newspaper was dropped above), sales as the label
x = data.iloc[:,:-2]
y = data.iloc[:,-1:]
x = (x-x.mean())/x.std()
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,
shuffle = True,random_state = 0)
# sklearn: ordinary least squares
lr = LinearRegression()
lr.fit(x_train,y_train)
result_lr = lr.predict(x_test)
print('r2_score:{}'.format(r2_score(y_test,result_lr))) # R-squared
print('coef:{}'.format(lr.coef_))
print('intercept:{}'.format(lr.intercept_))
r2_score:0.8604541663186569
coef:[[3.82192087 2.89820718]]
intercept:[14.03854759]
The model fitted by sklearn is sales = 3.8219·TV + 2.8982·radio + 14.0385.
Next, we solve for the optimal parameters with the least squares closed form, $\theta = (X^{T}X)^{-1}X^{T}y$, by writing our own least squares function. Since this involves matrix operations, we first convert the datasets to matrix format.
# Hand-written ordinary least squares
def ols_linear_model(x_train,x_test,y_train,y_test):
    x_train.insert(0,'b',[1]*len(x_train))  # set x0 = 1 so the intercept folds into the weights
    x_test.insert(0,'b',[1]*len(x_test))    # note: insert() modifies the DataFrames in place
    x_train = np.matrix(x_train)
    y_train = np.matrix(y_train)
    x_test = np.matrix(x_test)
    # The solution requires inverting x_train.T * x_train, so first check that it is invertible
    if np.linalg.det(x_train.T*x_train) == 0:
        print('Singular matrix, not invertible')
    else:
        # Closed-form optimal parameters
        weights = np.linalg.inv(x_train.T*x_train)*x_train.T*y_train
        # Prediction
        y_predict = x_test*weights
        print('r2_score:{}'.format(r2_score(y_test,y_predict)))
        print('coef:{}'.format(weights[1:]))
        # Since x0 is 1, the first weight is the intercept
        print('intercept:{}'.format(weights[0]))
# Result
ols_linear_model(x_train,x_test,y_train,y_test)
r2_score:0.860454166318657
coef:[[3.82192087]
[2.89820718]]
intercept:[[14.03854759]]
The hand-written least squares model is sales = 3.8219·TV + 2.8982·radio + 14.0385. Writing it out by hand also makes the main drawback of least squares clear: when $X^{T}X$ is not invertible — that is, when X is not of full rank — there is no closed-form solution and the optimal parameters cannot be obtained this way. This is where the optimization algorithm, gradient descent, comes in.
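(A side note not from the original post: when $X^{T}X$ is singular or nearly so, one common workaround is the Moore–Penrose pseudo-inverse; inside ols_linear_model, the weights line could be swapped for the version below with everything else unchanged. Ridge regression, which adds a penalty term that makes the matrix invertible, is another standard fix.)
# Pseudo-inverse variant: still yields a solution when x_train.T * x_train cannot be inverted exactly
weights = np.linalg.pinv(x_train.T*x_train)*x_train.T*y_train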
Gradient descent can be divided, according to how much of the training data is used for each parameter update, into batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
# Batch gradient descent
def gradient_desc(x_train, y_train,x_test,alpha, max_itor):
    # x_train/x_test already contain the bias column 'b' inserted by ols_linear_model above,
    # so theta[0] is the intercept; y_test is read from the outer scope
    x_train = np.array(x_train)
    x_test = np.array(x_test)
    y_train = np.array(y_train).flatten()
    theta = np.zeros(x_train.shape[1])
    epsilon = 1e-8
    iter_count = 0
    loss = 10
    # Stop once the loss falls below the threshold or the maximum number of iterations is reached
    while loss > epsilon and iter_count < max_itor:
        loss = 0
        iter_count+=1
        # Gradient (computed on the whole training set)
        gradient = x_train.T.dot(x_train.dot(theta) - y_train)/ len(y_train)
        theta = theta - alpha * gradient
        # Loss function
        loss = np.sum((y_train - np.dot(x_train, theta))**2) / (2*len(y_train))
    y_predict = x_test.dot(theta)
    print('r2_score:{}'.format(r2_score(y_test,y_predict)))
    print('coef:{}'.format(theta[1:]))
    print('intercept:{}'.format(theta[0]))
# Result
gradient_desc(x_train, y_train,x_test,alpha=0.001, max_itor=10000)
r2_score:0.8604634817515153
coef:[3.82203058 2.8981221 ]
intercept:14.037935836020237
The fitted model is sales = 3.8220·TV + 2.8981·radio + 14.0379. Writing batch gradient descent by hand shows its drawback: every parameter update uses the entire training set, so once the training set is large each iteration becomes expensive. This motivates an optimized variant: stochastic gradient descent.
# Stochastic gradient descent
def s_gradient_desc(x_train, y_train,x_test,alpha, max_itor):
    x_train = np.array(x_train)
    x_test = np.array(x_test)
    y_train = np.array(y_train).flatten()
    theta = np.zeros(x_train.shape[1])
    epsilon = 1e-8
    iter_count = 0
    loss = 10
    # Stop once the loss falls below the threshold or the maximum number of iterations is reached
    while loss > epsilon and iter_count < max_itor:
        loss = 0
        iter_count+=1
        rand_i = np.random.randint(len(x_train))
        # Gradient (computed on a single randomly chosen training sample)
        gradient = x_train[rand_i].T.dot(x_train[rand_i].dot(theta) - y_train[rand_i])
        theta = theta - alpha * gradient
        # Loss function
        loss = np.sum((y_train - np.dot(x_train, theta))**2) / (2*len(y_train))
    y_predict = x_test.dot(theta)
    print('r2_score:{}'.format(r2_score(y_test,y_predict)))
    print('coef:{}'.format(theta[1:]))
    print('intercept:{}'.format(theta[0]))
    print('iter_count:{}'.format(iter_count))
# Result
s_gradient_desc(x_train, y_train,x_test,alpha=0.001, max_itor=10000)
r2_score:0.8607601654222723
coef:[3.83573278 2.90238477]
intercept:14.036801544903055
The fitted model is sales = 3.8357·TV + 2.9024·radio + 14.0368. Stochastic gradient descent has its own drawback: each parameter update uses only a single training sample, so the descent path is noisy and the result may land slightly off the true optimum. Combining the strengths of batch and stochastic gradient descent gives mini-batch gradient descent.
# Mini-batch gradient descent
def sb_gradient_desc(x_train, y_train,x_test,alpha,num,max_itor):
    x_train = np.array(x_train)
    x_test = np.array(x_test)
    y_train = np.array(y_train).flatten()
    theta = np.zeros(x_train.shape[1])
    epsilon = 1e-8
    iter_count = 0
    loss = 10
    # Stop once the loss falls below the threshold or the maximum number of iterations is reached
    while loss > epsilon and iter_count < max_itor:
        loss = 0
        iter_count+=1
        rand_i = np.random.randint(0,len(x_train),num)
        # Gradient (computed on a random mini-batch of num training samples)
        gradient = x_train[rand_i].T.dot(x_train[rand_i].dot(theta) - y_train[rand_i])/num
        theta = theta - alpha * gradient
        # Loss function
        loss = np.sum((y_train - np.dot(x_train, theta))**2) / (2*len(y_train))
    y_predict = x_test.dot(theta)
    print('r2_score:{}'.format(r2_score(y_test,y_predict)))
    print('coef:{}'.format(theta[1:]))
    print('intercept:{}'.format(theta[0]))
    print('iter_count:{}'.format(iter_count))
# Result
sb_gradient_desc(x_train, y_train,x_test,alpha=0.001,num=20,max_itor=10000)
r2_score:0.860623250516056
coef:[3.82871666 2.89894667]
intercept:14.042705519319549
The fitted model is sales = 3.8287·TV + 2.8989·radio + 14.0427.
To summarize: on this dataset all of the approaches — statsmodels OLS, sklearn's LinearRegression, the hand-written least squares, and the three gradient descent variants — converge to essentially the same model on the standardized features (roughly sales ≈ 3.82·TV + 2.90·radio + 14.04, with a test-set R² of about 0.86 for the models evaluated on the held-out split). Least squares gives an exact closed-form solution but fails when $X^{T}X$ is not invertible; batch gradient descent always applies but uses the whole training set for every update; stochastic gradient descent is cheap per update but noisy; mini-batch gradient descent strikes a balance between the two.
— end —