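1. Linear Regression

1.1 Basic principle of linear regression:

Find the values of w and b that minimize the sum of squared differences between the predicted y values and the true y values on the training set.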
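To make the least-squares idea concrete, here is a minimal sketch on a tiny hand-made dataset. The data and the helper name `sum_squared_error` are illustrative assumptions, not part of the original example. It solves for w and b with NumPy and checks that nudging either parameter does not reduce the squared error.

```python
import numpy as np

# Toy data (assumed for illustration): y is roughly 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Append a column of ones so the intercept b is estimated together with w
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

def sum_squared_error(w, b):
    """Sum of squared differences between predictions and true values."""
    return float(np.sum((w * x + b - y) ** 2))

best = sum_squared_error(w, b)
print("w = {:.3f}, b = {:.3f}, SSE = {:.4f}".format(w, b, best))

# Any other (w, b) gives an error at least as large
print(sum_squared_error(w + 0.1, b) >= best)
print(sum_squared_error(w, b - 0.1) >= best)
```

`LinearRegression` in scikit-learn solves the same kind of problem, just for many features at once.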

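1.2 General formula of linear regression:

ŷ = w[0]*x[0] + w[1]*x[1] + ... + w[p]*x[p] + b

where ŷ is the estimated value of y; x[0], x[1], ..., x[p] are the features of a data point (so this formula describes a dataset with p+1 features); and w and b are the parameters of the model.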
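The general formula can be checked against scikit-learn directly. The sketch below assumes a small synthetic dataset from `make_regression` (sizes, noise level, and seed are arbitrary choices). It evaluates w[0]*x[0] + ... + b by hand from `coef_` and `intercept_` and compares the result with `predict`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Small synthetic dataset; sizes and seed are illustrative choices
X, y = make_regression(n_samples=20, n_features=3, noise=5.0, random_state=0)
model = LinearRegression().fit(X, y)

# y_hat = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b, evaluated for every row of X
manual = X @ model.coef_ + model.intercept_

# True when the hand-computed values agree with model.predict
print(np.allclose(manual, model.predict(X)))
```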

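1.3 Example: generate a dataset with 100 samples and 2 features, split it into a training set and a test set, then fit a linear regression to obtain w and b: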

# Import the dataset-splitting utility
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
# Generate a regression dataset
X,y = make_regression(n_samples=100,n_features=2,n_informative=2,random_state=38)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)
# Fit the linear regression model on the training set
lr = LinearRegression()
lr.fit(X_train,y_train)

1.4 Training is done; print the results:

print('Coefficients: {}'.format(lr.coef_), 'Intercept: {}'.format(lr.intercept_))

Coefficients: [ 70.38592453 7.43213621] Intercept: -1.4210854715202004e-14
Model scores:
print(lr.score(X_train,y_train))
Result: 1.0
print(lr.score(X_test,y_test))
Result: 1.0

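1.5 Analysis of the results:

1. The data points generated by make_regression have 2 features, so lr.coef_ is a one-dimensional array with two elements, one coefficient per feature. In this example the fitted model can be written as: ŷ = 70.3859 * x[0] + 7.4321 * x[1] - 1.42e-14 (the intercept is effectively 0).
2. Why the score is so high: the generated data contains no noise (make_regression uses noise=0 by default). Real-world data is noisy and often has many features, which makes things much harder for a linear model.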
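The effect of noise is easy to check. The following sketch reuses the same generator settings and split as the example and only changes the `noise` argument. The value 50 and the helper name `score_with_noise` are arbitrary illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def score_with_noise(noise):
    """Fit linear regression on generated data and return the test-set R^2."""
    X, y = make_regression(n_samples=100, n_features=2, n_informative=2,
                           noise=noise, random_state=38)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)
    return LinearRegression().fit(X_train, y_train).score(X_test, y_test)

clean_score = score_with_noise(noise=0)
noisy_score = score_with_noise(noise=50)  # noise level is an arbitrary choice

print("noise=0:  {:.2f}".format(clean_score))
print("noise=50: {:.2f}".format(noisy_score))
```

With noise added, the test score should fall below the perfect score seen on the clean data.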

1.6 Testing the model on the diabetes dataset

from sklearn.datasets import load_diabetes
#X, y = make_regression(n_samples=500,n_features=50,n_informative=25,noise=500,
                      #random_state=8)
# Load the features and the target in a single call
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 8)
lr = LinearRegression().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

Output:
Training set score: 0.53
Test set score: 0.46
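Analysis:
1. Real data is far more complex than the generated data, so the performance of linear regression drops sharply.
2. Plain linear regression has no mechanism for limiting model complexity, so it overfits easily. The gap between the training score (0.53) and the test score (0.46) is a clear sign of this.
3. Ridge regression can control model complexity and is one of the most commonly used alternatives to standard linear regression.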

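2. Ridge Regression

2.1 Principle of ridge regression:

Ridge regression is a linear model designed to reduce overfitting. It keeps every feature but shrinks the feature coefficients, so that each feature has less influence on the prediction.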

2.2 L2 regularization: avoiding overfitting by keeping all features and only reducing the size of their coefficients is called L2 regularization.
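To show what the L2 penalty does mechanically, here is a minimal sketch that solves the penalized least-squares problem in closed form and compares it with scikit-learn. It assumes the objective ||y - Xw||^2 + alpha * ||w||^2 with an unpenalized intercept, which is how scikit-learn documents `Ridge`. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)
alpha = 1.0

# Center the data so the (unpenalized) intercept can be handled separately
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# Minimizing ||yc - Xc w||^2 + alpha * ||w||^2 gives
# w = (Xc^T Xc + alpha * I)^(-1) Xc^T yc
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
b = y_mean - X_mean @ w

ridge = Ridge(alpha=alpha).fit(X, y)
print(np.allclose(w, ridge.coef_, rtol=1e-6, atol=1e-6))
print(np.isclose(b, ridge.intercept_))
```

The `alpha * I` term is what pulls the coefficients toward 0: the larger alpha is, the more it dominates the matrix being inverted.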

2.3 Test ridge regression on the same dataset:

from sklearn.linear_model import Ridge
ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge.score(X_test, y_test)))

Output:
Training set score: 0.43
Test set score: 0.43

2.4 Tuning the ridge regression parameter

2.4.1 Change alpha from its default value of 1 to 10

ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge10.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge10.score(X_test, y_test)))

Output:
Training set score: 0.15
Test set score: 0.16
2.4.2 Change alpha from its default value of 1 to 0.1

ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print("Training set score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge01.score(X_test, y_test)))

Output:
Training set score: 0.52
Test set score: 0.47

2.5 Using a plot to see how alpha affects the model:

plt.plot(ridge.coef_, 's', label = 'Ridge alpha=1')
plt.plot(ridge10.coef_, '^', label = 'Ridge alpha=10')
plt.plot(ridge01.coef_, 'v', label = 'Ridge alpha=0.1')
plt.plot(lr.coef_, 'o', label = 'linear regression')
plt.xlabel("coefficient index")
plt.ylabel("coefficient magnitude")
plt.hlines(0,0, len(lr.coef_))
#plt.ylim(-25,25)
plt.legend(loc='best')

[Figure: coefficient magnitude vs. coefficient index for Ridge (alpha=0.1, 1, 10) and linear regression]

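Analysis of the results:
In the plot, the horizontal axis is the index into coef_: x=0 is the coefficient of the first feature, x=1 the coefficient of the second, and so on up to x=9 (the diabetes dataset has 10 features). The vertical axis shows the value of each coefficient. With alpha=10, most coefficients lie close to 0. With alpha=1, the ridge coefficients are noticeably larger. With alpha=0.1 they are larger still, and many of them nearly coincide with the linear regression points. Linear regression has no alpha, that is, no regularization at all, so its coefficients are the largest in magnitude.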
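The same trend can be summarized in numbers instead of read off a plot. This sketch refits the four models on the same split and reports the L2 norm of each coefficient vector. The use of the norm as a single summary figure is an illustrative choice, not something the original computes.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)

# L2 norm of the coefficient vector, from strongest to weakest regularization
norms = {}
for alpha in [10, 1, 0.1]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    norms["Ridge alpha={}".format(alpha)] = np.linalg.norm(model.coef_)
norms["linear regression"] = np.linalg.norm(
    LinearRegression().fit(X_train, y_train).coef_)

for name, value in norms.items():
    print("{:<20s} ||w|| = {:.1f}".format(name, value))
```

The norms should grow as alpha shrinks, with unregularized linear regression the largest.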

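3. Lasso Regression

3.1 Principle of lasso regression:

Like ridge regression, lasso regression also constrains the coefficients to stay close to 0, but it does so in a slightly different way: with lasso, some coefficients may become exactly 0, which means the model ignores those features entirely. This form of regularization is called L1 regularization.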
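One way to see why L1 regularization produces exact zeros while L2 does not: in the idealized special case of orthonormal features (an assumption that does not hold for the diabetes data), the L1-penalized solution is the least-squares coefficient passed through a soft threshold, whereas the L2-penalized solution is the same coefficient rescaled. The sketch below uses made-up coefficients and made-up penalty strengths. The function names are hypothetical helpers, not scikit-learn API.

```python
import numpy as np

def soft_threshold(w, t):
    """L1-style shrinkage: move each value toward 0 by t, clipping at 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ridge_shrink(w, lam):
    """L2-style shrinkage: scale every value by the same factor."""
    return w / (1.0 + lam)

w_ols = np.array([3.0, -0.5, 1.2, 0.2])  # made-up least-squares coefficients

w_l1 = soft_threshold(w_ols, t=1.0)
w_l2 = ridge_shrink(w_ols, lam=1.0)

print("L1:", w_l1, "nonzero:", np.count_nonzero(w_l1))
print("L2:", w_l2, "nonzero:", np.count_nonzero(w_l2))
```

Coefficients smaller in magnitude than the threshold are set to exactly 0 under L1 shrinkage, while L2 shrinkage makes every coefficient smaller but leaves all of them nonzero.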

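3.2 Test lasso regression on the diabetes data:

import numpy as np
from sklearn.linear_model import Lasso
lasso = Lasso().fit(X_train, y_train)
print("Lasso training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Lasso test set score: {:.2f}".format(lasso.score(X_test, y_test)))
# Count the features whose coefficient is not exactly 0
print("Number of features used by lasso: {}".format(np.sum(lasso.coef_ != 0)))

Output:
Lasso training set score: 0.36
Lasso test set score: 0.37
Number of features used by lasso: 3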

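3.3 Tuning the lasso parameter: lower alpha from its default value of 1.0 to 0.1

# Raise max_iter above its default value;
# otherwise the model warns that the number of iterations should be increased
lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
print("Lasso (alpha=0.1) training set score: {:.2f}".format(lasso01.score(X_train, y_train)))
print("Lasso (alpha=0.1) test set score: {:.2f}".format(lasso01.score(X_test, y_test)))
print("Lasso (alpha=0.1) number of features used: {}".format(np.sum(lasso01.coef_ != 0)))

Output:
Lasso (alpha=0.1) training set score: 0.52
Lasso (alpha=0.1) test set score: 0.48
Lasso (alpha=0.1) number of features used: 7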

3.4 Tuning the lasso parameter: lower alpha from its default value of 1.0 to 0.0001

lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
print("Lasso (alpha=0.0001) training set score: {:.2f}".format(lasso00001.score(X_train, y_train)))
print("Lasso (alpha=0.0001) test set score: {:.2f}".format(lasso00001.score(X_test, y_test)))
print("Lasso (alpha=0.0001) number of features used: {}".format(np.sum(lasso00001.coef_ != 0)))

Output:
Lasso (alpha=0.0001) training set score: 0.53
Lasso (alpha=0.0001) test set score: 0.46
Lasso (alpha=0.0001) number of features used: 10
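The three experiments can also be run as a single loop, which makes the trade-off between alpha, the scores, and the number of features used easier to compare side by side. This is a minimal sketch that assumes the same data split as above. The `results` dictionary is an illustrative structure, and exact numbers may differ slightly between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)

results = {}
for alpha in [1.0, 0.1, 0.0001]:
    # A large max_iter avoids convergence warnings at small alpha
    model = Lasso(alpha=alpha, max_iter=100000).fit(X_train, y_train)
    results[alpha] = {
        "train": model.score(X_train, y_train),
        "test": model.score(X_test, y_test),
        "n_features": int(np.sum(model.coef_ != 0)),
    }

for alpha, r in results.items():
    print("alpha={:<7} train={:.2f} test={:.2f} features used={}".format(
        alpha, r["train"], r["test"], r["n_features"]))
```

A smaller alpha means weaker regularization: more features are kept and the training score rises, but the test score does not necessarily follow.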

3.5 Comparing lasso regression and ridge regression:

plt.plot(lasso.coef_, 's', label="Lasso alpha=1")
plt.plot(lasso01.coef_, '^', label="Lasso alpha=0.1")
plt.plot(lasso00001.coef_, 'v', label="Lasso alpha=0.0001")
plt.plot(ridge01.coef_, 'o', label="Ridge alpha=0.1")
plt.legend(ncol=2,loc=(0,1.05))
#plt.ylim(-25,25)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")