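Multiple regression studies the relationship between one dependent variable and two or more independent variables. When that relationship is linear, it is called multiple linear regression. It describes how the quantity of one phenomenon changes as the quantities of several other phenomena change, and it is a statistical method for building a linear or nonlinear mathematical model that relates several variables.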
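To make the definition concrete, here is a minimal sketch of a linear model with two independent variables, fitted by ordinary least squares with NumPy. The data are synthetic and the names `X`, `y`, `true_w` and `w_hat` are made up for illustration; none of them come from the housing dataset used later.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 1.5 + 2.0*x1 - 3.0*x2, no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # two independent variables
true_w = np.array([1.5, 2.0, -3.0])   # intercept w0, then w1 and w2
y = true_w[0] + X @ true_w[1:]

# Prepend a column of ones so the intercept is estimated together with the weights
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||X_design @ w - y||^2
w_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(w_hat)
```

Because the synthetic target has no noise, the estimated weights should match `true_w` up to floating-point error.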
Dataset:
Link: https://pan.baidu.com/s/1Qv9OieI5R5zu-jbKU3bLZg?pwd=eyzh
Extraction code: eyzh
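The underlying concepts are not explained in detail here; look them up as needed. This post only shows how to use the model in machine learning.
We use Boston house price prediction as the example:
1. Load the data: "D:\mlData\house_data.csv" is the path where the file is stored, and df.head() takes the number of rows to display.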
# 1. Read the data
import pandas as pd
df = pd.read_csv(r"D:\mlData\house_data.csv")  # raw string keeps the backslashes literal
df.head(10)  # show the first ten rows
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
5 | 0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
6 | 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.60 | 12.43 | 22.9 |
7 | 0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.90 | 19.15 | 27.1 |
8 | 0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
9 | 0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.10 | 18.9 |
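If the CSV file is not at hand, the behaviour of `pd.read_csv` and `head(n)` can be tried on an in-memory stand-in. This is only a sketch: `csv_text` holds three rows and four of the columns copied from the table above, and the names `demo` and `first_two` are hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file (a small excerpt, not the full dataset)
csv_text = """CRIM,ZN,INDUS,MEDV
0.00632,18.0,2.31,24.0
0.02731,0.0,7.07,21.6
0.02729,0.0,7.07,34.7
"""

# read_csv accepts a file-like object as well as a path
demo = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; without an argument it returns the first five
first_two = demo.head(2)
print(first_two)
print(demo.shape)
```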
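2. Feature engineering: ydata takes the "MEDV" column as the target, and xdata is the data with the entire "MEDV" column dropped.
import matplotlib.pyplot as plt
# Target: the MEDV column
ydata = df['MEDV']
# Features: every column except MEDV (axis=1 drops a column, not a row)
xdata = df.drop('MEDV', axis=1)
xdata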
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 |
502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 |
503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 |
504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 |
505 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273 | 21.0 | 396.90 | 7.88 |
506 rows × 13 columns
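3. Split the dataset: the data is typically divided into a training set and a test set at a ratio of 8:2.
# 3. Split the dataset
from sklearn.model_selection import train_test_split
# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
xtrain, xtest, ytrain, ytest = train_test_split(xdata, ydata, test_size=0.2, random_state=33)
print(ytrain, ytest, xtrain, xtest)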
229    31.5
296    27.1
425     8.3
491    13.6
418     8.8
       ...
146    15.6
66     19.4
216    23.3
391    23.2
20     13.6
Name: MEDV, Length: 404, dtype: float64
122    20.5
400     5.6
423    13.4
447    12.6
44     21.2
       ...
165    25.0
106    19.5
470    19.9
149    15.4
110    21.7
Name: MEDV, Length: 102, dtype: float64
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS  RAD  TAX  \
229   0.44178   0.0   6.20     0  0.504  6.552   21.4  3.3751    8  307
296   0.05372   0.0  13.92     0  0.437  6.549   51.0  5.9604    4  289
425  15.86030   0.0  18.10     0  0.679  5.896   95.4  1.9096   24  666
491   0.10574   0.0  27.74     0  0.609  5.983   98.8  1.8681    4  711
418  73.53410   0.0  18.10     0  0.679  5.957  100.0  1.8026   24  666
..        ...   ...    ...   ...    ...    ...    ...     ...  ...  ...
146   2.15505   0.0  19.58     0  0.871  5.628  100.0  1.5166    5  403
66    0.04379  80.0   3.37     0  0.398  5.787   31.1  6.6115    4  337
216   0.04560   0.0  13.89     1  0.550  5.888   56.0  3.1121    5  276
391   5.29305   0.0  18.10     0  0.700  6.051   82.5  2.1678   24  666
20    1.25179   0.0   8.14     0  0.538  5.570   98.1  3.7979    4  307

     PTRATIO       B  LSTAT
229     17.4  380.34   3.76
296     16.0  392.85   7.39
425     20.2    7.68  24.39
491     20.1  390.11  18.07
418     20.2   16.45  20.62
..       ...     ...    ...
146     14.7  169.27  16.65
66      16.1  396.90  10.24
216     16.4  392.80  13.51
391     20.2  378.38  18.76
20      21.0  376.57  21.02

[404 rows x 13 columns]
         CRIM   ZN  INDUS  CHAS    NOX     RM    AGE     DIS  RAD  TAX  \
122   0.09299  0.0  25.65     0  0.581  5.961   92.9  2.0869    2  188
400  25.04610  0.0  18.10     0  0.693  5.987  100.0  1.5888   24  666
423   7.05042  0.0  18.10     0  0.614  6.103   85.1  2.0218   24  666
447   9.92485  0.0  18.10     0  0.740  6.251   96.6  2.1980   24  666
44    0.12269  0.0   6.91     0  0.448  6.069   40.0  5.7209    3  233
..        ...  ...    ...   ...    ...    ...    ...     ...  ...  ...
165   2.92400  0.0  19.58     0  0.605  6.101   93.0  2.2834    5  403
106   0.17120  0.0   8.56     0  0.520  5.836   91.9  2.2110    5  384
470   4.34879  0.0  18.10     0  0.580  6.167   84.0  3.0334   24  666
149   2.73397  0.0  19.58     0  0.871  5.597   94.9  1.5257    5  403
110   0.10793  0.0   8.56     0  0.520  6.195   54.4  2.7778    5  384

     PTRATIO       B  LSTAT
122     19.1  378.09  17.93
400     20.2  396.90  26.77
423     20.2    2.52  23.29
447     20.2  388.52  16.44
44      17.9  389.39   9.55
..       ...     ...    ...
165     14.7  240.16   9.81
106     20.9  395.67  18.66
470     20.2  396.90  16.29
149     14.7  351.85  21.45
110     20.9  393.49  13.00

[102 rows x 13 columns]
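The 404 / 102 row counts in the output follow from `test_size=0.2` applied to 506 rows. The sketch below reproduces those sizes on random stand-in data with the same shape as the housing table; `x_demo`, `y_demo` and the other names are hypothetical and the values are not real housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the housing table (506 rows, 13 features)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(506, 13)))
y_demo = pd.Series(rng.normal(size=506))

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=33)
print(x_tr.shape, x_te.shape)
print(y_tr.shape, y_te.shape)

# Repeating the call with the same random_state selects the same rows again
x_tr2, x_te2, _, _ = train_test_split(x_demo, y_demo, test_size=0.2, random_state=33)
print(x_tr.index.equals(x_tr2.index))
```

Fixing `random_state` makes the split reproducible, so later results such as the intercept and the MSE can be compared between runs.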
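4. Train the model: the fitted intercept is 33.046064463200565.
# Model training (estimator)
# xtrain
# ytrain
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Create the regression object
lr = LinearRegression()
# Fit the model: learns the weights (coef_) and the intercept (intercept_)
lr.fit(xtrain, ytrain)
# Weight coefficients
# y = w0 + w1*x1 + w2*x2 + ... + wn*xn
# w0 is the intercept
# w1 ... wn are the weights, one per feature
lr.coef_
# Intercept
lr.intercept_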
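`coef_` is a plain array, so it helps to pair each weight with its feature name. The following is a minimal sketch on synthetic data; the column names `a`, `b`, `c` and the variable names are hypothetical. With the housing data, the same idea would be `pd.Series(lr.coef_, index=xtrain.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: y = 4.0 + 1.0*a - 2.0*b + 0.5*c, no noise
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 4.0 + 1.0 * x_demo["a"] - 2.0 * x_demo["b"] + 0.5 * x_demo["c"]

model = LinearRegression().fit(x_demo, y_demo)

# coef_ follows the column order of the training frame
weights = pd.Series(model.coef_, index=x_demo.columns)
print(weights)
print("intercept:", model.intercept_)
```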
5. Predict: run the fitted model on the test data.
# Model predictions
y_predict = lr.predict(xtest)
y_predict
# Actual values
ytest
122    20.5
400     5.6
423    13.4
447    12.6
44     21.2
       ...
165    25.0
106    19.5
470    19.9
149    15.4
110    21.7
Name: MEDV, Length: 102, dtype: float64
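Predictions come back as a NumPy array while the true values are a pandas Series, so they are easier to compare in one table. This sketch uses synthetic data and hypothetical names (`f1`, `f2`, `comparison`); it only illustrates the pattern and is not output from the housing model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a little noise
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(100, 2)), columns=["f1", "f2"])
y_demo = 3.0 + 2.0 * x_demo["f1"] - 1.0 * x_demo["f2"] + rng.normal(scale=0.1, size=100)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=33)
pred = LinearRegression().fit(x_tr, y_tr).predict(x_te)

# The Series supplies the row index; the array is matched to it by position
comparison = pd.DataFrame({"actual": y_te, "predicted": pred})
comparison["residual"] = comparison["actual"] - comparison["predicted"]
print(comparison.head())
```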
6. Evaluate the model
# Model evaluation: mean squared error (MSE)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(ytest, y_predict)
mse
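MSE is expressed in squared units of the target, so it is often reported together with its square root (RMSE) and the coefficient of determination R². The sketch below computes all three on four made-up values; the arrays are illustrative only and are not results from the housing model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Squared errors are 1, 0, 1, 0, so the MSE is 0.5
mse_demo = mean_squared_error(y_true, y_pred)

# RMSE is in the same units as the target
rmse_demo = np.sqrt(mse_demo)

# R^2 = 1 - SS_res / SS_tot = 1 - 2 / 20
r2_demo = r2_score(y_true, y_pred)

print(mse_demo, rmse_demo, r2_demo)
```

On the housing data the same calls would take `ytest` and `y_predict` as arguments.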