Machine Learning: A Multiple Linear Regression Example

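Regression that relates one dependent variable to two or more independent variables is called multiple regression; when the relationship is assumed to be linear, it is multiple linear regression. It describes how the quantity of one phenomenon changes in response to changes in the quantities of several others. More generally, it is a statistical method for building a mathematical model, linear or nonlinear, of the quantitative relationship among multiple variables.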
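To make the definition concrete, here is a minimal sketch (not part of the original walkthrough) that fits y = w0 + w1*x1 + w2*x2 to a handful of made-up points by solving the normal equations. It uses only the Python standard library. The data and the helper names `solve_linear_system` and `fit_linear_regression` are illustrative assumptions; scikit-learn's `LinearRegression`, used later in this post, solves the same least-squares problem with its own numerical routine.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so row operations touch both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution, from the last unknown to the first.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Return [w0, w1, ..., wn] minimising the sum of squared errors."""
    # Prepend a constant 1 so that w0 acts as the intercept.
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated from y = 1 + 2*x1 + 3*x2 (no noise).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

weights = fit_linear_regression(rows, targets)
print([round(w, 6) for w in weights])
```

Because the toy targets contain no noise, the fitted weights should come out very close to 1, 2 and 3, that is, the intercept and the two slopes used to generate the data.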

Data:

Link: https://pan.baidu.com/s/1Qv9OieI5R5zu-jbKU3bLZg?pwd=eyzh (extraction code: eyzh)

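The underlying concepts are not explained in detail here; look them up as needed. This post only shows how to use the model in practice:

Taking Boston house price prediction as the example: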

1. Load the data: "D:\mlData\house_data.csv" is the path where the file is stored; df.head(n) returns the first n rows.


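# 1. Read the data
import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r"D:\mlData\house_data.csv")

df.head(10)  # show the first ten rows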
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
5 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
6 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9
7 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1
8 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5
9 0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.10 18.9

2. Feature preparation: ydata takes the "MEDV" column as the target; xdata is the same DataFrame with the entire "MEDV" column dropped.

import matplotlib.pyplot as plt
ydata = df['MEDV']               # target: the MEDV column
xdata = df.drop('MEDV', axis=1)  # features: every column except MEDV
xdata
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 391.99 9.67
502 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 396.90 9.08
503 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 396.90 5.64
504 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48
505 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 396.90 7.88

506 rows × 13 columns

3. Split the dataset: the data is divided into a training set and a test set, typically in an 8:2 ratio.

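# 3. Split the dataset
from sklearn.model_selection import train_test_split
# test_size=0.2 keeps 20% of the rows for testing; random_state makes the shuffle repeatable
xtrain, xtest, ytrain, ytest = train_test_split(xdata, ydata, test_size=0.2, random_state=33)
print(ytrain, ytest, xtrain, xtest)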
229    31.5
296    27.1
425     8.3
491    13.6
418     8.8
       ... 
146    15.6
66     19.4
216    23.3
391    23.2
20     13.6
Name: MEDV, Length: 404, dtype: float64 122    20.5
400     5.6
423    13.4
447    12.6
44     21.2
       ... 
165    25.0
106    19.5
470    19.9
149    15.4
110    21.7
Name: MEDV, Length: 102, dtype: float64          CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS  RAD  TAX  \
229   0.44178   0.0   6.20     0  0.504  6.552   21.4  3.3751    8  307   
296   0.05372   0.0  13.92     0  0.437  6.549   51.0  5.9604    4  289   
425  15.86030   0.0  18.10     0  0.679  5.896   95.4  1.9096   24  666   
491   0.10574   0.0  27.74     0  0.609  5.983   98.8  1.8681    4  711   
418  73.53410   0.0  18.10     0  0.679  5.957  100.0  1.8026   24  666   
..        ...   ...    ...   ...    ...    ...    ...     ...  ...  ...   
146   2.15505   0.0  19.58     0  0.871  5.628  100.0  1.5166    5  403   
66    0.04379  80.0   3.37     0  0.398  5.787   31.1  6.6115    4  337   
216   0.04560   0.0  13.89     1  0.550  5.888   56.0  3.1121    5  276   
391   5.29305   0.0  18.10     0  0.700  6.051   82.5  2.1678   24  666   
20    1.25179   0.0   8.14     0  0.538  5.570   98.1  3.7979    4  307   

     PTRATIO       B  LSTAT  
229     17.4  380.34   3.76  
296     16.0  392.85   7.39  
425     20.2    7.68  24.39  
491     20.1  390.11  18.07  
418     20.2   16.45  20.62  
..       ...     ...    ...  
146     14.7  169.27  16.65  
66      16.1  396.90  10.24  
216     16.4  392.80  13.51  
391     20.2  378.38  18.76  
20      21.0  376.57  21.02  

[404 rows x 13 columns]          CRIM   ZN  INDUS  CHAS    NOX     RM    AGE     DIS  RAD  TAX  \
122   0.09299  0.0  25.65     0  0.581  5.961   92.9  2.0869    2  188   
400  25.04610  0.0  18.10     0  0.693  5.987  100.0  1.5888   24  666   
423   7.05042  0.0  18.10     0  0.614  6.103   85.1  2.0218   24  666   
447   9.92485  0.0  18.10     0  0.740  6.251   96.6  2.1980   24  666   
44    0.12269  0.0   6.91     0  0.448  6.069   40.0  5.7209    3  233   
..        ...  ...    ...   ...    ...    ...    ...     ...  ...  ...   
165   2.92400  0.0  19.58     0  0.605  6.101   93.0  2.2834    5  403   
106   0.17120  0.0   8.56     0  0.520  5.836   91.9  2.2110    5  384   
470   4.34879  0.0  18.10     0  0.580  6.167   84.0  3.0334   24  666   
149   2.73397  0.0  19.58     0  0.871  5.597   94.9  1.5257    5  403   
110   0.10793  0.0   8.56     0  0.520  6.195   54.4  2.7778    5  384   

     PTRATIO       B  LSTAT  
122     19.1  378.09  17.93  
400     20.2  396.90  26.77  
423     20.2    2.52  23.29  
447     20.2  388.52  16.44  
44      17.9  389.39   9.55  
..       ...     ...    ...  
165     14.7  240.16   9.81  
106     20.9  395.67  18.66  
470     20.2  396.90  16.29  
149     14.7  351.85  21.45  
110     20.9  393.49  13.00  

[102 rows x 13 columns]
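The 8:2 split can also be illustrated without scikit-learn. The sketch below is a simplified stand-in, not how `train_test_split` is implemented: it shuffles row indices with a seeded generator and sets aside 20% of them for testing. The function name `split_indices` is hypothetical, and its shuffle will not reproduce the exact rows that `random_state=33` selects above.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=33):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle repeatable across runs.
    random.Random(seed).shuffle(indices)
    # Round the test count up so that small datasets still get a test row.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(506)
print(len(train_idx), len(test_idx))  # 404 102
```

For 506 rows this gives 404 training rows and 102 test rows, the same sizes as in the output above. Every row lands in exactly one of the two parts, which is what keeps the test set unseen during training.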

4. Train the model: the fitted intercept is 33.046064463200565.

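# Model training (estimator)
# Inputs: xtrain, ytrain
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Create the regression object

lr = LinearRegression()
# Fit the model: learns the weights and the intercept
lr.fit(xtrain, ytrain)
# Weight coefficients
# y = w0 + w1*x1 + w2*x2 + ... + wn*xn
# lr.coef_ holds w1..wn, one per feature column
lr.coef_

# Intercept (w0)
lr.intercept_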
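Once the model is fitted, `lr.coef_` holds one weight per feature column and `lr.intercept_` holds w0, so a prediction is the intercept plus a weighted sum of the feature values. The sketch below shows that arithmetic in plain Python. The function name `predict_one`, the weights and the sample row are made up for illustration; they are not values from the fitted model.

```python
def predict_one(intercept, weights, features):
    """Compute intercept + w1*x1 + ... + wn*xn for one sample."""
    if len(weights) != len(features):
        raise ValueError("need exactly one weight per feature")
    return intercept + sum(w * x for w, x in zip(weights, features))


# Hypothetical values for a model with three features.
intercept = 10.0
weights = [2.0, -1.0, 0.5]
sample = [3.0, 4.0, 8.0]

prediction = predict_one(intercept, weights, sample)
print(prediction)  # 10 + 6 - 4 + 4 = 16.0
```

With the real model, `lr.predict(xtest)` performs this calculation for every row of `xtest`.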

5. Predict: run the trained model on the test data.

# Model prediction
y_predict = lr.predict(xtest)
y_predict
# True values
ytest
122    20.5
400     5.6
423    13.4
447    12.6
44     21.2
       ... 
165    25.0
106    19.5
470    19.9
149    15.4
110    21.7
Name: MEDV, Length: 102, dtype: float64

6. Evaluate the model

# Model evaluation: MSE
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(ytest, y_predict)
mse
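MSE is the average of the squared differences between the true and predicted values, so lower is better, and its square root (RMSE) is expressed in the same units as `MEDV`. A minimal sketch of the calculation follows, using made-up numbers rather than the model's actual predictions; the name `mean_squared_error_manual` is hypothetical.

```python
import math


def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared residuals, following the definition of MSE."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up true values and predictions, for illustration only.
y_true = [20.0, 6.0, 13.0, 12.0]
y_pred = [22.0, 5.0, 13.0, 15.0]

mse_value = mean_squared_error_manual(y_true, y_pred)
rmse_value = math.sqrt(mse_value)
print(mse_value)   # (4 + 1 + 0 + 9) / 4 = 3.5
print(rmse_value)
```

Squaring the residuals means one large miss raises the score more than several small ones, which is worth keeping in mind when comparing models on this metric.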
