import numpy as np
from sklearn import datasets
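# Load the Boston housing data from sklearn's datasets module
# (load_boston was removed in scikit-learn 1.2, so this requires an older version)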
boston = datasets.load_boston()
X = boston.data  # all feature data
y = boston.target  # target values (labels)
# A 2-D scatter plot reveals some junk points in the data; remove them
X = X[y < 50.0]
y = y[y < 50.0]
# Import the linear regression estimator from sklearn
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()  # instantiate the estimator
lin_reg.fit(X, y)  # fit the model to the data
# Now inspect the model's coef_ attribute
# For what the model's coef_ and intercept_ attributes are, see: https://www.jianshu.com/p/6a818b53a37e
In[1]: lin_reg.coef_
Out[1]:array([-1.06715912e-01, 3.53133180e-02, -4.38830943e-02, 4.52209315e-01,
-1.23981083e+01, 3.75945346e+00, -2.36790549e-02, -1.21096549e+00,
2.51301879e-01, -1.37774382e-02, -8.38180086e-01, 7.85316354e-03,
-3.50107918e-01])
In[2]:np.argsort(lin_reg.coef_) # indices that sort coef_ in ascending order
Out[2]:array([ 4, 7, 10, 12, 0, 2, 6, 9, 11, 1, 8, 3, 5])
In[3]:boston.feature_names # feature names of the Boston housing data
Out[3]:array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
In[4]:boston.feature_names[np.argsort(lin_reg.coef_)] # reorder the feature names with the same sorted indices
Out[4]:array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX',
'B', 'ZN', 'RAD', 'CHAS', 'RM'], dtype='<U7')
In[5]:print(boston.DESCR) # print the dataset description to see what each feature name means
Out[5]:.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Here are the feature names after the index sort, ordered from the most negative coefficient to the most positive:
['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX', 'B', 'ZN', 'RAD', 'CHAS', 'RM']
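To see the sign and size of each weight next to its name, the sorted indices can be applied to both arrays at once. The sketch below hard-codes the values printed in Out[1] and Out[3] rather than reloading the dataset, so it runs without `load_boston`; the variable names (`coef`, `names`, `ranked`) are only illustrative.

```python
import numpy as np

# Coefficients and feature names copied from Out[1] and Out[3] above.
coef = np.array([-1.06715912e-01, 3.53133180e-02, -4.38830943e-02, 4.52209315e-01,
                 -1.23981083e+01, 3.75945346e+00, -2.36790549e-02, -1.21096549e+00,
                 2.51301879e-01, -1.37774382e-02, -8.38180086e-01, 7.85316354e-03,
                 -3.50107918e-01])
names = np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
                  'TAX', 'PTRATIO', 'B', 'LSTAT'])

# Indices ordered from the most negative to the most positive coefficient.
order = np.argsort(coef)

# Pair every feature name with its coefficient, in that order.
ranked = [(str(names[i]), float(coef[i])) for i in order]
for name, value in ranked:
    print(f"{name:>8s} {value: .4f}")
```

Reading the printed table from the top gives the negative end of the ranking, and reading it from the bottom gives the positive end.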
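Checking against the documentation, RM, the feature with the largest weight, is the average number of rooms per dwelling. So the model tells us that houses with more rooms tend to have higher prices.
The second largest is CHAS, the Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), which records whether a house borders the Charles River (1) or not (0). In the fitted model CHAS ranks second, which means houses by the river tend to be priced higher and houses away from it lower.
And so on…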
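One way to make this reading concrete: in a linear model, a coefficient is the change in the prediction when its feature goes up by one unit and every other feature stays fixed. The sketch below uses only the RM and CHAS coefficients from Out[1]. The baseline values are hypothetical, and the intercept is set to 0 because the source does not print `intercept_` (it cancels in the difference anyway).

```python
# Coefficients for RM and CHAS, copied from Out[1] above.
COEF_RM = 3.75945346e+00
COEF_CHAS = 4.52209315e-01

def predict(rm, chas, rest=0.0, intercept=0.0):
    """Linear prediction; `rest` stands for the contribution of all other
    features, and `intercept` is a placeholder (not reported in the source)."""
    return intercept + COEF_RM * rm + COEF_CHAS * chas + rest

# Hypothetical baseline house: 6 rooms, not bordering the river.
base = predict(rm=6.0, chas=0.0)

# One more room, everything else unchanged.
delta_room = predict(rm=7.0, chas=0.0) - base

# Same house, but bordering the river.
delta_river = predict(rm=6.0, chas=1.0) - base

print(f"one extra room changes the prediction by {delta_room:.4f}")
print(f"bordering the river changes the prediction by {delta_river:.4f}")
```

Both differences are in the units of the target, which the dataset description gives as MEDV in $1000's.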
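Now look at the negative end.
NOX is the nitric oxides concentration (parts per 10 million). Nitric oxides are harmful gases, and because NOX sits at the negative end, we can tell that the higher the NOX level, the lower the house price.
And so on…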
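A caveat when comparing weights such as NOX and RM: the size of a coefficient depends on the units of its feature, while its sign does not. The sketch below illustrates this on synthetic data (not the Boston data), with `np.linalg.lstsq` as a stand-in for `LinearRegression`. The names `pollutant`, `price`, and `fit_slope` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins (NOT the Boston data): a pollutant level and a price
# that falls as the pollutant level rises.
pollutant = rng.uniform(0.4, 0.9, size=n)
price = 30.0 - 12.0 * pollutant + rng.normal(0.0, 1.0, size=n)

def fit_slope(x, y):
    """Least-squares fit of y = a + b * x; returns the slope b."""
    A = np.column_stack([np.ones_like(x), x])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

slope_raw = fit_slope(pollutant, price)
# Same measurements expressed in a unit ten times smaller.
slope_scaled = fit_slope(pollutant * 10.0, price)

print(f"slope, original units: {slope_raw:.4f}")
print(f"slope, rescaled units: {slope_scaled:.4f}")
```

The rescaled slope is one tenth of the original, yet both are negative. The direction of the relationship is therefore the safer thing to read off raw coefficients.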
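This is what it means for linear regression to give an interpretable view of our data.
More importantly, once we have this interpretability, we can deliberately collect more features that describe house prices better.
For example, we now know that the number of rooms is positively related to price. The number of rooms also largely reflects the size of a house: its floor area, how many floors it has, how big the yard is, and so on. We could collect those features as well and check whether they give a better model for predicting Boston house prices.
Likewise, NOX is negatively related to price, so we could go further and record whether there are factories emitting nitric oxides near a house, to predict Boston house prices better.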
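The idea of collecting an extra feature and checking whether the model improves can be sketched as follows. The data are entirely synthetic (the Boston dataset has no floor-area column), and `rooms`, `area`, `price`, and the helper functions are hypothetical names. The comparison uses R^2 on held-out rows, fitted with `np.linalg.lstsq` as a stand-in for `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-ins (NOT the Boston data): rooms, floor area, and a price
# that depends on both.
rooms = rng.integers(3, 9, size=n).astype(float)
area = 20.0 * rooms + rng.normal(0.0, 30.0, size=n)
price = 2.0 * rooms + 0.1 * area + rng.normal(0.0, 2.0, size=n)

# First 200 rows for fitting, remaining 100 rows for evaluation.
train, test = slice(0, 200), slice(200, n)

def design(*cols):
    """Stack feature columns behind a column of ones (the intercept)."""
    return np.column_stack([np.ones(len(cols[0])), *cols])

def holdout_r2(A, y):
    """Fit least squares on the training rows, return R^2 on the test rows."""
    w, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    resid = y[test] - A[test] @ w
    return 1.0 - np.sum(resid ** 2) / np.sum((y[test] - y[test].mean()) ** 2)

r2_rooms = holdout_r2(design(rooms), price)
r2_both = holdout_r2(design(rooms, area), price)

print(f"rooms only:   R^2 = {r2_rooms:.3f}")
print(f"rooms + area: R^2 = {r2_both:.3f}")
```

If the added feature carries information the existing ones lack, the held-out score goes up. If it does not, the score stays roughly the same, which is itself a useful answer.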