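Algorithm Review (Advanced), Linear Regression. Task 4: Linear Regression (Multivariate)

  1. Boston housing data (full dataset)
  2. Implement multivariate regression (hand-written code)
  3. Data standardization (hand-written code)
  4. Grid search for hyperparameter tuning
  5. Comparison with from sklearn.linear_model import LinearRegression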

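1. Import the data
# Note: load_boston was removed in scikit-learn 1.2; this code needs an older release.
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.feature_names)  # names of the 13 feature columns
print(boston.DESCR)          # full text description of the dataset

Output: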
Notes

Data Set Characteristics:

:Number of Instances: 506 

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic
prices and the demand for clean air’, J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics
…’, Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

References

  • Belsley, Kuh & Welsch, ‘Regression diagnostics: Identifying Influential Data and Sources of Collinearity’, Wiley, 1980. 244-261.
  • Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
  • many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

2. Merge the data
import pandas as pd
import numpy as np
x = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=['Price'])
# x.head()
# y.head()
df = pd.concat([x, y], axis=1)  # axis=1 concatenates column-wise, i.e. side by side
df.head()

Output:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0

PTRATIO B LSTAT Price
0 15.3 396.90 4.98 24.0
1 17.8 396.90 9.14 21.6
2 17.8 392.83 4.03 34.7
3 18.7 394.63 2.94 33.4
4 18.7 396.90 5.33 36.2
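
The column-wise merge above works because both frames share the same default row index. As a minimal sketch on toy data (the frames and values below are hypothetical, not taken from the Boston set), this shows what pd.concat with axis=1 does when the indexes match and when they do not:

```python
import pandas as pd

# Toy frames standing in for the feature table and the target column.
features = pd.DataFrame({"RM": [6.5, 6.4, 7.1], "LSTAT": [4.9, 9.1, 4.0]})
price = pd.DataFrame({"Price": [24.0, 21.6, 34.7]})

# Same index on both sides: rows line up one to one.
aligned = pd.concat([features, price], axis=1)

# Shifted index on the target: concat takes the union of the indexes
# and fills the gaps with NaN instead of raising an error.
shifted = price.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([features, shifted], axis=1)

print(aligned.shape)                       # (3, 3)
print(misaligned.shape)                    # (4, 3)
print(int(misaligned.isna().sum().sum()))  # 3
```

If the two frames come from different sources, checking the shape or the NaN count after the merge is a cheap way to catch misaligned rows.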

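3. Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
bostonStd_1 = scaler.fit_transform(df.values)
# Standardization rescales each column to mean 0 (centering) and variance 1 (scaling).
# It does not change the shape of the distribution, so the data does not become normally distributed.

bostonStd_1.shape  # (506, 14)
showbostonStd_1 = pd.DataFrame(bostonStd_1, columns=df.columns)
showbostonStd_1.head()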
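
The task list asks for a hand-written version of standardization. The following is a minimal NumPy sketch of the same per-column z-score, z = (x - mean) / std. The function name standardize and the random data are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

def standardize(data):
    """Return (z, mean, std); each non-constant column of z has mean 0 and std 1."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)                   # population std (ddof=0), as StandardScaler uses
    std_safe = np.where(std == 0, 1.0, std)  # constant columns become 0 instead of dividing by 0
    return (data - mean) / std_safe, mean, std

# Hypothetical data: 200 rows, 3 columns on very different scales.
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0.5, 300.0, 12.0], scale=[0.1, 150.0, 7.0], size=(200, 3))

z, mean, std = standardize(raw)
print(np.allclose(z.mean(axis=0), 0.0))  # True
print(np.allclose(z.std(axis=0), 1.0))   # True
```

The function returns mean and std so they can be reused on new rows. A common practice is to compute them on the training rows only and apply the same values to the test rows.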

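4. Model training
from sklearn.model_selection import train_test_split

# X is the first 13 columns of the standardized data (all rows); y is the 14th column (index 13).
# Selecting [0, 13] instead of :13 would put the target itself into X and leak it into the model.
# test_size=0.3 puts 30% of the rows in the test set and 70% in the training set.
X_train, X_test, y_train, y_test = train_test_split(
    bostonStd_1[:, :13], bostonStd_1[:, [13]], test_size=0.3, random_state=0)

from sklearn.linear_model import LinearRegression
regr = LinearRegression()
model = regr.fit(X_train, y_train)
# Predict with the trained model
y_pred = regr.predict(X_test)
# print(y_pred)
print(model)
print('Intercept: {}, coefficients: {}'.format(regr.intercept_, regr.coef_))

from sklearn.metrics import mean_squared_error   # mean squared error (MSE)
from sklearn.metrics import mean_absolute_error  # mean absolute error (MAE)
from sklearn.metrics import r2_score             # R squared
# Compute the metrics
mse_test1 = mean_squared_error(y_test, y_pred)
mae_test1 = mean_absolute_error(y_test, y_pred)
rmse_test1 = mse_test1 ** 0.5
r2_score1 = r2_score(y_test, y_pred)
print('MSE: {}, MAE: {},\nRMSE: {}, R squared: {}'.format(mse_test1, mae_test1, rmse_test1, r2_score1))

Output (recorded from the original run, which passed columns [0, 13] as X; the target was therefore one of the features, which is why the fit is exact. With the 13-feature split above the errors will be larger and R squared below 1):
MSE: 3.993103330409359e-32, MAE: 1.44564093882883e-16,
RMSE: 1.9982750887726542e-16, R squared: 1.0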
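
The task list also asks for a hand-written multivariate regression. The sketch below is one possible version, not the original author's code: it solves the least-squares problem with np.linalg.lstsq and computes the four metrics by hand. The data is synthetic and the helper names (fit_linear, predict, metrics) are made up for illustration. The last part shows why a recorded R squared of exactly 1.0 with errors near 1e-16 points to the target being among the features.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y = b0 + X @ w; returns (intercept, weights)."""
    A = np.column_stack([np.ones(len(X)), X])      # column of ones for the intercept
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes ||A @ theta - y||^2
    return theta[0], theta[1:]

def predict(X, intercept, weights):
    return intercept + X @ weights

def metrics(y_true, y_pred):
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mse": mse, "mae": mae, "rmse": mse ** 0.5, "r2": r2}

# Hypothetical data: 3 features, known weights, Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_w + rng.normal(scale=0.5, size=300)

# Simple 70/30 split; the rows are already in random order.
X_train, X_test, y_train, y_test = X[:210], X[210:], y[:210], y[210:]

b0, w = fit_linear(X_train, y_train)
honest = metrics(y_test, predict(X_test, b0, w))

# Leak the target in as an extra feature column: the fit becomes exact.
Xl_train = np.column_stack([X_train, y_train])
Xl_test = np.column_stack([X_test, y_test])
b0_l, w_l = fit_linear(Xl_train, y_train)
leaky = metrics(y_test, predict(Xl_test, b0_l, w_l))

print(honest["r2"] < 1.0)      # True: noise keeps the fit from being perfect
print(leaky["r2"] > 0.999999)  # True: the model just copies the leaked column
```

In the leaky fit the weight on the extra column comes out as 1 and all other weights as 0 (up to rounding), so the metrics say nothing about how well the real features predict the price.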

5. Grid search for hyperparameter tuning

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.neighbors import KNeighborsRegressor

grid_param = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, grid_param, n_jobs=-1, verbose=2, cv=10)

# Fit on the continuous target; casting it to int would truncate the standardized values.
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)  # mean cross-validated R squared of the best candidate
print(grid_search.best_estimator_.score(X_test, y_test))  # R squared on the test set
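
To see how much work the search above does, the parameter grid can be expanded by hand. This is a standard-library sketch of the enumeration only; expand_grid is a hypothetical helper written for illustration and is not scikit-learn's implementation.

```python
from itertools import product

def expand_grid(grid_param):
    """Yield every parameter combination from a list of grid dicts."""
    for grid in grid_param:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))

grid_param = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]

candidates = list(expand_grid(grid_param))
n_folds = 10  # matches cv=10 in the search
print(len(candidates))            # 60
print(len(candidates) * n_folds)  # 600
print(candidates[0])              # {'n_neighbors': 1, 'weights': 'uniform'}
```

The first dict contributes 10 candidates and the second 10 x 5 = 50, so 10-fold cross-validation trains 600 models. With the default refit setting, the best candidate is then trained once more on the whole training set.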
