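This blog post records the concepts I ran into, and how I worked through the problems, while first learning ordinary least squares regression.
Development environment: PyCharm 2018.1.2
Version: Python 2.7.14 :: Anaconda, Inc.
Ordinary least squares regression
Dataset: Cal_housing.csv
Description: census information on all block groups in California, taken from the 1990 U.S. census, covering 9 variables and 20640 observations in total.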
Variables | β_OLS (coefficient) | t_OLS (t-statistic) |
---|---|---|
INTERCEPT | 11.4939 | 275.7518 |
MEDIAN INCOME | 0.4790 | 45.7768 |
MEDIAN INCOME² | -0.0166 | -9.4841 |
MEDIAN INCOME³ | -0.0002 | -1.9157 |
ln(MEDIAN AGE) | 0.1570 | 33.6123 |
ln(TOTAL ROOMS / POPULATION) | -0.8582 | -56.1280 |
ln(BEDROOMS / POPULATION) | 0.8043 | 38.0685 |
ln(POPULATION / HOUSEHOLDS) | -0.4077 | -20.8762 |
ln(HOUSEHOLDS) | 0.0477 | 13.0792 |
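Read the data with the code below, and work out which columns are the independent variables and which one is the dependent variable:
import pandas as pd
import numpy as np
data = pd.read_csv("cal_housing.csv")
name = data.columns
X = data[name[:8]] # columns 1-8: independent variables
y = data[name[8:9]] # column 9: dependent variable, kept as a one-column DataFrame
print("X name :", name[:8])
print("y name :", name[8:9])
print(data.shape, X.shape, y.shape) # (rows, columns) of each object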
---------------------------------------------
('X name :', Index([u'longitude', u'latitude', u'housingMedianAge', u'totalRooms',
u'totalBedrooms', u'population', u'households', u'medianIncome'],
dtype='object'))
('y name :', Index([u'medianHouseValue'], dtype='object'))
((20640, 9), (20640, 8), (20640, 1))
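The column names printed above are enough to build the transformed regressors listed in the table of variables. The sketch below is one way to do that, assuming the table's variables map onto these columns in the obvious way (for example, TOTAL ROOMS / POPULATION is `totalRooms / population`). The function name `build_table_features`, the output column names, and the two toy rows are made up for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd

def build_table_features(df):
    """Build the eight regressors named in the table from the raw columns.

    Hypothetical helper: the mapping from table variables to columns is an
    assumption, not something the post states explicitly.
    """
    income = df["medianIncome"]
    return pd.DataFrame({
        "income": income,
        "income2": income ** 2,
        "income3": income ** 3,
        "ln_age": np.log(df["housingMedianAge"]),
        "ln_rooms_per_person": np.log(df["totalRooms"] / df["population"]),
        "ln_bedrooms_per_person": np.log(df["totalBedrooms"] / df["population"]),
        "ln_people_per_household": np.log(df["population"] / df["households"]),
        "ln_households": np.log(df["households"]),
    })

# Two made-up rows (not real census values), only to show the shapes involved.
toy = pd.DataFrame({
    "housingMedianAge": [10.0, 40.0],
    "totalRooms": [1000.0, 2000.0],
    "totalBedrooms": [200.0, 500.0],
    "population": [500.0, 1000.0],
    "households": [100.0, 400.0],
    "medianIncome": [2.0, 5.0],
})
features = build_table_features(toy)
print(features.shape)
```

On the real data the same call would be `build_table_features(data)`. Any row with a zero or missing count would produce `-inf` or `NaN` under the logarithm, so those rows would need to be handled first.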
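Randomly split the data into a training set and a test set
You can choose the random seed (an integer of your choice) and the test-set proportion (below 0.5, i.e. less than 50%) yourself.
seed = 8888 # random seed
proportion = 0.1 # fraction of the data held out as the test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=proportion, random_state=seed)
print(X_train.shape,X_test.shape, y_train.shape, y_test.shape)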
---------------------------------------------------
((18576, 8), (2064, 8), (18576, 1), (2064, 1))
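To see what the seed actually controls, the split can be repeated on a stand-in array of row indices. This is a small sketch, not part of the original post: `rows` and `split_rows` are hypothetical names, and the second seed (1234) is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20640)  # stand-in for the 20640 observations

def split_rows(seed, proportion=0.1):
    """Split the row indices with a given seed (hypothetical helper)."""
    return train_test_split(rows, test_size=proportion, random_state=seed)

train_a, test_a = split_rows(8888)
train_b, test_b = split_rows(8888)   # same seed again
train_c, test_c = split_rows(1234)   # a different, arbitrary seed

print(len(train_a), len(test_a))         # sizes of the two parts
print(np.array_equal(test_a, test_b))    # same seed: identical test rows
print(np.array_equal(test_a, test_c))    # different seed: different test rows
```

The sizes match the shapes printed above (18576 training rows and 2064 test rows). Fixing `random_state` makes the split reproducible, which matters when comparing models on the same held-out data.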
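Fit the regression and compute R^2 and NMSE
from sklearn.linear_model import LinearRegression
reg = LinearRegression() # linear regression estimator
res = reg.fit(X_train, y_train) # fit on the training set X_train, y_train
y_hat = res.predict(X_test) # predict the test set X_test with the fitted estimator
e = y_test-y_hat # residuals on the test set
SSE_cv = np.mean(e**2) # mean squared error of the model (np.mean, so a mean rather than a sum)
SSE_test = np.mean((y_test-np.mean(y_test))**2) # mean squared error of the naive baseline that always predicts the test-set mean
NMSE_cv = SSE_cv/SSE_test # normalized mean squared error NMSE_cv
R2_cv = 1 - NMSE_cv # coefficient of determination R2_cv
print(R2_cv)
print(NMSE_cv)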
-----------------------------------------------------------------
medianHouseValue 0.657186
dtype: float64
medianHouseValue 0.342814
dtype: float64
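The two numbers above come from comparing the model's mean squared error with that of a baseline that always predicts the mean of the test targets. The sketch below puts that calculation into one function using plain NumPy arrays. The name `nmse_and_r2` and the four-point example are made up for illustration.

```python
import numpy as np

def nmse_and_r2(y_true, y_pred):
    """Return (NMSE, R^2), with NMSE = MSE(model) / MSE(mean baseline)."""
    # Flatten both inputs so an (n, 1) array and an (n,) array cannot
    # silently broadcast into an (n, n) matrix when subtracted.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    mse_model = np.mean((y_true - y_pred) ** 2)
    mse_baseline = np.mean((y_true - y_true.mean()) ** 2)
    nmse = mse_model / mse_baseline
    return nmse, 1.0 - nmse

# Made-up example: two of the four predictions are off by 0.5.
nmse, r2 = nmse_and_r2([1, 2, 3, 4], [1.5, 2, 3, 3.5])
print(nmse, r2)
```

In this toy case the model's MSE is 0.125 and the baseline's is 1.25, so NMSE is 0.1 and R^2 is 0.9. An NMSE above 1 (negative R^2) would mean the model predicts worse than the constant mean.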
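Choose the sample size (n), the number of independent variables (p), and the coefficient values (B) yourself, along with the mean m and standard deviation s of the normal errors
seed = 8888 # random seed
n = 100 # sample size
p = 7 # number of independent variables
m = 0 # mean of the error term
s = 5 # standard deviation of the error term
B = [2, 5, 16, 9, -3, -5, -2] # true beta values
C = [2, 2]
np.random.seed(seed)
X = np.random.normal(0, 1, (n, p))
y = X.dot(B)+np.random.normal(m, s, n)
print(X.shape, y.shape)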
----------------------------------------------------------
((100L, 7L), (100L,))
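Written out, the simulation above draws data from a linear model with no intercept, and ordinary least squares has the usual closed-form estimate. The symbols $\beta$, $\varepsilon$, and $\hat{\beta}$ are notation introduced here for convenience; the code itself only uses n, p, m, s, and B, with $\beta$ playing the role of B.

```latex
y = X\beta + \varepsilon, \qquad
X \in \mathbb{R}^{n \times p}, \quad
\varepsilon_i \sim \mathcal{N}(m, s^2), \quad i = 1, \dots, n

\hat{\beta} = \arg\min_{b} \lVert y - Xb \rVert^2 = (X^\top X)^{-1} X^\top y
```

With m = 0 the errors have mean zero, so $\hat{\beta}$ should land close to the chosen B, with the spread governed by s and n.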
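Run the regression
import statsmodels.api as sm
# No intercept column is added (sm.add_constant is not called), so the model is fitted through the origin
mod = sm.OLS(y, X) # ordinary least squares model
res2 = mod.fit()
# Print R^2
print("R^2:",res2.rsquared,"\nNMSE:",1-res2.rsquared)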
----------------------------------------------------------
R^2: 0.92564484308
NMSE: 0.0743551569196
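One detail worth knowing about this R^2: when the design matrix has no constant column, statsmodels reports the uncentered version, 1 - SSR / sum(y^2), rather than the centered one that measures improvement over predicting the mean. The sketch below recomputes both with NumPy on data generated the same way as above. It is an independent check, not the author's code, and whether the random draws match the post digit for digit depends on the NumPy version in use.

```python
import numpy as np

# Regenerate data with the same settings as the simulation step.
np.random.seed(8888)
n, p = 100, 7
B = np.array([2, 5, 16, 9, -3, -5, -2], dtype=float)
X = np.random.normal(0, 1, (n, p))
y = X.dot(B) + np.random.normal(0, 5, n)

# Least-squares fit without an intercept column.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X.dot(beta_hat)
ssr = np.sum(residuals ** 2)                 # sum of squared residuals

r2_uncentered = 1.0 - ssr / np.sum(y ** 2)               # baseline: predict 0
r2_centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)    # baseline: predict mean(y)
print(r2_uncentered, r2_centered)
```

The uncentered value can never be smaller than the centered one, because sum(y^2) is at least as large as the sum of squared deviations from the mean. The "NMSE" printed as 1 - R^2 in this section is therefore relative to a predict-zero baseline, unlike the test-set NMSE computed for the housing data.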
print (res2.summary())
---------------------------------
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.920
Method:                 Least Squares   F-statistic:                     165.4
Date:                Mon, 07 May 2018   Prob (F-statistic):           1.32e-49
Time:                        09:54:25   Log-Likelihood:                -304.71
No. Observations:                 100   AIC:                             623.4
Df Residuals:                      93   BIC:                             641.7
Df Model:                           7
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.3477      0.532      4.415      0.000       1.292       3.404
x2             5.5829      0.511     10.929      0.000       4.568       6.597
x3            15.7619      0.596     26.444      0.000      14.578      16.946
x4             8.9595      0.521     17.181      0.000       7.924       9.995
x5            -3.3048      0.530     -6.233      0.000      -4.358      -2.252
x6            -4.9932      0.491    -10.175      0.000      -5.968      -4.019
x7            -2.0126      0.536     -3.754      0.000      -3.077      -0.948
==============================================================================
Omnibus:                        0.577   Durbin-Watson:                   1.970
Prob(Omnibus):                  0.749   Jarque-Bera (JB):                0.227
Skew:                          -0.078   Prob(JB):                        0.893
Kurtosis:                       3.174   Cond. No.                         1.51
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
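The `coef`, `std err`, and `t` columns in the summary follow from a few matrix operations. The t value is the coefficient divided by its standard error (for x1, 2.3477 / 0.532 is about 4.41, matching the reported 4.415 up to rounding of the displayed digits). The sketch below recomputes the three columns with NumPy under the nonrobust covariance assumption named in the summary. It regenerates the data with the same settings, and it is an independent check rather than what statsmodels does internally.

```python
import numpy as np

# Regenerate data with the same settings as the simulation step.
np.random.seed(8888)
n, p = 100, 7
B = np.array([2, 5, 16, 9, -3, -5, -2], dtype=float)
X = np.random.normal(0, 1, (n, p))
y = X.dot(B) + np.random.normal(0, 5, n)

XtX_inv = np.linalg.inv(X.T.dot(X))
beta_hat = XtX_inv.dot(X.T).dot(y)              # coef column
residuals = y - X.dot(beta_hat)
sigma2 = residuals.dot(residuals) / (n - p)     # error variance, n - p = 93 residual df
std_err = np.sqrt(sigma2 * np.diag(XtX_inv))    # std err column
t_values = beta_hat / std_err                   # t column

for i in range(p):
    print("x%d %10.4f %8.3f %8.3f" % (i + 1, beta_hat[i], std_err[i], t_values[i]))
```

Inverting X'X explicitly is fine for a 7-column example like this one. For larger or ill-conditioned designs, a least-squares solver such as `np.linalg.lstsq` is the safer choice.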
Source code and data download