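Predicting Boston House Prices with Regression Analysis

Last week I finished a regression analysis to predict Boston house prices. Because the dataset is small, the prediction accuracy fluctuated a lot from run to run. This post summarises the regression workflow I followed.

One point to keep clear: the statistical relationship studied in regression analysis is an empirical relationship established from a large number of observations or experiments. It is not necessarily a causal relationship.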

Basic concepts

explanatory/independent/predictor variables, risk factor, and feature all refer to the independent variable x.

explained/dependent/predicted/responding/experimental/outcome variables and label all refer to the target y.

Data source

The Boston house price data from Python's sklearn module: 506 housing records, each with 13 features. The target y is the median house value. The attributes are:

- CRIM per capita crime rate by town

- ZN proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS proportion of non-retail business acres per town

- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- NOX nitric oxides concentration (parts per 10 million)

- RM average number of rooms per dwelling

- AGE proportion of owner-occupied units built prior to 1940

- DIS weighted distances to five Boston employment centres

- RAD index of accessibility to radial highways

- TAX full-value property-tax rate per $10,000

- PTRATIO pupil-teacher ratio by town

- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

- LSTAT % lower status of the population

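Regression analysis workflow:

1 Based on observation, statistics, or business requirements, determine the relationship between the independent variables x and the target y, i.e. the regression equation, and statistically test how reliable that equation is.

2 Run significance tests on the independent variables x to determine which of them actually affect the target y.

3 Use the fitted regression equation to predict the target for future observations, and report a confidence interval with the prediction.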
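The three steps can be sketched with ordinary least squares on synthetic data. This is only an illustration under my own assumptions: the data, the coefficients, and the new point x_new are made up, and the interval computed in step 3 is the classical 95% prediction interval for a single new observation, which may not be the exact interval you want to report.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
# Synthetic target: depends on x0 and x1, not on x2 (an assumption for this demo).
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Step 1: fit the regression equation and test it as a whole (overall F-test).
A = np.column_stack([np.ones(n), X])          # design matrix with intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # OLS coefficients
resid = y - A @ beta
p = X.shape[1]
dof = n - p - 1                               # residual degrees of freedom
rss = resid @ resid
tss = np.sum((y - y.mean()) ** 2)
f_stat = ((tss - rss) / p) / (rss / dof)
f_pvalue = stats.f.sf(f_stat, p, dof)

# Step 2: t-test each coefficient to see which variables matter.
sigma2 = rss / dof                            # estimated noise variance
xtx_inv = np.linalg.inv(A.T @ A)
std_err = np.sqrt(sigma2 * np.diag(xtx_inv))
t_stats = beta / std_err
t_pvalues = 2 * stats.t.sf(np.abs(t_stats), dof)

# Step 3: predict a new observation with a 95% prediction interval.
x_new = np.array([1.0, 0.5, -0.2, 0.1])       # leading 1.0 is the intercept term
y_hat = x_new @ beta
se_pred = np.sqrt(sigma2 * (1.0 + x_new @ xtx_inv @ x_new))
t_crit = stats.t.ppf(0.975, dof)
lower, upper = y_hat - t_crit * se_pred, y_hat + t_crit * se_pred

print("F p-value:", f_pvalue)
print("coefficient p-values:", t_pvalues)
print("prediction:", y_hat, "interval:", (lower, upper))
```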

Data preprocessing

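The Boston house price data is fairly clean, but the features are on different scales, so data scaling is needed. Exactly what preparation or criteria scaling calls for is something I still need to study further. For this data I used sklearn.preprocessing.scale, which by default performs standardisation: each feature value has the feature's mean subtracted and is then divided by the feature's standard deviation. If the data is right/left skewed, a logarithmic transformation may be needed, but taking the log requires all values to be greater than 0.

Code: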

tra_data = pd.DataFrame(sklearn.preprocessing.scale(ori_data.data[tra_index]),columns=ori_data.feature_names)[features_selected]
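To make the standardisation concrete, here is a small sketch on synthetic data (the arrays X_train, X_test, and skewed are stand-ins I made up, not the Boston data). It checks that scale matches the manual formula, and shows one possible alternative, StandardScaler, for the case where you want the test set scaled with the training set's mean and standard deviation. It also applies a log transform to a strictly positive, right-skewed sample.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

rng = np.random.default_rng(0)
# Three synthetic features on very different scales.
loc, spread = [0.0, 50.0, 300.0], [1.0, 10.0, 100.0]
X_train = rng.normal(loc=loc, scale=spread, size=(80, 3))
X_test = rng.normal(loc=loc, scale=spread, size=(20, 3))

# Standardisation by hand: subtract the column mean, divide by the column std.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
scaled = scale(X_train)

# StandardScaler learns mean/std on the training set and reuses them on the test set.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Log transform of a right-skewed, strictly positive sample.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
logged = np.log(skewed)

print(np.allclose(manual, scaled))
print(X_train_std.mean(axis=0).round(6), X_train_std.std(axis=0).round(6))
```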

Cross-validation (hold-out split)

Randomly select 80% of the data as the training set and use the remaining 20% as the test set.
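A minimal sketch of the 80/20 split, using synthetic data with the same shape as the Boston set (506 rows, 13 features). The weights in w and the random_state values are arbitrary choices of mine. Since a single split on a small dataset can give unstable scores, the sketch also shows 5-fold cross-validation with cross_val_score as one possible way to get a steadier estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
w = np.linspace(-2.0, 2.0, 13)                 # made-up true weights
y = X @ w + rng.normal(scale=1.0, size=506)

# Hold-out split: 80% training, 20% test.
tra_data, test_data, tra_label, test_label = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validation: every sample is used for testing exactly once.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(len(tra_data), len(test_data))
print("R2 per fold:", scores.round(3), "mean:", scores.mean().round(3))
```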

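Regression

I compared three methods: linear regression, Lasso regression, and ridge regression. All three are used the same way, shown here with Lasso: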

from sklearn.linear_model import LinearRegression,Lasso,Ridge

lasso = Lasso(alpha=alpha_for_lasso)

lasso.fit(tra_data,tra_label)

These three lines train the Lasso model. Calling pred_label_lasso = lasso.predict(test_data) then returns the predicted label values.
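Since the three estimators share the same fit/predict interface, they can be compared in one loop. The sketch below runs on synthetic data from make_regression, and the alpha values are placeholders rather than tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston data.
X, y = make_regression(
    n_samples=506, n_features=13, n_informative=5, noise=10.0, random_state=0
)
tra_data, test_data, tra_label, test_label = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),   # alpha is a placeholder, not a tuned value
    "ridge": Ridge(alpha=1.0),
}

predictions = {}
for name, model in models.items():
    model.fit(tra_data, tra_label)                 # same call for all three models
    predictions[name] = model.predict(test_data)

for name, pred in predictions.items():
    print(name, pred[:3].round(2))
```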

Performance evaluation

Unlike classification, regression analysis uses MSE as its main evaluation metric:

from sklearn.metrics import mean_squared_error

mse_lasso = mean_squared_error(test_label, pred_label_lasso)

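MSE is the mean squared error; the smaller the better.

The R2 score shows how well the model (i.e. the regression parameters w) fits the data. Its maximum is 1 and larger is better. On the training data of an ordinary least-squares fit with an intercept it lies between 0 and 1, but on a test set it can be negative when the model predicts worse than simply using the mean of the labels.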

from sklearn.metrics import r2_score as r2

r2_lasso = r2(test_label,pred_label_lasso)
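To show what the two metrics compute, here is a small worked example with hand-picked numbers (the arrays are invented for illustration). The helper functions mse_manual and r2_manual are my own names. They follow the textbook definitions and are compared against sklearn's results. The second prediction is deliberately bad, to show that R2 on held-out data can drop below 0.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

test_label = np.array([3.0, 5.0, 7.0, 9.0])
pred_good = np.array([2.5, 5.5, 7.0, 9.0])   # close to the labels
pred_bad = np.array([9.0, 7.0, 5.0, 3.0])    # reversed order, worse than the mean


def mse_manual(y_true, y_pred):
    """Mean of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)


def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


print(mse_manual(test_label, pred_good), mean_squared_error(test_label, pred_good))
print(r2_manual(test_label, pred_good), r2_score(test_label, pred_good))
print(r2_manual(test_label, pred_bad), r2_score(test_label, pred_bad))
```

Here the good prediction gives MSE 0.125 and R2 0.975, while the bad one gives R2 -3.0.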

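Significance of features for the label, and multicollinearity among features

The simplest approach is to compute the correlation between each feature and the label, and the pairwise correlations among the features.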
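A sketch of the correlation check with pandas. The columns borrow Boston attribute names, but the values are synthetic and the relationships between them (for example RAD built from TAX) are invented for this demo.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
rm = rng.normal(6.0, 0.7, n)
lstat = rng.normal(12.0, 5.0, n)
tax = rng.normal(400.0, 100.0, n)
rad = 0.05 * tax + rng.normal(0.0, 1.0, n)        # deliberately collinear with TAX
medv = 5.0 * rm - 0.6 * lstat + rng.normal(0.0, 2.0, n)

df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "TAX": tax, "RAD": rad, "MEDV": medv})

corr = df.corr()                                   # Pearson correlation matrix
label_corr = corr["MEDV"].drop("MEDV")             # feature-vs-label correlations
feature_corr = corr.drop(index="MEDV", columns="MEDV")  # feature-vs-feature

print(label_corr.round(2))
print(feature_corr.round(2))
```

A large absolute value in label_corr suggests a feature that is linearly related to the label, while a large off-diagonal value in feature_corr (TAX and RAD here) hints at multicollinearity.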

An F-test can be used to compute the F-value and p-value of each feature with respect to the label.
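One way to run this test is sklearn.feature_selection.f_regression, which scores each feature separately against the label with a univariate linear fit. The data below is synthetic: by construction only the first two columns influence y.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Only columns 0 and 1 drive the label (an assumption of this demo).
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=1.0, size=300)

# One F-value and one p-value per feature, each tested on its own.
f_values, p_values = f_regression(X, y)

for i, (f, p) in enumerate(zip(f_values, p_values)):
    print(f"feature {i}: F={f:.2f}, p={p:.4f}")
```

Because each feature is tested in isolation, this check does not reveal multicollinearity; it only ranks features by their individual linear relationship with the label.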

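Recursive feature elimination (RFE) can also be used to select features. It repeatedly builds a model (SVM, linear regression, etc.), picks the best or worst feature according to the size of its coefficient, sets that feature aside, and repeats the process on the remaining features.

Lasso regression can automatically shrink the weights of unimportant features to 0, so it is a workable option both for feature selection and for avoiding multicollinearity.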
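A short sketch of RFE with sklearn.feature_selection.RFE and a linear regression as the underlying model. The data is synthetic, with only the first three columns carrying signal, and n_features_to_select=3 is chosen to match that setup rather than derived from anything.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Columns 0, 1, 2 carry signal; columns 3, 4, 5 are pure noise.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=1.0, size=300)

# Drop the feature with the smallest coefficient magnitude, one per round.
selector = RFE(LinearRegression(), n_features_to_select=3, step=1)
selector.fit(X, y)

print("kept:", selector.support_)      # boolean mask of the selected features
print("ranking:", selector.ranking_)   # 1 = selected, larger = eliminated earlier
```

Ranking by coefficient size only makes sense when the features are on comparable scales, which is one more reason to standardise first.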
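The zeroing effect of Lasso is easy to see on synthetic data where only two of six features matter. The value alpha=0.5 is an arbitrary choice for this demo; in practice it would need tuning.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only columns 0 and 1 influence the label in this made-up example.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=1.0, size=300)

lasso = Lasso(alpha=0.5).fit(X, y)

# Features whose weight was not shrunk to exactly zero.
selected = np.flatnonzero(lasso.coef_)

print("coefficients:", lasso.coef_.round(3))
print("selected feature indices:", selected)
```

The surviving coefficients are also shrunk towards 0 (below the true 4 and -3 in magnitude), which is the price of the L1 penalty.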

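How to handle more features than instances

The main risk when features outnumber instances is overfitting. In a random forest, features can be sampled at random for each tree. Lasso regression performs feature selection automatically, so it can also handle this case. Checking the correlations among features and dropping redundant ones is another option.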
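The overfitting risk can be illustrated on synthetic data with 200 features but only 40 training instances. All numbers below (sizes, coefficients, alpha=0.3) are assumptions for the demo. Plain linear regression fits the training set almost perfectly but generalises poorly, while Lasso keeps only a few features.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, n_features = 40, 200, 200
X = rng.normal(size=(n_train + n_test, n_features))
# Only the first two of the 200 features actually matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_train + n_test)

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

ols = LinearRegression().fit(X_tr, y_tr)
lasso = Lasso(alpha=0.3).fit(X_tr, y_tr)

mse_train_ols = mean_squared_error(y_tr, ols.predict(X_tr))
mse_test_ols = mean_squared_error(y_te, ols.predict(X_te))
mse_test_lasso = mean_squared_error(y_te, lasso.predict(X_te))
n_kept = np.count_nonzero(lasso.coef_)

print("OLS   train MSE:", mse_train_ols, "test MSE:", mse_test_ols)
print("Lasso test MSE:", mse_test_lasso, "features kept:", n_kept)
```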

More notes are in the notebook.