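Multiple linear regression is equivalent to a single fully connected neural-network layer with no activation: y = w·x + b, where w and x are vectors of the same dimension (greater than 1). In other words, the input features x1, x2, … are combined with the parameters w and b to predict the value y.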
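The formula can be made concrete with a small NumPy sketch. The feature values, weights, and bias below are hypothetical numbers chosen only for illustration; they do not come from the housing data.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate y = w . x + b
x = np.array([2.0, 3.0, 5.0])    # three input features x1, x2, x3
w = np.array([0.5, -1.0, 2.0])   # one weight per feature
b = 4.0                          # scalar bias

# Single sample: weighted sum of the features plus the bias
y_hat = np.dot(w, x) + b
print(y_hat)  # 12.0

# Batch of samples: one row per sample, same weights and bias for every row
X = np.array([[2.0, 3.0, 5.0],
              [1.0, 0.0, 1.0]])
y_batch = X @ w + b
print(y_batch)  # [12.   6.5]
```

A Dense layer with one unit performs the batch computation above, with w and b learned from data instead of fixed by hand.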
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Training data from the Kaggle house prices competition
df = pd.read_csv("./data/hp_train.csv")
df.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
import numpy as np
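# Remove outliers: keep only rows with GarageArea below 1200
train = df[df['GarageArea'] < 1200]
# Handle missing values: keep only the numeric columns, fill gaps with the default interpolate(), then drop any rows that still contain NaN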
train = train.select_dtypes(include=[np.number]).interpolate().dropna()
train.head()
| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | ... | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | 208500 |
1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | ... | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | 181500 |
2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | ... | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | 223500 |
3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | ... | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | 140000 |
4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | ... | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | 250000 |
5 rows × 38 columns
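To see what the `select_dtypes(...).interpolate().dropna()` chain does, here is a minimal sketch on a toy frame. The column names and values are hypothetical. With its default settings, `interpolate()` fills gaps linearly and carries the last valid value forward, but it leaves a NaN at the very start of a column untouched, which is why the final `dropna()` still matters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two numeric columns with gaps and one text column
toy = pd.DataFrame({
    "a": [np.nan, 1.0, np.nan, 3.0],
    "b": [10.0, np.nan, 30.0, np.nan],
    "label": ["x", "y", "z", "w"],
})

numeric = toy.select_dtypes(include=[np.number])  # drops the text column
filled = numeric.interpolate()                    # default: linear, forward direction
cleaned = filled.dropna()                         # removes rows that still hold NaN

print(filled)
# Column "a" keeps its leading NaN; column "b" becomes 10, 20, 30, 30
print(cleaned.shape)  # (3, 2)
```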
# Split into features (the 36 columns between Id and SalePrice) and target (SalePrice)
x = train.iloc[:,1:37]
y = train.iloc[:,-1]
import keras
# Initialize the model
model = keras.Sequential()
Using TensorFlow backend.
# Add a fully connected layer (output dimension 1, input dimension 36)
from keras import layers
model.add(layers.Dense(1, input_dim=36))
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 1) 37
=================================================================
Total params: 37
Trainable params: 37
Non-trainable params: 0
_________________________________________________________________
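The 37 parameters in the summary are the 36 weights (one per input feature) plus a single bias. A minimal sketch of that count follows; `dense_param_count` is a hypothetical helper written for this illustration, not a Keras function.

```python
def dense_param_count(input_dim, units, use_bias=True):
    """Number of trainable parameters in a fully connected layer.

    Hypothetical helper for illustration: one weight per (input, unit) pair,
    plus one bias per unit when a bias is used.
    """
    weights = input_dim * units
    biases = units if use_bias else 0
    return weights + biases

print(dense_param_count(36, 1))  # 37
```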
# Compile the model, specifying the optimizer and the loss to minimize
model.compile(
optimizer='adam',
loss='mse'
)
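The `mse` loss is the mean of the squared differences between predictions and targets. The sketch below computes it for a linear model on hypothetical toy data and applies one plain gradient-descent update. Note the swap: Adam, the optimizer used in this notebook, adapts the step size per parameter, so this is a simplified stand-in and not what Keras runs internally.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([5.0, 4.0, 8.0])

w = np.zeros(2)  # weights start at zero for this illustration
b = 0.0          # bias

def mse(X, y, w, b):
    """Mean squared error of the linear model X @ w + b."""
    residual = X @ w + b - y
    return np.mean(residual ** 2)

loss_before = mse(X, y, w, b)
print(loss_before)  # 35.0

# Gradients of the MSE with respect to w and b
residual = X @ w + b - y
grad_w = 2.0 * (X.T @ residual) / len(y)
grad_b = 2.0 * residual.mean()

# One plain gradient-descent step (learning rate is an arbitrary choice)
lr = 0.01
w = w - lr * grad_w
b = b - lr * grad_b

loss_after = mse(X, y, w, b)
print(loss_after < loss_before)  # True
```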
# Train the model
model.fit(x, y, epochs=3000, verbose=0)
model.predict(x)
array([[200142.2 ],
[188487.61],
[205551.75],
...,
[218197.34],
[115874.6 ],
[191411.95]], dtype=float32)
y.head()
0 208500
1 181500
2 223500
3 140000
4 250000
Name: SalePrice, dtype: int64
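Comparing the two outputs by eye is easier with explicit error figures. The sketch below uses only the first three predictions and the first three true prices printed above; it is a rough sanity check on three rows, not an evaluation of the model.

```python
import numpy as np

# First three predictions and true prices, copied from the outputs above
pred = np.array([200142.2, 188487.61, 205551.75])
true = np.array([208500.0, 181500.0, 223500.0])

abs_err = np.abs(pred - true)                   # absolute error per house
rel_err = abs_err / true                        # error as a fraction of the true price
rmse = np.sqrt(np.mean((pred - true) ** 2))     # root mean squared error on these rows

print(abs_err)              # [ 8357.8   6987.61 17948.25]
print(np.round(rel_err, 3)) # roughly 4% to 8% of the true price
print(rmse)
```

Because these rows were part of the training data, the errors say nothing about how the model behaves on unseen houses.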
# Test data from the Kaggle house prices competition
df = pd.read_csv("./data/hp_test.csv")
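# Keep the numeric columns, drop the Id column, and interpolate as for the training set
# (no rows are dropped here, because every test Id needs a prediction)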
test = df.select_dtypes(include=np.number).drop(["Id"], axis=1).interpolate()
test.head()
| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20 | 80.0 | 11622 | 5 | 6 | 1961 | 1961 | 0.0 | 468.0 | 144.0 | ... | 730.0 | 140 | 0 | 0 | 0 | 120 | 0 | 0 | 6 | 2010 |
1 | 20 | 81.0 | 14267 | 6 | 6 | 1958 | 1958 | 108.0 | 923.0 | 0.0 | ... | 312.0 | 393 | 36 | 0 | 0 | 0 | 0 | 12500 | 6 | 2010 |
2 | 60 | 74.0 | 13830 | 5 | 5 | 1997 | 1998 | 0.0 | 791.0 | 0.0 | ... | 482.0 | 212 | 34 | 0 | 0 | 0 | 0 | 0 | 3 | 2010 |
3 | 60 | 78.0 | 9978 | 6 | 6 | 1998 | 1998 | 20.0 | 602.0 | 0.0 | ... | 470.0 | 360 | 36 | 0 | 0 | 0 | 0 | 0 | 6 | 2010 |
4 | 120 | 43.0 | 5005 | 8 | 5 | 1992 | 1992 | 0.0 | 263.0 | 0.0 | ... | 506.0 | 0 | 82 | 0 | 0 | 144 | 0 | 0 | 1 | 2010 |
5 rows × 36 columns
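Since test rows cannot be dropped, any NaN that survives interpolation would flow into the model and produce a NaN prediction. Whether this happens in the real test file is not shown here. The sketch below, on hypothetical values, counts leftover NaNs and fills them with the column median; the median fallback is an assumption of this example and is not part of the original preprocessing.

```python
import numpy as np
import pandas as pd

# Hypothetical test features; LotFrontage starts with a missing value
features = pd.DataFrame({
    "LotFrontage": [np.nan, 80.0, np.nan, 60.0],
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
})

filled = features.interpolate()

# Count the values that interpolation could not fill (here: the leading NaN)
remaining = int(filled.isna().sum().sum())
print(remaining)  # 1

# Assumed fallback: replace leftovers with each column's median
filled = filled.fillna(filled.median())
print(filled["LotFrontage"].tolist())  # [70.0, 80.0, 70.0, 60.0]
```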
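# Compute the predictions
predictions = model.predict(test)
# Build the submission CSV
submissions = pd.DataFrame()
submissions["Id"] = df.Id
# model.predict returns an array of shape (n, 1); flatten it to one value per row
submissions["SalePrice"] = predictions.flatten()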
submissions.to_csv("./data/hp_upload.csv", index=False)
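Before uploading, it can help to check that the submission has one price per Id and no missing values. The sketch below uses hypothetical Ids and predictions, shaped `(n, 1)` like the `model.predict` output shown earlier; the specific checks are a suggestion, not a Kaggle requirement quoted from anywhere.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df.Id and the output of model.predict
ids = pd.Series([1461, 1462, 1463], name="Id")
preds = np.array([[120000.0], [95000.5], [180000.0]], dtype=np.float32)

# Flatten the (n, 1) array so each row gets a single scalar price
submission = pd.DataFrame({"Id": ids, "SalePrice": preds.ravel()})

# Sanity checks: one row per Id, no missing predictions
assert len(submission) == len(ids)
assert submission["SalePrice"].notna().all()

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])  # Id,SalePrice
```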