[Keras Study Notes] 2: Predicting Kaggle House Prices with Multiple Linear Regression

Multiple linear regression

Training the model

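Multiple linear regression amounts to a single neural-network layer with no activation function, y = w·x + b, where w and x are vectors of the same dimension (greater than 1). In other words, the input features x1, x2, … are combined with the parameters w and b to predict the value y.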
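As a rough illustration of the formula above, the following NumPy sketch computes y = w·x + b for a small batch. The feature values, weights and bias are made up for illustration and are not taken from the Kaggle data; only the shape of the computation mirrors a single dense layer with one output.

```python
import numpy as np

# Hypothetical toy data: 2 samples with 3 features each (not from the Kaggle set).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
w = np.array([0.5, -1.0, 2.0])  # one weight per feature
b = 10.0                        # scalar bias


def linear_predict(X, w, b):
    """Return X @ w + b, i.e. one prediction per row of X."""
    return X @ w + b


y_hat = linear_predict(X, w, b)
n_params = w.size + 1  # number of weights plus the bias
```

With 36 input features the same count gives 36 + 1 = 37 parameters, which matches the `Param #` reported by `model.summary()` further down.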

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Kaggle house-price training data
df = pd.read_csv("./data/hp_train.csv")
df.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

import numpy as np
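# Handle outliers: keep only rows with GarageArea below 1200
train = df[df['GarageArea'] < 1200]
# Handle missing values: keep numeric columns only, fill gaps with the default interpolate(), then drop any rows that still contain NaN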
train = train.select_dtypes(include=[np.number]).interpolate().dropna()
train.head()
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
0 1 60 65.0 8450 7 5 2003 2003 196.0 706 ... 0 61 0 0 0 0 0 2 2008 208500
1 2 20 80.0 9600 6 8 1976 1976 0.0 978 ... 298 0 0 0 0 0 0 5 2007 181500
2 3 60 68.0 11250 7 5 2001 2002 162.0 486 ... 0 42 0 0 0 0 0 9 2008 223500
3 4 70 60.0 9550 7 5 1915 1970 0.0 216 ... 0 35 272 0 0 0 0 2 2006 140000
4 5 60 84.0 14260 8 5 2000 2000 350.0 655 ... 192 84 0 0 0 0 0 12 2008 250000

5 rows × 38 columns
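To make the interpolate-then-drop step more concrete, here is a small sketch on a made-up frame (the column names `a` and `b` and their values are hypothetical). It assumes the pandas default of linear interpolation, which fills gaps going forward and therefore leaves a leading NaN untouched; `dropna()` then removes that row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" has an interior gap, "b" has a leading gap.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 4.0, 6.0],
})

filled = toy.interpolate()  # default linear interpolation, filling forward
cleaned = filled.dropna()   # drop rows that still contain NaN
```

Note that linear interpolation borrows values from neighbouring rows, so it implicitly treats row order as meaningful. For house records that is at best a rough heuristic.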

# Extract the features and the target values
x = train.iloc[:,1:37]
y = train.iloc[:,-1]
import keras
# Initialize the model
model = keras.Sequential()
Using TensorFlow backend.
# Add a fully connected layer (output dimension 1, input dimension 36)
from keras import layers
model.add(layers.Dense(1, input_dim=36))
WARNING:tensorflow:From E:\MyProgram\Anaconda\envs\krs\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 1)                 37        
=================================================================
Total params: 37
Trainable params: 37
Non-trainable params: 0
_________________________________________________________________
# Compile the model, specifying the optimizer and the loss to minimize
model.compile(
    optimizer='adam',
    loss='mse'
)
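For reference, the `'mse'` loss is the mean of the squared differences between targets and predictions. A minimal NumPy sketch of that quantity, using made-up numbers (the helper name `mse` is only for illustration and is not the Keras implementation):

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of (y_true - y_pred) ** 2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical values: errors of 1 and 2 give (1 + 4) / 2.
loss = mse([3.0, 5.0], [2.0, 7.0])
```

Because sale prices are in the hundreds of thousands, a single prediction that is off by 10,000 already contributes 10^8 to the sum, so very large loss values during training are expected here.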
# Train the model
model.fit(x, y, epochs=3000, verbose=0)
WARNING:tensorflow:From E:\MyProgram\Anaconda\envs\krs\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.






model.predict(x)
array([[200142.2 ],
       [188487.61],
       [205551.75],
       ...,
       [218197.34],
       [115874.6 ],
       [191411.95]], dtype=float32)
y.head()
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64
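One way to read the two outputs above side by side is to compute the errors for the first three rows. The sketch below copies those three predictions and targets from the outputs; the log-scale RMSE is included because Kaggle's House Prices competition scores submissions on the RMSE between the logarithms of predicted and actual prices.

```python
import numpy as np

# First three predictions and targets, copied from the outputs above.
y_pred = np.array([200142.2, 188487.61, 205551.75])
y_true = np.array([208500.0, 181500.0, 223500.0])

abs_err = np.abs(y_true - y_pred)  # absolute error per house
rel_err = abs_err / y_true         # error relative to the true price
# RMSE on log prices, restricted to these three rows.
rmse_log = float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))
```

These rows are part of the training data, so the numbers only show how well the model fits what it has already seen, not how it would score on the test set.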

Generating predictions to submit to Kaggle

# Kaggle house-price test data
df = pd.read_csv("./data/hp_test.csv")
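# Keep numeric columns, drop the Id column, and interpolate (same preprocessing as the training set, except that no rows are filtered out or dropped)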
test = df.select_dtypes(include=np.number).drop(["Id"], axis=1).interpolate()
test.head()
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
0 20 80.0 11622 5 6 1961 1961 0.0 468.0 144.0 ... 730.0 140 0 0 0 120 0 0 6 2010
1 20 81.0 14267 6 6 1958 1958 108.0 923.0 0.0 ... 312.0 393 36 0 0 0 0 12500 6 2010
2 60 74.0 13830 5 5 1997 1998 0.0 791.0 0.0 ... 482.0 212 34 0 0 0 0 0 3 2010
3 60 78.0 9978 6 6 1998 1998 20.0 602.0 0.0 ... 470.0 360 36 0 0 0 0 0 6 2010
4 120 43.0 5005 8 5 1992 1992 0.0 263.0 0.0 ... 506.0 0 82 0 0 144 0 0 1 2010

5 rows × 36 columns
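Since the test rows are not passed through `dropna()`, it can be worth checking before predicting that the test features line up with the training features and that no NaN is left. The helper below is a hypothetical sketch (the name `check_features` and the toy frames are not part of the original notebook); in the notebook it would be called as `check_features(x, test)`.

```python
import numpy as np
import pandas as pd


def check_features(train_x, test_x):
    """Return (same_columns, n_missing) for a train/test feature pair."""
    same_columns = list(train_x.columns) == list(test_x.columns)
    n_missing = int(test_x.isna().sum().sum())
    return same_columns, n_missing


# Hypothetical toy frames standing in for x and test (values are made up).
toy_train = pd.DataFrame({"LotArea": [100, 200], "GarageArea": [10, 20]})
toy_test = pd.DataFrame({"LotArea": [300, 400], "GarageArea": [np.nan, 30.0]})

same_columns, n_missing = check_features(toy_train, toy_test)
```

A non-zero `n_missing` would mean that some predictions come out as NaN, which typically happens when a column starts with a missing value that forward interpolation cannot fill.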

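# Compute the predictions
predictions = model.predict(test)
# Build the submission CSV
submissions = pd.DataFrame()
submissions["Id"] = df.Id
# predict() returns an array of shape (n, 1); flatten it to one value per row
submissions["SalePrice"] = predictions.ravel()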
submissions.to_csv("./data/hp_upload.csv", index=False)
