References: https://zhuanlan.zhihu.com/p/335673241
https://blog.csdn.net/wangc1994/article/details/100760804
import numpy as np
import pandas as pd
import torch
from sklearn import preprocessing

# Load the data
train_data = pd.read_csv("./data/train.csv")
test_data = pd.read_csv("./data/test.csv")
GrLivArea: Above grade (ground) living area square feet
# Remove outliers in above-ground living area: very large houses sold at a low price
train_data = train_data.drop(train_data[(train_data['SalePrice'] < 300000) & (train_data['GrLivArea'] > 4000)].index)
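Before dropping rows from the real data, the same filter can be tried on a small made-up frame to see which rows it flags. This is only a sketch: the `toy` DataFrame and its values are invented for illustration and do not come from train.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real train.csv
toy = pd.DataFrame({
    "GrLivArea": [1500, 4500, 5000, 4200],
    "SalePrice": [200000, 180000, 160000, 700000],
})

# Flag rows with a very large living area but a low sale price
outlier_mask = (toy["SalePrice"] < 300000) & (toy["GrLivArea"] > 4000)

# Drop the flagged rows by index, as in the line above
cleaned = toy.drop(toy[outlier_mask].index)

print(outlier_mask.sum())  # number of rows flagged as outliers
print(cleaned)
```

The large house that sold for a high price is kept, since only the combination of large area and low price is treated as an outlier.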
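# Combine train and test features so both get the same preprocessing
# (assumed: the first column is Id and the last train column is SalePrice)
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
# List every column with missing values, sorted by count
miss_data = all_features.isnull().sum()
miss_data_top = miss_data[miss_data > 0].sort_values(ascending=False)
print(miss_data_top)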
PoolQC 2908
MiscFeature 2812
Alley 2719
Fence 2346
FireplaceQu 1420
LotFrontage 486
GarageFinish 159
GarageQual 159
GarageCond 159
GarageYrBlt 159
GarageType 157
BsmtExposure 82
BsmtCond 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MasVnrArea 23
MSZoning 4
BsmtFullBath 2
BsmtHalfBath 2
Functional 2
Utilities 2
GarageArea 1
GarageCars 1
Electrical 1
KitchenQual 1
TotalBsmtSF 1
BsmtUnfSF 1
BsmtFinSF2 1
BsmtFinSF1 1
Exterior2nd 1
Exterior1st 1
SaleType 1
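Raw counts like these are easier to judge as a fraction of all rows. A minimal sketch on an invented frame follows; `toy` and `summary` are hypothetical names, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().sum() counts missing values; isnull().mean() gives the missing fraction
summary = pd.DataFrame({
    "count": toy.isnull().sum(),
    "ratio": toy.isnull().mean(),
})

# Keep only columns that have missing values, largest count first
summary = summary[summary["count"] > 0].sort_values("count", ascending=False)
print(summary)
```

A column that is missing in most rows is a candidate for dropping, while a column with only a few gaps is usually worth filling.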
# Drop the columns where almost every value is missing
drop_cols = ['PoolQC', 'MiscFeature', 'Alley', 'Fence']
all_features = all_features.drop(columns=drop_cols)
# Fill with the string "None": in these columns a missing value means the feature is absent
cols1 = ["FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType",
         "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    all_features[col] = all_features[col].fillna("None")
# Fill with 0
cols2 = ["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols2:
    all_features[col] = all_features[col].fillna(0)
# Fill with the mode (most frequent value)
cols3 = ["MSZoning", "BsmtFullBath", "BsmtHalfBath", "Utilities", "Functional", "Electrical",
         "KitchenQual", "SaleType", "Exterior1st", "Exterior2nd"]
for col in cols3:
    all_features[col] = all_features[col].fillna(all_features[col].mode()[0])
# Fill with the mean
all_features["LotFrontage"] = all_features["LotFrontage"].fillna(np.mean(all_features["LotFrontage"]))
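The four fill strategies above can also be written as one dictionary passed to `fillna`, which makes it easy to check afterwards that nothing is still missing. This is a sketch on invented data; `toy`, `fill_values` and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "FireplaceQu": ["Gd", np.nan, "TA"],
    "GarageArea": [548.0, np.nan, 608.0],
    "MSZoning": ["RL", "RL", np.nan],
    "LotFrontage": [65.0, np.nan, 75.0],
})

# One fill value per column, mirroring the four strategies above
fill_values = {
    "FireplaceQu": "None",                      # missing means "no fireplace"
    "GarageArea": 0,                            # no garage, so zero area
    "MSZoning": toy["MSZoning"].mode()[0],      # most frequent category
    "LotFrontage": toy["LotFrontage"].mean(),   # column mean
}
filled = toy.fillna(value=fill_values)

# Total number of missing values left after filling
print(filled.isnull().sum().sum())
print(filled)
```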
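Total basement area
TotalBsmtSF: Total square feet of basement area
Lot area
LotArea: Lot size in square feet
Some columns are stored as integers or floats, but the numbers only act as labels that abbreviate a category; they are not quantities.
Dwelling type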
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
...
# Convert numeric columns that are really categories to nominal (string) type
num2std = ['MSSubClass', 'YrSold', 'MoSold']
for col in num2std:
    all_features[col] = all_features[col].astype(str)
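The reason the cast matters shows up later at the one-hot step: `pd.get_dummies` leaves numeric columns alone and only expands string or categorical ones. A small sketch with invented values:

```python
import pandas as pd

# Hypothetical toy column of dwelling-type codes
toy = pd.DataFrame({"MSSubClass": [20, 30, 20, 60]})

# As integers, the column passes through get_dummies unchanged
as_number = pd.get_dummies(toy)

# As strings, each code becomes its own indicator column
toy["MSSubClass"] = toy["MSSubClass"].astype(str)
as_category = pd.get_dummies(toy)

print(list(as_number.columns))
print(list(as_category.columns))
```

Left as integers, code 60 would be treated as three times code 20, which has no meaning for dwelling types.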
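Other columns are stored as strings but carry order information, such as good versus bad, more versus less, or time order. Converting them to numbers lets the model pick up more of that information.
Fireplace quality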
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
Basement height
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
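The remaining ones include:
Overall basement condition BsmtCond, basement finished area rating BsmtFinType1 and BsmtFinType2
Garage quality GarageQual, garage condition GarageCond, garage interior finish GarageFinish
Exterior quality ExterQual, exterior condition ExterCond, central air conditioning CentralAir
Heating quality HeatingQC, pool quality PoolQC, kitchen quality KitchenQual
Fence quality Fence, home functionality Functional, basement exposure (walkout or garden-level walls) BsmtExposure
Slope of the property LandSlope, general shape of the property LotShape, paved driveway PavedDrive
Type of alley access to the property Alley, type of road access to the property Street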
# Convert ordinal columns to integer codes
# Note: LabelEncoder numbers the labels in sorted order, not in quality order
sort2num = ['GarageYrBlt', 'GarageType', 'MasVnrType', 'FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 'ExterQual',
            'ExterCond', 'HeatingQC', 'KitchenQual', 'BsmtFinType1', 'MSZoning', 'Electrical',
            'BsmtFinType2', 'Functional', 'BsmtExposure', 'GarageFinish', 'LandSlope',
            'LotShape', 'PavedDrive', 'Street', 'CentralAir']
for col in sort2num:
    encoder = preprocessing.LabelEncoder()
    # Cast to str so columns that mix numbers with "None" (e.g. GarageYrBlt) can be encoded
    all_features[col] = encoder.fit_transform(all_features[col].astype(str))
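One caveat: `LabelEncoder` assigns codes by sorting the labels, so for string grades the codes follow alphabetical order (Ex, Fa, Gd, ...) rather than the quality ranking. If the ranking itself should be preserved, an explicit mapping is one alternative. The sketch below is not the method used in these notes; the `quality_order` dictionary is an assumption based on the grade descriptions listed earlier, and the toy values are invented.

```python
import pandas as pd

# Assumed ranking, following the grade descriptions: None < Po < Fa < TA < Gd < Ex
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Hypothetical toy column of fireplace grades
toy = pd.DataFrame({"FireplaceQu": ["Gd", "None", "TA", "Ex", "Po"]})

# Map each grade to its rank so that a larger number means better quality
toy["FireplaceQu_num"] = toy["FireplaceQu"].map(quality_order)
print(toy)
```

The same dictionary could be reused for the other columns that share this Ex/Gd/TA/Fa/Po scale, such as BsmtQual or KitchenQual.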
# Standardize the numeric columns (zero mean, unit variance)
numeric_ = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_] = all_features[numeric_].apply(lambda x: (x - x.mean()) / x.std())
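A quick way to confirm the standardization did what was intended is to check that each numeric column now has mean close to 0 and standard deviation close to 1. The sketch below uses invented data and picks numeric columns with `select_dtypes`, which is a swapped-in alternative to the dtype comparison above.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one string column
toy = pd.DataFrame({
    "LotArea": [8000.0, 9000.0, 10000.0, 13000.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Select numeric columns only; the string column is left untouched
numeric_cols = toy.select_dtypes(include="number").columns

# Subtract the column mean and divide by the column standard deviation
toy[numeric_cols] = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std()

print(toy["LotArea"].mean(), toy["LotArea"].std())
```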
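# One-hot encode the remaining categorical columns; dummy_na=True adds an indicator column for missing values
# dtype=float keeps every column numeric so the frame can be converted to a tensor
all_features = pd.get_dummies(all_features, dummy_na=True, dtype=float)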
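To see what `dummy_na=True` adds, here is a small sketch on an invented column with one missing value. The `dtype=float` argument is an addition for illustration, so that the indicators come out as 0.0 and 1.0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"SaleType": ["WD", "New", np.nan]})

# dummy_na=True adds an extra indicator column for missing values
encoded = pd.get_dummies(toy, dummy_na=True, dtype=float)
print(encoded)
```

Each row has exactly one indicator set, and the row with the missing value is marked in the extra column instead of being all zeros.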
# Split back into training and test sets
num_train = train_data.shape[0]
train_features = torch.tensor(all_features[:num_train].values, dtype=torch.float)
test_features = torch.tensor(all_features[num_train:].values, dtype=torch.float)
train_labels = torch.tensor(train_data.SalePrice.values, dtype=torch.float).view(-1, 1)