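import pandas as pd
train_df = pd.read_csv(r'H:\DataAnalysis\predictprice\train.csv')
test_df = pd.read_csv(r'H:\DataAnalysis\predictprice\test.csv')  # assumed path: test_df is used later but its loading is not shown
train_df.info()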
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
Category | Features |
Dwelling overview | BldgType HouseStyle YearBuilt YearRemodAdd LandContour LandSlope MSSubClass Functional MiscFeature MiscVal |
Building details | OverallQual OverallCond RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond 3SsnPorch ScreenPorch EnclosedPorch OpenPorchSF Foundation WoodDeckSF |
Rooms and amenities | Kitchen KitchenQual Bedroom FullBath HalfBath PoolArea PoolQC Fireplaces FireplaceQu TotRmsAbvGrd Fence |
Floor area | 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea |
Basement | BsmtQual (basement height) BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath |
Lot | LotFrontage LotArea LotShape |
Access and surroundings | Street Alley MSZoning Neighborhood Utilities PavedDrive Condition1 Condition2 |
Utilities | Heating HeatingQC Electrical CentralAir |
Garage | GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageCond GarageQual |
Sale details | MoSold YrSold SaleType SaleCondition |
LotFrontage 259
Alley 1369
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
MSZoning 4
LotFrontage 227
Alley 1352
Utilities 2
Exterior1st 1
Exterior2nd 1
MasVnrType 16
MasVnrArea 15
BsmtQual 44
BsmtCond 45
BsmtExposure 44
BsmtFinType1 42
BsmtFinSF1 1
BsmtFinType2 42
BsmtFinSF2 1
BsmtUnfSF 1
TotalBsmtSF 1
BsmtFullBath 2
BsmtHalfBath 2
KitchenQual 1
Functional 2
FireplaceQu 730
GarageType 76
GarageYrBlt 78
GarageFinish 78
GarageCars 1
GarageArea 1
GarageQual 78
GarageCond 78
PoolQC 1456
Fence 1169
MiscFeature 1408
SaleType 1
dtype: int64
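The two listings above (training set first, then test set) are per-column counts of missing values. A minimal sketch of how such counts can be produced with pandas follows. It uses a small made-up frame in place of the real data, and the helper name `missing_counts` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Toy data standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

print(missing_counts(toy))
```

Calling the same helper on `train_df` and on `test_df` would give one listing per data set.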
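The variable descriptions show that, for some categorical variables, NA is itself a category value. Examples are Alley, the Basement series, FireplaceQu, the Garage series, PoolQC, Fence, and MiscFeature. In these features NA stands for None (the house does not have that feature), not for a missing value.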
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
GarageType: Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
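Some features with missing values should be set to 0. If a house has no garage, for example, its garage area and garage car capacity should both be 0. This applies to GarageCars, GarageArea, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, and BsmtHalfBath.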
Feature | Train count | Test count | Type |
LotFrontage | 259 | 227 | miss |
Alley | 1369 | 1352 | None |
MasVnrType | 8 | 16 | miss or none |
MasVnrArea | 8 | 15 | miss or none |
BsmtQual | 37 | 44 | None |
BsmtCond | 37 | 45 | None |
BsmtExposure | 38 | 44 | None |
BsmtFinType1 | 37 | 42 | None |
BsmtFinType2 | 38 | 42 | None |
Electrical | 1 | - | miss |
FireplaceQu | 690 | 730 | None |
GarageType | 81 | 76 | None |
GarageYrBlt | 81 | 78 | None |
GarageFinish | 81 | 78 | None |
GarageQual | 81 | 78 | None |
GarageCond | 81 | 78 | None |
PoolQC | 1453 | 1456 | None |
Fence | 1179 | 1169 | None |
MiscFeature | 1406 | 1408 | None |
MSZoning | - | 4 | miss |
Utilities | - | 2 | miss |
Exterior1st | - | 1 | miss |
Exterior2nd | - | 1 | miss |
GarageCars | - | 1 | may be 0 |
GarageArea | - | 1 | may be 0 |
BsmtFinSF1 | - | 1 | may be 0 |
BsmtFinSF2 | - | 1 | may be 0 |
BsmtUnfSF | - | 1 | may be 0 |
TotalBsmtSF | - | 1 | may be 0 |
BsmtFullBath | - | 2 | may be 0 |
BsmtHalfBath | - | 2 | may be 0 |
KitchenQual | - | 1 | miss |
Functional | - | 2 | miss |
SaleType | - | 1 | miss |
# Where NA means None (the feature is absent), replace NA with the string 'None'.
# GarageYrBlt is numeric, so this makes it mixed-type; that is harmless here only because the column is dropped later.
train_feature1 = ('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
                  'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
                  'MiscFeature', 'MasVnrType')
test_feature1 = ('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
                 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
                 'MiscFeature', 'MasVnrType')
for loopi in train_feature1:
    train_df[loopi] = train_df[loopi].fillna('None')
for loopj in test_feature1:
    test_df[loopj] = test_df[loopj].fillna('None')
# Where NA means 0, replace NA with 0
train_df['MasVnrArea'] = train_df['MasVnrArea'].fillna(0)
test_feature2 = ('GarageCars', 'GarageArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea')
for loopk in test_feature2:
    test_df[loopk] = test_df[loopk].fillna(0)
# Where NA is a genuinely missing value, fill with the mode (text columns) or the mean (numeric columns) for now
train_df['Electrical'] = train_df['Electrical'].fillna(train_df.Electrical.mode()[0])
test_feature3 = ('MSZoning', 'Utilities', 'KitchenQual', 'Functional', 'SaleType', 'Exterior1st', 'Exterior2nd')
for loopn in test_feature3:
    test_df[loopn] = test_df[loopn].fillna(test_df[loopn].mode()[0])
train_df['LotFrontage'] = train_df['LotFrontage'].fillna(train_df['LotFrontage'].mean())
test_df['LotFrontage'] = test_df['LotFrontage'].fillna(test_df['LotFrontage'].mean())
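The three fill rules used in the code above can also be gathered into one helper that is applied to both frames. The sketch below is a variation, not the code used above: it takes the fill statistics (mode and mean) from the training frame only and reuses them for the test frame, which is a common way to keep test data out of the fitting step. The function name `impute` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def impute(df, none_cols, zero_cols, mode_fill, mean_fill):
    """Fill NA by rule: the 'None' category, 0, or a precomputed mode/mean value."""
    out = df.copy()
    for col in none_cols:
        out[col] = out[col].fillna("None")
    for col in zero_cols:
        out[col] = out[col].fillna(0)
    for col, value in {**mode_fill, **mean_fill}.items():
        out[col] = out[col].fillna(value)
    return out


# Toy frames standing in for train_df and test_df (made-up values).
train = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan],
    "GarageArea": [548.0, 460.0, 608.0],
    "Electrical": ["SBrkr", np.nan, "SBrkr"],
    "LotFrontage": [60.0, np.nan, 80.0],
})
test = pd.DataFrame({
    "Alley": [np.nan, "Pave"],
    "GarageArea": [np.nan, 480.0],
    "Electrical": ["FuseA", np.nan],
    "LotFrontage": [np.nan, 75.0],
})

# Statistics come from the training frame only.
mode_fill = {"Electrical": train["Electrical"].mode()[0]}
mean_fill = {"LotFrontage": train["LotFrontage"].mean()}

train_filled = impute(train, ["Alley"], ["GarageArea"], mode_fill, mean_fill)
test_filled = impute(test, ["Alley"], ["GarageArea"], mode_fill, mean_fill)
print(test_filled)
```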
4. Other feature pairs with a high mutual correlation coefficient include (1stFlrSF, TotalBsmtSF) = 0.82.
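A correlation figure like the 0.82 quoted above can be read off a correlation matrix. The following is a minimal sketch, assuming a numeric frame and a hypothetical helper `high_corr_pairs`. The toy data is synthetic, so its numbers are not those of the housing set.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                    # independent noise
})

print(high_corr_pairs(toy))
```

When two features are this strongly correlated, one of them can usually be dropped with little loss of information.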
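# Drop the rows with index labels 524 and 1299
train_df.drop([524, 1299], inplace=True)
# Drop the selected features; the labels must be a list, because pandas reads a tuple as a single label
drop_feature = ['GarageArea', 'GarageYrBlt', 'TotalBsmtSF', 'TotRmsAbvGrd', 'BsmtFinSF1', '1stFlrSF']
train_df.drop(drop_feature, axis=1, inplace=True)
test_df.drop(drop_feature, axis=1, inplace=True)  # keep the test columns aligned with train
# MSSubClass is a category code, so treat it as text
train_df['MSSubClass'] = train_df['MSSubClass'].astype(str)
test_df['MSSubClass'] = test_df['MSSubClass'].astype(str)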
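import numpy as np
import seaborn as sns
from scipy.special import boxcox1p
from scipy.stats import norm, skew
from sklearn.linear_model import Lasso
from sklearn.preprocessing import RobustScaler
# Compare the SalePrice distribution with a fitted normal curve (distplot is deprecated in recent seaborn releases)
sns.distplot(train_df.SalePrice, fit=norm)
# Train on log(SalePrice)
price = np.log(train_df.SalePrice)
# Combine train and test so both receive the same transforms and dummy columns
dataset = pd.concat([train_df, test_df])
skew_value = dataset.select_dtypes(include=['int64', 'float']).apply(lambda x: skew(x.dropna()))
skew_df = pd.DataFrame({'Skew': skew_value})
skew_df = skew_df[np.abs(skew_df.Skew) > 0.5]
# Apply boxcox1p (lambda = 0.1) to every skewed numeric feature except the target
for loopm in skew_df.index.drop('SalePrice').values:
    dataset[loopm] = boxcox1p(dataset[loopm], 0.1)
dataset = pd.get_dummies(dataset)
# The split back into train/test features is not shown in the original; the next three lines are an assumed reconstruction
n_train = train_df.shape[0]
train_feature = dataset.iloc[:n_train].drop(['Id', 'SalePrice'], axis=1)
test_feature = dataset.iloc[n_train:].drop(['Id', 'SalePrice'], axis=1)
sc = RobustScaler()
train_feature = sc.fit_transform(train_feature)
test_feature = sc.transform(test_feature)
# lasso model
model = Lasso(alpha=0.0005, random_state=0)
model.fit(train_feature, price)
predict = model.predict(test_feature)
# Undo the log transform to get prices
predicts = np.exp(predict)
output = pd.DataFrame({'Id': test_df.Id, 'SalePrice': predicts})
output.to_csv(r'H:\DataAnalysis\predictprice\regression.csv', index=False)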
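The modelling steps above (log-transform the target, reduce skew with `boxcox1p`, scale with `RobustScaler`, fit `Lasso`, then undo the log) can be tried end to end on synthetic data. This is a sketch under stated assumptions, not the exact run from this post: the data is generated, the column names `area` and `rooms` are made up, and the first 200 rows play the role of the training set.

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew
from sklearn.linear_model import Lasso
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic, right-skewed feature plus a roughly symmetric one (made up for illustration).
features = pd.DataFrame({
    "area": rng.lognormal(mean=7.0, sigma=0.4, size=n),
    "rooms": rng.integers(2, 10, size=n).astype(float),
})
# Positive target with multiplicative noise, so log(price) is close to linear in log(area).
price = 50.0 * features["area"] * np.exp(rng.normal(scale=0.1, size=n))

# Transform only the columns whose absolute skew exceeds 0.5.
skewness = features.apply(lambda col: skew(col))
for col in skewness[skewness.abs() > 0.5].index:
    features[col] = boxcox1p(features[col], 0.1)

train_x, test_x = features.iloc[:200], features.iloc[200:]
train_y = np.log(price.iloc[:200])  # the model is fit on log(price)

# Fit the scaler on the training rows, then apply the same scaling to the test rows.
scaler = RobustScaler()
train_scaled = scaler.fit_transform(train_x)
test_scaled = scaler.transform(test_x)

model = Lasso(alpha=0.0005, random_state=0)
model.fit(train_scaled, train_y)

# Predictions are in log space; np.exp maps them back to prices.
predicted_price = np.exp(model.predict(test_scaled))
print(predicted_price[:5])
```

Because the model is trained on `log(price)`, the final `np.exp` step is what keeps the predictions positive and on the original price scale.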