1. 定义问题
考虑到推土机的特性,利用过去的数据,我们能多大程度上预测它未来的价格?
2. 数据来源
kaggle:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview
The data for this competition is split into three parts:
3. 评价标准
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.
更多信息:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview/evaluation
注意:多数回归模型的评价标准都是减小误差。比如这次的目标就是最小化RMSLE。
4. 使用的特征
特征过多,请自行进入kaggle项目主页查看。或点击如下谷歌表格链接:https://docs.google.com/spreadsheets/d/1EIbdGa4S_46USXgg0OHX5jgTc8ld9fTHwPyi_VOV1as/edit#gid=0
# EDA
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
sns.set()
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_config = 'svg'
warnings.filterwarnings("ignore")
# 数据预处理
from sklearn.preprocessing import LabelEncoder
# sklearn模型
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
# 模型评估
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, mean_squared_error, r2_score
bulldozer_df = pd.read_csv('bluebook-for-bulldozers/TrainAndValid.csv',
low_memory = False)
appendix_df = pd.read_csv('bluebook-for-bulldozers/Machine_Appendix.csv',
low_memory = False)
# 查看各字段类型
bulldozer_df.info()
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SalesID 412698 non-null int64
1 SalePrice 412698 non-null float64
2 MachineID 412698 non-null int64
3 ModelID 412698 non-null int64
4 datasource 412698 non-null int64
5 auctioneerID 392562 non-null float64
6 YearMade 412698 non-null int64
7 MachineHoursCurrentMeter 147504 non-null float64
8 UsageBand 73670 non-null object
9 saledate 412698 non-null object
10 fiModelDesc 412698 non-null object
11 fiBaseModel 412698 non-null object
12 fiSecondaryDesc 271971 non-null object
13 fiModelSeries 58667 non-null object
14 fiModelDescriptor 74816 non-null object
15 ProductSize 196093 non-null object
16 fiProductClassDesc 412698 non-null object
17 state 412698 non-null object
18 ProductGroup 412698 non-null object
19 ProductGroupDesc 412698 non-null object
20 Drive_System 107087 non-null object
21 Enclosure 412364 non-null object
22 Forks 197715 non-null object
23 Pad_Type 81096 non-null object
24 Ride_Control 152728 non-null object
25 Stick 81096 non-null object
26 Transmission 188007 non-null object
27 Turbocharged 81096 non-null object
28 Blade_Extension 25983 non-null object
29 Blade_Width 25983 non-null object
30 Enclosure_Type 25983 non-null object
31 Engine_Horsepower 25983 non-null object
32 Hydraulics 330133 non-null object
33 Pushblock 25983 non-null object
34 Ripper 106945 non-null object
35 Scarifier 25994 non-null object
36 Tip_Control 25983 non-null object
37 Tire_Size 97638 non-null object
38 Coupler 220679 non-null object
39 Coupler_System 44974 non-null object
40 Grouser_Tracks 44875 non-null object
41 Hydraulics_Flow 44875 non-null object
42 Track_Type 102193 non-null object
43 Undercarriage_Pad_Width 102916 non-null object
44 Stick_Length 102261 non-null object
45 Thumb 102332 non-null object
46 Pattern_Changer 102261 non-null object
47 Grouser_Type 102193 non-null object
48 Backhoe_Mounting 80712 non-null object
49 Blade_Type 81875 non-null object
50 Travel_Controls 81877 non-null object
51 Differential_Type 71564 non-null object
52 Steering_Controls 71522 non-null object
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB
# 查看缺失值
bulldozer_df.isna().sum()
SalesID 0
SalePrice 0
MachineID 0
ModelID 0
datasource 0
auctioneerID 20136
YearMade 0
MachineHoursCurrentMeter 265194
UsageBand 339028
saledate 0
fiModelDesc 0
fiBaseModel 0
fiSecondaryDesc 140727
fiModelSeries 354031
fiModelDescriptor 337882
ProductSize 216605
fiProductClassDesc 0
state 0
ProductGroup 0
ProductGroupDesc 0
Drive_System 305611
Enclosure 334
Forks 214983
Pad_Type 331602
Ride_Control 259970
Stick 331602
Transmission 224691
Turbocharged 331602
Blade_Extension 386715
Blade_Width 386715
Enclosure_Type 386715
Engine_Horsepower 386715
Hydraulics 82565
Pushblock 386715
Ripper 305753
Scarifier 386704
Tip_Control 386715
Tire_Size 315060
Coupler 192019
Coupler_System 367724
Grouser_Tracks 367823
Hydraulics_Flow 367823
Track_Type 310505
Undercarriage_Pad_Width 309782
Stick_Length 310437
Thumb 310366
Pattern_Changer 310437
Grouser_Type 310505
Backhoe_Mounting 331986
Blade_Type 330823
Travel_Controls 330821
Differential_Type 341134
Steering_Controls 341176
dtype: int64
# 查看标签分布
bulldozer_df['SalePrice'].hist()
# 改为帕斯卡命名
bulldozer_df.rename(columns={'saledate': 'SaleDate'}, inplace=True)
bulldozer_df['SaleDate'] = pd.to_datetime(bulldozer_df['SaleDate'])
# 按时间查看售价,只看最近几年的,没有明显的上升趋势,只有上下波动
plt.figure(figsize=(20,15))
pd.pivot_table(bulldozer_df[200000::1000], index='SaleDate', values='SalePrice').plot()
plt.show()
2005年前后半年和2008年整个一年销量都比较差,除此之外看不出太多信息
# 将数据集按时间排序
bulldozer_df.sort_values(by='SaleDate', inplace=True)
bulldozer_df['SaleDate'].head(20)
205615 1989-01-17
274835 1989-01-31
141296 1989-01-31
212552 1989-01-31
62755 1989-01-31
54653 1989-01-31
81383 1989-01-31
204924 1989-01-31
135376 1989-01-31
113390 1989-01-31
113394 1989-01-31
116419 1989-01-31
32138 1989-01-31
127610 1989-01-31
76171 1989-01-31
127000 1989-01-31
128130 1989-01-31
127626 1989-01-31
55455 1989-01-31
55454 1989-01-31
Name: SaleDate, dtype: datetime64[ns]
这个数据集比较特殊,还有个appendix表,里面是一些推土机的配件信息,而且这个信息和train表有重复特征,重复特征里面还有匹配不上的情况,现在先联表上来看看
# 制作数据集副本,这是为了方便对数据集做了什么操作后,仍然可以获取原始数据,而不用从头读数据
bd_df = bulldozer_df.copy()
app_df = appendix_df.copy()
# SalesID列丢弃
bd_df.drop('SalesID', axis=1, inplace=True)
# 定义一个查看出入的函数
def check_difference(df1, df2, target_col, on_col):
temp_df = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
return temp_df[(temp_df[target_col+'_x'] != temp_df[target_col+'_y'])]
# 定义一个合并时互补的函数,冲突时保留df1的数据
def combine(df1, df2, target_col, on_col):
temp_df0 = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
temp_df0.fillna('', inplace=True)
temp_df1=temp_df0[(temp_df0[target_col+'_x']== '') & (temp_df0[target_col+'_y']!='')]
temp_df0[target_col+'_x'].loc[temp_df1.index] = temp_df1[target_col+'_y']
df1[target_col] = temp_df0[target_col+'_x']
df2.drop(columns=[target_col], inplace=True)
return df1
ProductGroup是ProductGroupDesc的首字母缩写版,选择保留后者
fiManufacturerID和fiManufacturerDesc包含的信息一样,选择保留前者
bd_df.drop(columns=['ProductGroup', 'fiProductClassDesc'], inplace=True)
app_df.drop(columns=['ModelID', 'fiModelDesc', 'ProductGroup', 'fiManufacturerDesc'], inplace=True)
# 查看枚举值,后续需要做分箱合并处理
bd_df['fiBaseModel'].value_counts()
580 20179
310 17886
D6 13527
416 12900
D5 9636
...
56 1
B4230 1
IS30 1
MM555 1
WLK15 1
Name: fiBaseModel, Length: 1961, dtype: int64
# 查看出入部分有无空值
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID').isna().sum()
MachineID 0
fiBaseModel_x 0
fiBaseModel_y 0
dtype: int64
# 查看出入部分
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID')
MachineID | fiBaseModel_x | fiBaseModel_y | |
---|---|---|---|
71 | 1523610 | WA150 | HD465 |
98 | 1059447 | WA300 | PC100 |
123 | 1303779 | 415 | 862 |
128 | 1340389 | MS240 | MS120 |
179 | 1208516 | D31 | PC100 |
... | ... | ... | ... |
412474 | 2308891 | 1845 | 75 |
412484 | 2287735 | T135 | T133 |
412509 | 2292146 | 450 | 465 |
412655 | 1846321 | TB135 | TB125 |
412695 | 1918416 | 337 | 530 |
12452 rows × 3 columns
# 选择保留bd_df表的数据
app_df.drop(columns=['fiBaseModel'], inplace=True)
# 查看枚举值,后续需要做分箱合并处理
bd_df['fiSecondaryDesc'].value_counts()
C 44431
B 40165
G 37915
H 24729
E 21532
...
BLGPPS 1
MSR 1
LC7A 1
CL 1
BH 1
Name: fiSecondaryDesc, Length: 177, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')
MachineID | fiSecondaryDesc_x | fiSecondaryDesc_y | |
---|---|---|---|
0 | 1126363 | NaN | NaN |
1 | 1194089 | NaN | NaN |
3 | 1327630 | NaN | NaN |
6 | 1082797 | NaN | NaN |
7 | 1527216 | NaN | NaN |
... | ... | ... | ... |
412690 | 1823846 | NaN | NaN |
412691 | 1278794 | NaN | NaN |
412692 | 1792049 | NaN | NaN |
412694 | 1919104 | NaN | NaN |
412695 | 1918416 | G | NaN |
147009 rows × 3 columns
# 合并互补
bd_df = combine(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')
# 查看枚举值,后续需要做分箱合并处理
bd_df['fiModelSeries'].value_counts()
II 13770
LC 9175
III 5351
-1 4646
-2 4033
...
LL 1
6F 1
-2LC 1
-5A 1
VII 1
Name: fiModelSeries, Length: 123, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelSeries', 'MachineID')
MachineID | fiModelSeries_x | fiModelSeries_y | |
---|---|---|---|
0 | 1126363 | NaN | NaN |
1 | 1194089 | NaN | NaN |
2 | 1473654 | NaN | NaN |
3 | 1327630 | NaN | NaN |
4 | 1336053 | NaN | NaN |
... | ... | ... | ... |
412693 | 1915521 | NaN | NaN |
412694 | 1919104 | NaN | NaN |
412695 | 1918416 | NaN | NaN |
412696 | 509560 | NaN | NaN |
412697 | 1869284 | NaN | NaN |
362366 rows × 3 columns
# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelSeries', 'MachineID')
# 查看枚举值
bd_df['fiModelDescriptor'].value_counts()
L 16464
LGP 16143
LC 13295
XL 6700
6 2944
...
K5 1
HighLift 1
High Lift 1
III 1
SL 1
Name: fiModelDescriptor, Length: 140, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelDescriptor', 'MachineID')
MachineID | fiModelDescriptor_x | fiModelDescriptor_y | |
---|---|---|---|
0 | 1126363 | NaN | NaN |
1 | 1194089 | NaN | NaN |
2 | 1473654 | NaN | NaN |
3 | 1327630 | NaN | NaN |
4 | 1336053 | NaN | NaN |
... | ... | ... | ... |
412693 | 1915521 | NaN | NaN |
412694 | 1919104 | NaN | NaN |
412695 | 1918416 | NaN | NaN |
412696 | 509560 | NaN | NaN |
412697 | 1869284 | NaN | NaN |
342069 rows × 3 columns
# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelDescriptor', 'MachineID')
# 查看枚举值
bd_df['ProductGroupDesc'].value_counts()
Track Excavators 104230
Track Type Tractors 82582
Backhoe Loaders 81401
Wheel Loader 73216
Skid Steer Loaders 45011
Motor Graders 26258
Name: ProductGroupDesc, dtype: int64
# 枚举值太多了,而且bd_df里的数据并无空值,这里就选择不互补了,直接保留bd_df的数据
app_df['ProductGroupDesc'].value_counts()
Track Excavators 89094
Backhoe Loaders 74074
Track Type Tractors 67362
Wheel Loader 62426
Skid Steer Loaders 42121
Motor Graders 22602
Track Loaders 158
Articulated Trucks 109
Ag Tractors 103
Wheel Tractor Scraper 102
Off Highway Trucks 81
Multi Terrain Loaders 66
Wheel Excavator 66
Forklift 53
Skidders 46
Wheel Feller Buncher 31
Forestry Log Loaders 25
Pipelayers 11
Telehandler 9
Wheel Dozer 7
Knuckleboom Loaders 6
Track Feller Bunchers 6
Vibratory Double Drum Asphalt 4
Vibratory Single Drum Asphalt 4
Work Tool 3
Pneumatic Tired Compactor 3
Compactors 3
Vibratory Single Drum Pad 3
Tandem Roller Static 2
Harvesters 2
Forwarders 2
Engine, Industrial OEM 2
Track Harvesters 1
Delimber Forestry 1
Asphalt/Concrete Pavers 1
Vibratory Single Drum Smooth 1
Crane/Dragline 1
Forestry Excavators 1
Cold Planers 1
Name: ProductGroupDesc, dtype: int64
app_df.drop(columns=['ProductGroupDesc'], inplace=True)
# 两边同一个特征名字不一样,先改名
app_df.rename(columns={'MfgYear': 'YearMade'}, inplace=True)
# 查看枚举值,发现异常值1000
bd_df['YearMade'].value_counts()
1000 39391
2005 22096
1998 21751
2004 20914
1999 19274
...
2012 1
1949 1
1942 1
2013 1
1937 1
Name: YearMade, Length: 73, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'YearMade', 'MachineID')
MachineID | YearMade_x | YearMade_y | |
---|---|---|---|
52 | 1531656 | 1975 | 1987.0 |
58 | 1078132 | 1967 | 1966.0 |
71 | 1523610 | 1986 | 2007.0 |
98 | 1059447 | 1984 | 1996.0 |
122 | 1525180 | 1984 | 1985.0 |
... | ... | ... | ... |
412651 | 1631729 | 1000 | 1990.0 |
412654 | 1912106 | 1000 | 1996.0 |
412655 | 1846321 | 1000 | 2005.0 |
412667 | 1897535 | 1998 | 1999.0 |
412696 | 509560 | 1993 | 1990.0 |
36406 rows × 3 columns
这种不一致通常是因为bd_df里面有重复的MachineID,即同一种推土机被卖了多次,而app_df里只有唯一MachineID所导致,决定保留bd_df的数据,不动它
app_df.drop('YearMade', axis=1, inplace=True)
这三个特征原训练集上都没有,直接联过去
# total_df = bd_df.copy()
total_df = pd.merge(bd_df, app_df, how='left', on='MachineID')
# 动力上限和下限留一个就足够区分了,取下限
total_df.drop(columns=['PrimaryUpper'], inplace=True)
total_df.info()
Int64Index: 412698 entries, 0 to 412697
Data columns (total 54 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SalePrice 412698 non-null float64
1 MachineID 412698 non-null int64
2 ModelID 412698 non-null int64
3 datasource 412698 non-null int64
4 auctioneerID 392562 non-null float64
5 YearMade 412698 non-null int64
6 MachineHoursCurrentMeter 147504 non-null float64
7 UsageBand 73670 non-null object
8 SaleDate 412698 non-null datetime64[ns]
9 fiModelDesc 412698 non-null object
10 fiBaseModel 412698 non-null object
11 fiSecondaryDesc 412698 non-null object
12 fiModelSeries 412698 non-null object
13 fiModelDescriptor 412698 non-null object
14 ProductSize 196093 non-null object
15 state 412698 non-null object
16 ProductGroupDesc 412698 non-null object
17 Drive_System 107087 non-null object
18 Enclosure 412364 non-null object
19 Forks 197715 non-null object
20 Pad_Type 81096 non-null object
21 Ride_Control 152728 non-null object
22 Stick 81096 non-null object
23 Transmission 188007 non-null object
24 Turbocharged 81096 non-null object
25 Blade_Extension 25983 non-null object
26 Blade_Width 25983 non-null object
27 Enclosure_Type 25983 non-null object
28 Engine_Horsepower 25983 non-null object
29 Hydraulics 330133 non-null object
30 Pushblock 25983 non-null object
31 Ripper 106945 non-null object
32 Scarifier 25994 non-null object
33 Tip_Control 25983 non-null object
34 Tire_Size 97638 non-null object
35 Coupler 220679 non-null object
36 Coupler_System 44974 non-null object
37 Grouser_Tracks 44875 non-null object
38 Hydraulics_Flow 44875 non-null object
39 Track_Type 102193 non-null object
40 Undercarriage_Pad_Width 102916 non-null object
41 Stick_Length 102261 non-null object
42 Thumb 102332 non-null object
43 Pattern_Changer 102261 non-null object
44 Grouser_Type 102193 non-null object
45 Backhoe_Mounting 80712 non-null object
46 Blade_Type 81875 non-null object
47 Travel_Controls 81877 non-null object
48 Differential_Type 71564 non-null object
49 Steering_Controls 71522 non-null object
50 fiProductClassDesc 412698 non-null object
51 fiManufacturerID 412698 non-null int64
52 PrimarySizeBasis 407439 non-null object
53 PrimaryLower 407439 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(5), object(44)
memory usage: 173.2+ MB
raise KeyError
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [903], in ()
----> 1 raise KeyError
KeyError:
|
# 173怀疑是172写错成173
total_df['datasource'].value_counts()
132 260776
136 75491
149 33325
121 25191
172 17914
173 1
Name: datasource, dtype: int64
# 后续要合并
total_df['auctioneerID'].value_counts()
1.0 192773
2.0 57441
3.0 30288
4.0 20877
99.0 12042
6.0 11950
7.0 7847
8.0 7419
5.0 7002
10.0 5876
9.0 4764
11.0 3823
12.0 3610
13.0 3068
18.0 2359
14.0 2277
20.0 2238
19.0 2074
16.0 1807
15.0 1742
21.0 1601
22.0 1429
24.0 1357
23.0 1322
17.0 1275
27.0 1150
25.0 959
28.0 860
26.0 796
0.0 536
Name: auctioneerID, dtype: int64
# 查看auctioneerID和价格的关系
temp = pd.pivot_table(total_df, index='auctioneerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('median', 'SalePrice'))
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
auctioneerID | |||
22.0 | 1429 | 18230.020994 | 14000.0 |
18.0 | 2359 | 18839.799491 | 14500.0 |
21.0 | 1601 | 19645.315428 | 15000.0 |
25.0 | 959 | 22478.519291 | 16000.0 |
9.0 | 4764 | 22515.003149 | 16775.0 |
14.0 | 2277 | 20804.797980 | 17000.0 |
17.0 | 1275 | 22974.941176 | 18000.0 |
12.0 | 3610 | 24762.936288 | 18250.0 |
28.0 | 860 | 28312.674419 | 18750.0 |
20.0 | 2238 | 24737.811439 | 19250.0 |
99.0 | 12042 | 26958.806112 | 19500.0 |
13.0 | 3068 | 27017.441330 | 20500.0 |
16.0 | 1807 | 26261.427781 | 20500.0 |
27.0 | 1150 | 27606.739130 | 22500.0 |
4.0 | 20877 | 29825.726876 | 23000.0 |
23.0 | 1322 | 30613.729198 | 23000.0 |
2.0 | 57441 | 29023.051670 | 23000.0 |
10.0 | 5876 | 29561.546971 | 23000.0 |
0.0 | 536 | 29979.850746 | 23000.0 |
15.0 | 1742 | 29986.337543 | 23500.0 |
5.0 | 7002 | 29150.217795 | 23500.0 |
24.0 | 1357 | 33041.156964 | 24500.0 |
1.0 | 192773 | 32684.870075 | 25000.0 |
8.0 | 7419 | 32477.978838 | 25000.0 |
11.0 | 3823 | 32707.088674 | 25500.0 |
3.0 | 30288 | 33596.962592 | 26000.0 |
6.0 | 11950 | 34708.070628 | 27000.0 |
26.0 | 796 | 36157.412060 | 28000.0 |
7.0 | 7847 | 36288.450236 | 28500.0 |
19.0 | 2074 | 42715.766635 | 34000.0 |
# 画图查看
temp.sort_values(by=('median', 'SalePrice'))[('median', 'SalePrice')].plot(kind='bar')
plt.show()
total_df['YearMade'] = total_df['YearMade'].astype(int)
total_df['YearMade'].value_counts()
1000 39391
2005 22096
1998 21751
2004 20914
1999 19274
...
2012 1
1949 1
1942 1
2013 1
1937 1
Name: YearMade, Length: 73, dtype: int64
plt.figure(figsize=(14,10))
sns.countplot(total_df['YearMade'])
plt.xticks(rotation=90)
plt.show()
第一台拖拉机1904年才发明出来,1904年前的属于异常数据。同时该数据集截至年份是2012年,大于2012的属于异常。异常值后续增加一个新的衍生变量YearMade_is_error区分。
生产时间大于销售时间的,暂时认为是提前订货,不处理。
# 查看枚举值
total_df['MachineHoursCurrentMeter'].value_counts()
0.0 73834
2000.0 124
1000.0 117
24.0 115
1500.0 101
...
10834.0 1
3499.0 1
26270.0 1
26901.0 1
17920.0 1
Name: MachineHoursCurrentMeter, Length: 15633, dtype: int64
# 查看枚举值
total_df['UsageBand'].value_counts()
Medium 35832
Low 25311
High 12527
Name: UsageBand, dtype: int64
total_df['fiBaseModel'].value_counts()
580 20179
310 17886
D6 13527
416 12900
D5 9636
...
56 1
B4230 1
IS30 1
MM555 1
WLK15 1
Name: fiBaseModel, Length: 1961, dtype: int64
total_df['fiSecondaryDesc'].value_counts().head(10)
136420
C 44658
B 40446
G 38139
H 24759
E 21944
D 20132
F 9454
K 8089
A 5968
Name: fiSecondaryDesc, dtype: int64
temp = pd.pivot_table(total_df, index='fiSecondaryDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
fiSecondaryDesc | |||
XT | 2177 | 29736.816720 | 23000.0 |
LE | 2510 | 31586.065737 | 23000.0 |
R | 3577 | 31164.435840 | 24000.0 |
N | 3844 | 31028.674558 | 23500.0 |
SUPER L | 3920 | 31237.168367 | 24000.0 |
P | 4475 | 32841.134078 | 26000.0 |
J | 4655 | 31069.470892 | 24000.0 |
LC | 4862 | 31265.127931 | 24000.0 |
M | 5398 | 32225.976288 | 24500.0 |
L | 5669 | 31634.053625 | 24500.0 |
A | 5968 | 31266.457775 | 24000.0 |
K | 8089 | 30815.391396 | 23000.0 |
F | 9454 | 31958.080389 | 25000.0 |
D | 20132 | 30766.167644 | 24000.0 |
E | 21944 | 31089.640357 | 24000.0 |
H | 24759 | 31279.811059 | 24000.0 |
G | 38139 | 30902.040200 | 23500.0 |
B | 40446 | 31238.508530 | 24000.0 |
C | 44658 | 31024.103699 | 24000.0 |
136420 | 31426.038213 | 24000.0 |
total_df['fiModelSeries'].value_counts().head(10)
350453
II 14039
LC 9609
III 5392
-1 5142
-2 4350
-6 3538
-3 2783
-5 2664
-12 1447
Name: fiModelSeries, dtype: int64
total_df['fiModelDescriptor'].value_counts().head(10)
333040
L 16676
LGP 16541
LC 15666
XL 6704
6 3238
LT 2681
5 2455
3 2149
CR 1798
Name: fiModelDescriptor, dtype: int64
temp = pd.pivot_table(total_df, index='fiModelDescriptor', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
fiModelDescriptor | |||
K | 349 | 29030.372493 | 22000.0 |
8 | 351 | 33369.800570 | 27000.0 |
E | 359 | 34427.576602 | 25500.0 |
SR | 383 | 28176.827676 | 20000.0 |
ZTS | 442 | 28579.751131 | 21250.0 |
Z | 525 | 27814.190476 | 21000.0 |
2 | 627 | 32085.047847 | 26000.0 |
SSR | 684 | 33503.508772 | 27000.0 |
H | 1092 | 30992.971612 | 24000.0 |
7 | 1115 | 30983.373991 | 24000.0 |
CR | 1798 | 28777.157953 | 22000.0 |
3 | 2149 | 31477.175430 | 24000.0 |
5 | 2455 | 30897.668024 | 24000.0 |
LT | 2681 | 28216.729206 | 21000.0 |
6 | 3238 | 30573.305127 | 24000.0 |
XL | 6704 | 30564.901700 | 23000.0 |
LC | 15666 | 30387.232989 | 23000.0 |
LGP | 16541 | 31327.267033 | 24000.0 |
L | 16676 | 30739.471036 | 24000.0 |
333040 | 31340.137891 | 24000.0 |
total_df['ProductSize'].fillna('').value_counts()
216605
Medium 64342
Large / Medium 51297
Small 27057
Mini 25721
Large 21396
Compact 6280
Name: ProductSize, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='ProductSize', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
ProductSize | |||
Compact | 6280 | 17498.578822 | 15000.0 |
Large | 21396 | 42023.433315 | 34000.0 |
Mini | 25721 | 15194.647720 | 13500.0 |
Small | 27057 | 32511.164061 | 29000.0 |
Large / Medium | 51297 | 47828.923231 | 44000.0 |
Medium | 64342 | 45703.873473 | 41000.0 |
216605 | 24047.383408 | 19000.0 |
# 查看数据分布
total_df['state'].fillna('').value_counts()
Florida 67320
Texas 53110
California 29761
Washington 16222
Georgia 14633
Maryland 13322
Mississippi 13240
Ohio 12369
Illinois 11540
Colorado 11529
New Jersey 11156
North Carolina 10636
Tennessee 10298
Alabama 10292
Pennsylvania 10234
South Carolina 9951
Arizona 9364
New York 8639
Connecticut 8276
Minnesota 7885
Missouri 7178
Nevada 6932
Louisiana 6627
Kentucky 5351
Maine 5096
Indiana 4124
Arkansas 3933
New Mexico 3631
Utah 3046
Unspecified 2801
Wisconsin 2745
New Hampshire 2738
Virginia 2353
Idaho 2025
Oregon 1911
Michigan 1831
Wyoming 1672
Montana 1336
Iowa 1336
Oklahoma 1326
Nebraska 866
West Virginia 840
Kansas 667
Delaware 510
North Dakota 480
Alaska 430
Massachusetts 347
Vermont 300
South Dakota 244
Hawaii 118
Rhode Island 83
Puerto Rico 42
Washington DC 2
Name: state, dtype: int64
# 查看州和价格的关系,有关系
temp = pd.pivot_table(total_df, index='state', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
state | |||
Minnesota | 7885 | 27199.055168 | 19500.0 |
Connecticut | 8276 | 29008.808603 | 23500.0 |
New York | 8639 | 25582.237527 | 20000.0 |
Arizona | 9364 | 31562.857753 | 23000.0 |
South Carolina | 9951 | 29848.894583 | 22000.0 |
Pennsylvania | 10234 | 25463.002736 | 19000.0 |
Alabama | 10292 | 35438.541489 | 28000.0 |
Tennessee | 10298 | 31845.357351 | 24000.0 |
North Carolina | 10636 | 32161.773693 | 25000.0 |
New Jersey | 11156 | 30982.188867 | 24500.0 |
Colorado | 11529 | 31777.748287 | 25000.0 |
Illinois | 11540 | 29091.581456 | 22500.0 |
Ohio | 12369 | 28228.100493 | 22000.0 |
Mississippi | 13240 | 32574.592145 | 25000.0 |
Maryland | 13322 | 28621.682180 | 22000.0 |
Georgia | 14633 | 32265.550468 | 24000.0 |
Washington | 16222 | 27690.731722 | 21500.0 |
California | 29761 | 29815.202715 | 22500.0 |
Texas | 53110 | 32977.190341 | 25000.0 |
Florida | 67320 | 34387.512775 | 27000.0 |
# 进一步按价格排序查看
temp.sort_values(by=[('median', 'SalePrice'), ('mean', 'SalePrice')])
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
state | |||
Indiana | 4124 | 24400.357662 | 19000.0 |
Pennsylvania | 10234 | 25463.002736 | 19000.0 |
Maine | 5096 | 26176.942700 | 19500.0 |
Minnesota | 7885 | 27199.055168 | 19500.0 |
New York | 8639 | 25582.237527 | 20000.0 |
Kansas | 667 | 28093.403298 | 20000.0 |
Puerto Rico | 42 | 26011.904762 | 20250.0 |
Wisconsin | 2745 | 27656.247723 | 21000.0 |
Idaho | 2025 | 29263.234568 | 21000.0 |
Washington | 16222 | 27690.731722 | 21500.0 |
Virginia | 2353 | 28798.645984 | 21500.0 |
Ohio | 12369 | 28228.100493 | 22000.0 |
Maryland | 13322 | 28621.682180 | 22000.0 |
Michigan | 1831 | 29003.923648 | 22000.0 |
Missouri | 7178 | 29046.000697 | 22000.0 |
Kentucky | 5351 | 29815.380303 | 22000.0 |
South Carolina | 9951 | 29848.894583 | 22000.0 |
Vermont | 300 | 27285.333333 | 22250.0 |
Arkansas | 3933 | 28906.893974 | 22500.0 |
Illinois | 11540 | 29091.581456 | 22500.0 |
California | 29761 | 29815.202715 | 22500.0 |
Washington DC | 2 | 22750.000000 | 22750.0 |
Massachusetts | 347 | 28382.420749 | 23000.0 |
New Hampshire | 2738 | 28928.743608 | 23000.0 |
Oregon | 1911 | 30277.590267 | 23000.0 |
Delaware | 510 | 31160.098039 | 23000.0 |
Arizona | 9364 | 31562.857753 | 23000.0 |
Connecticut | 8276 | 29008.808603 | 23500.0 |
Louisiana | 6627 | 30201.893768 | 23500.0 |
Tennessee | 10298 | 31845.357351 | 24000.0 |
Iowa | 1336 | 31927.544910 | 24000.0 |
Oklahoma | 1326 | 32258.710407 | 24000.0 |
Georgia | 14633 | 32265.550468 | 24000.0 |
Montana | 1336 | 32616.654192 | 24000.0 |
Nebraska | 866 | 32612.066975 | 24250.0 |
New Jersey | 11156 | 30982.188867 | 24500.0 |
Colorado | 11529 | 31777.748287 | 25000.0 |
North Carolina | 10636 | 32161.773693 | 25000.0 |
Mississippi | 13240 | 32574.592145 | 25000.0 |
Wyoming | 1672 | 32604.425837 | 25000.0 |
Texas | 53110 | 32977.190341 | 25000.0 |
Hawaii | 118 | 28879.237288 | 26000.0 |
Alaska | 430 | 33281.976744 | 26000.0 |
New Mexico | 3631 | 33632.649408 | 26000.0 |
Utah | 3046 | 34190.547932 | 27000.0 |
Florida | 67320 | 34387.512775 | 27000.0 |
Nevada | 6932 | 36332.097519 | 27000.0 |
Unspecified | 2801 | 34857.711532 | 28000.0 |
Alabama | 10292 | 35438.541489 | 28000.0 |
North Dakota | 480 | 39083.750000 | 29750.0 |
West Virginia | 840 | 40258.750000 | 33000.0 |
Rhode Island | 83 | 37622.289157 | 34000.0 |
South Dakota | 244 | 43907.377049 | 35000.0 |
# 查看数据分布
total_df['ProductGroupDesc'].fillna('').value_counts()
Track Excavators 104230
Track Type Tractors 82582
Backhoe Loaders 81401
Wheel Loader 73216
Skid Steer Loaders 45011
Motor Graders 26258
Name: ProductGroupDesc, dtype: int64
# 查看和价格的关系
temp = pd.pivot_table(total_df, index='ProductGroupDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
ProductGroupDesc | |||
Motor Graders | 26258 | 47561.957422 | 40000.0 |
Skid Steer Loaders | 45011 | 10583.944014 | 10000.0 |
Wheel Loader | 73216 | 37259.292422 | 32000.0 |
Backhoe Loaders | 81401 | 20951.582327 | 20500.0 |
Track Type Tractors | 82582 | 36280.132780 | 29500.0 |
Track Excavators | 104230 | 35763.457018 | 29000.0 |
total_df['Drive_System'].fillna('').value_counts()
305611
Two Wheel Drive 47546
Four Wheel Drive 33551
No 25166
All Wheel Drive 824
Name: Drive_System, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Drive_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Drive_System | |||
All Wheel Drive | 824 | 60455.279126 | 59000.0 |
No | 25166 | 47342.721450 | 40000.0 |
Four Wheel Drive | 33551 | 24534.149414 | 24000.0 |
Two Wheel Drive | 47546 | 18418.026879 | 17500.0 |
305611 | 32532.703693 | 25000.0 |
total_df['Enclosure'].fillna('').value_counts()
OROPS 177971
EROPS 141769
EROPS w AC 92601
334
EROPS AC 18
NO ROPS 3
None or Unspecified 2
Name: Enclosure, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Enclosure', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Enclosure | |||
None or Unspecified | 2 | 16500.000000 | 16500.0 |
NO ROPS | 3 | 44333.333333 | 42500.0 |
EROPS AC | 18 | 23500.000000 | 20500.0 |
334 | 27689.446108 | 25000.0 | |
EROPS w AC | 92601 | 51671.169728 | 47000.0 |
EROPS | 141769 | 28687.993278 | 23000.0 |
OROPS | 177971 | 22592.082739 | 18000.0 |
total_df['Forks'].fillna('').value_counts()
214983
None or Unspecified 183061
Yes 14654
Name: Forks, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Forks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Forks | |||
Yes | 14654 | 36761.354715 | 31500.0 |
None or Unspecified | 183061 | 23593.414184 | 18500.0 |
214983 | 37327.174954 | 30000.0 |
total_df['Pad_Type'].fillna('').value_counts()
331602
None or Unspecified 72395
Reversible 5950
Street 2725
Grouser 26
Name: Pad_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Pad_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Pad_Type | |||
Grouser | 26 | 30289.423077 | 30500.0 |
Street | 2725 | 24995.064220 | 25000.0 |
Reversible | 5950 | 27344.033613 | 27500.0 |
None or Unspecified | 72395 | 20267.120354 | 20000.0 |
331602 | 33725.998897 | 26000.0 |
total_df['Ride_Control'].fillna('').value_counts()
259970
No 79389
None or Unspecified 64693
Yes 8646
Name: Ride_Control, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Ride_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Ride_Control | |||
Yes | 8646 | 55278.269142 | 53000.0 |
None or Unspecified | 64693 | 34834.650024 | 29000.0 |
No | 79389 | 20765.720100 | 20000.0 |
259970 | 32705.232362 | 25000.0 |
total_df['Stick'].fillna('').value_counts()
331602
Standard 49854
Extended 31242
Name: Stick, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Stick', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Stick | |||
Extended | 31242 | 23257.320914 | 23000.0 |
Standard | 49854 | 19501.525113 | 19000.0 |
331602 | 33725.998897 | 26000.0 |
total_df['Transmission'].fillna('').value_counts()
224691
Standard 143915
None or Unspecified 23889
Powershift 11991
Powershuttle 4286
Hydrostatic 3342
Direct Drive 422
Autoshift 118
AutoShift 44
Name: Transmission, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Transmission', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Transmission | |||
AutoShift | 44 | 102261.363636 | 95000.0 |
Autoshift | 118 | 33141.949153 | 33250.0 |
Direct Drive | 422 | 18675.829384 | 11550.0 |
Hydrostatic | 3342 | 34705.814183 | 32000.0 |
Powershuttle | 4286 | 19264.930938 | 18000.0 |
Powershift | 11991 | 38574.352931 | 29500.0 |
None or Unspecified | 23889 | 47354.812968 | 40000.0 |
Standard | 143915 | 28364.132218 | 23000.0 |
224691 | 31117.253842 | 23000.0 |
total_df['Turbocharged'].fillna('').value_counts()
331602
None or Unspecified 77111
Yes 3985
Name: Turbocharged, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Turbocharged', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Turbocharged | |||
Yes | 3985 | 24144.300376 | 24000.0 |
None or Unspecified | 77111 | 20783.276264 | 20000.0 |
331602 | 33725.998897 | 26000.0 |
total_df['Blade_Extension'].fillna('').value_counts()
386715
None or Unspecified 25406
Yes 577
Name: Blade_Extension, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Blade_Extension', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Blade_Extension | |||
Yes | 577 | 66587.694974 | 65000.0 |
None or Unspecified | 25406 | 47337.075022 | 40000.0 |
386715 | 30103.244279 | 23500.0 |
total_df['Blade_Width'].fillna('').value_counts()
386715
14' 9867
None or Unspecified 9521
12' 5201
16' 960
13' 335
<12' 99
Name: Blade_Width, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Blade_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Blade_Width | |||
<12' | 99 | 27764.646465 | 21000.0 |
13' | 335 | 27125.522388 | 18500.0 |
16' | 960 | 59882.119792 | 52750.0 |
12' | 5201 | 36000.852144 | 29000.0 |
None or Unspecified | 9521 | 41899.460141 | 32000.0 |
14' | 9867 | 59347.223168 | 55000.0 |
386715 | 30103.244279 | 23500.0 |
total_df['Enclosure_Type'].fillna('').value_counts()
386715
None or Unspecified 22469
Low Profile 2675
High Profile 839
Name: Enclosure_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Enclosure_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Enclosure_Type | |||
High Profile | 839 | 81897.437426 | 80000.0 |
Low Profile | 2675 | 81444.658318 | 80000.0 |
None or Unspecified | 22469 | 42480.324759 | 35000.0 |
386715 | 30103.244279 | 23500.0 |
total_df['Engine_Horsepower'].fillna('').value_counts()
386715
No 24642
Variable 1341
Name: Engine_Horsepower, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Engine_Horsepower', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Engine_Horsepower | |||
Variable | 1341 | 90245.152871 | 88000.0 |
No | 24642 | 45452.807321 | 38000.0 |
386715 | 30103.244279 | 23500.0 |
total_df['Hydraulics'].fillna('').value_counts()
2 Valve 145317
Standard 106515
82565
Auxiliary 43224
Base + 1 Function 25511
3 Valve 5807
4 Valve 3077
Base + 3 Function 311
Base + 2 Function 132
Base + 5 Function 94
Base + 4 Function 81
Base + 6 Function 54
None or Unspecified 10
Name: Hydraulics, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Hydraulics | |||
None or Unspecified | 10 | 27400.000000 | 26250.0 |
Base + 6 Function | 54 | 68333.333333 | 69500.0 |
Base + 4 Function | 81 | 95253.086420 | 92500.0 |
Base + 5 Function | 94 | 75601.063830 | 71000.0 |
Base + 2 Function | 132 | 89424.242424 | 89500.0 |
Base + 3 Function | 311 | 84141.800643 | 79000.0 |
4 Valve | 3077 | 56652.551186 | 51000.0 |
3 Valve | 5807 | 44479.309454 | 40000.0 |
Base + 1 Function | 25511 | 46637.433970 | 39000.0 |
Auxiliary | 43224 | 25076.746784 | 16000.0 |
82565 | 21028.900297 | 20500.0 | |
Standard | 106515 | 29391.410139 | 20500.0 |
2 Valve | 145317 | 36145.196393 | 30000.0 |
total_df['Pushblock'].fillna('').value_counts()
386715
None or Unspecified 20017
Yes 5966
Name: Pushblock, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Pushblock', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Pushblock | |||
Yes | 5966 | 68305.127724 | 65000.0 |
None or Unspecified | 20017 | 41642.525653 | 33000.0 |
386715 | 30103.244279 | 23500.0 |
total_df['Ripper'].fillna('').value_counts()
305753
None or Unspecified 85405
Yes 8185
Multi Shank 8071
Single Shank 5284
Name: Ripper, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Ripper', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Ripper | |||
Single Shank | 5284 | 50139.366389 | 42000.0 |
Multi Shank | 8071 | 48952.397472 | 41000.0 |
Yes | 8185 | 63677.683079 | 60000.0 |
None or Unspecified | 85405 | 35253.466486 | 28500.0 |
305753 | 28422.902101 | 22000.0 |
total_df['Scarifier'].fillna('').value_counts()
386704
None or Unspecified 13033
Yes 12961
Name: Scarifier, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Scarifier', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Scarifier | |||
Yes | 12961 | 49523.266415 | 40000.0 |
None or Unspecified | 13033 | 45996.798281 | 40000.0 |
386704 | 30103.375220 | 23500.0 |
total_df['Tip_Control'].fillna('').value_counts()
386715
None or Unspecified 16832
Sideshift & Tip 7164
Tip 1987
Name: Tip_Control, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Tip_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Tip_Control | |||
Tip | 1987 | 62269.265727 | 57000.0 |
Sideshift & Tip | 7164 | 46739.039084 | 41000.0 |
None or Unspecified | 16832 | 46488.790459 | 37500.0 |
386715 | 30103.244279 | 23500.0 |
total_df['Tire_Size'].fillna('').value_counts()
315060
None or Unspecified 47823
20.5 15773
14" 9111
23.5 8760
26.5 4635
17.5 3971
29.5 2767
17.5" 1815
13" 776
20.5" 737
15.5 610
15.5" 463
23.5" 309
7.0" 56
23.1" 20
10" 9
10 inch 3
Name: Tire_Size, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Tire_Size', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Tire_Size | |||
10 inch | 3 | 16333.333333 | 15500.0 |
10" | 9 | 31166.666667 | 25000.0 |
23.1" | 20 | 14837.500000 | 12125.0 |
7.0" | 56 | 27625.000000 | 25500.0 |
23.5" | 309 | 56487.944984 | 51000.0 |
15.5" | 463 | 47401.727862 | 40000.0 |
15.5 | 610 | 17153.442623 | 15000.0 |
20.5" | 737 | 74012.632293 | 68000.0 |
13" | 776 | 21085.373711 | 16000.0 |
17.5" | 1815 | 67478.650138 | 65000.0 |
29.5 | 2767 | 46650.813155 | 43000.0 |
17.5 | 3971 | 25344.856459 | 22500.0 |
26.5 | 4635 | 50903.011866 | 50000.0 |
23.5 | 8760 | 47236.347032 | 44000.0 |
14" | 9111 | 50506.226869 | 44000.0 |
20.5 | 15773 | 41099.197997 | 38000.0 |
None or Unspecified | 47823 | 35357.008218 | 27000.0 |
315060 | 28433.535937 | 22000.0 |
total_df['Coupler'].fillna('').value_counts()
192019
None or Unspecified 190449
Manual 23918
Hydraulic 6312
Name: Coupler, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Coupler', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Coupler | |||
Hydraulic | 6312 | 46839.729087 | 43000.0 |
Manual | 23918 | 33493.829877 | 27000.0 |
None or Unspecified | 190449 | 30392.948847 | 22500.0 |
192019 | 31233.255205 | 24000.0 |
total_df['Coupler_System'].fillna('').value_counts()
367724
None or Unspecified 41727
Yes 3247
Name: Coupler_System, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Coupler_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Coupler_System | |||
Yes | 3247 | 11593.701879 | 11500.0 |
None or Unspecified | 41727 | 10504.809931 | 10000.0 |
367724 | 33738.521242 | 26000.0 |
total_df['Grouser_Tracks'].fillna('').value_counts()
367823
None or Unspecified 41820
Yes 3055
Name: Grouser_Tracks, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Tracks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Grouser_Tracks | |||
Yes | 3055 | 12871.268412 | 13000.0 |
None or Unspecified | 41820 | 10410.951196 | 9750.0 |
367823 | 33732.896625 | 26000.0 |
total_df['Hydraulics_Flow'].fillna('').value_counts()
367823
Standard 44251
High Flow 597
None or Unspecified 27
Name: Hydraulics_Flow, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics_Flow', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Hydraulics_Flow | |||
None or Unspecified | 27 | 15053.703704 | 15000.0 |
High Flow | 597 | 13183.165829 | 13000.0 |
Standard | 44251 | 10540.573185 | 10000.0 |
367823 | 33732.896625 | 26000.0 |
total_df['Track_Type'].fillna('').value_counts()
310505
Steel 87463
Rubber 14730
Name: Track_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Track_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Track_Type | |||
Rubber | 14730 | 15823.905906 | 13500.0 |
Steel | 87463 | 39245.894733 | 34000.0 |
310505 | 29683.235742 | 22500.0 |
total_df['Undercarriage_Pad_Width'].fillna('').value_counts()
309782
None or Unspecified 82444
32 inch 5287
28 inch 3152
24 inch 2998
20 inch 2664
30 inch 1602
36 inch 1544
18 inch 1439
34 inch 540
16 inch 481
31 inch 191
27 inch 144
22 inch 135
26 inch 98
33 inch 94
14 inch 51
15 inch 33
25 inch 17
31.5 inch 2
Name: Undercarriage_Pad_Width, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Undercarriage_Pad_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Undercarriage_Pad_Width | |||
31.5 inch | 2 | 108000.000000 | 108000.0 |
25 inch | 17 | 22591.176471 | 20000.0 |
15 inch | 33 | 16803.030303 | 15000.0 |
14 inch | 51 | 16869.607843 | 14500.0 |
33 inch | 94 | 51848.936170 | 47500.0 |
26 inch | 98 | 30831.632653 | 28750.0 |
22 inch | 135 | 33398.518519 | 26500.0 |
27 inch | 144 | 33826.041667 | 28625.0 |
31 inch | 191 | 48378.795812 | 44500.0 |
16 inch | 481 | 20531.702703 | 17500.0 |
34 inch | 540 | 61289.722222 | 60000.0 |
18 inch | 1439 | 20302.710215 | 18000.0 |
36 inch | 1544 | 46214.297280 | 41000.0 |
30 inch | 1602 | 36943.008739 | 30000.0 |
20 inch | 2664 | 29552.346096 | 27500.0 |
24 inch | 2998 | 37054.419613 | 31500.0 |
28 inch | 3152 | 37732.447652 | 33500.0 |
32 inch | 5287 | 48901.324002 | 45500.0 |
None or Unspecified | 82444 | 35017.391514 | 28000.0 |
309782 | 29688.370739 | 22500.0 |
total_df['Stick_Length'].fillna('').value_counts()
310437
None or Unspecified 81539
9' 6" 5832
10' 6" 3519
11' 0" 1601
9' 10" 1463
9' 8" 1462
9' 7" 1423
12' 10" 1087
10' 2" 1004
8' 6" 908
8' 2" 614
10' 10" 414
12' 8" 322
11' 10" 307
8' 4" 274
8' 10" 104
12' 4" 103
9' 5" 101
15' 9" 87
6' 3" 51
13' 7" 11
14' 1" 7
13' 10" 7
13' 9" 7
19' 8" 5
7' 10" 3
15' 4" 3
24' 3" 2
9' 2" 1
Name: Stick_Length, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Stick_Length', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Stick_Length | |||
15' 9" | 87 | 57068.965517 | 55000.0 |
9' 5" | 101 | 46074.257426 | 46000.0 |
12' 4" | 103 | 47505.339806 | 40000.0 |
8' 10" | 104 | 43937.500000 | 42750.0 |
8' 4" | 274 | 33362.226277 | 28000.0 |
11' 10" | 307 | 48237.785016 | 42500.0 |
12' 8" | 322 | 58167.701863 | 56000.0 |
10' 10" | 414 | 60149.275362 | 57000.0 |
8' 2" | 614 | 33933.306189 | 31000.0 |
8' 6" | 908 | 36311.949339 | 31000.0 |
10' 2" | 1004 | 45855.229084 | 44000.0 |
12' 10" | 1087 | 66730.174793 | 66000.0 |
9' 7" | 1423 | 56526.528461 | 55000.0 |
9' 8" | 1462 | 49643.296854 | 47000.0 |
9' 10" | 1463 | 39891.148325 | 36500.0 |
11' 0" | 1601 | 48882.151780 | 45000.0 |
10' 6" | 3519 | 53677.642796 | 50000.0 |
9' 6" | 5832 | 46655.847051 | 42500.0 |
None or Unspecified | 81539 | 32556.659697 | 26000.0 |
310437 | 29683.170383 | 22500.0 |
total_df['Thumb'].fillna('').value_counts()
310366
None or Unspecified 85074
Manual 9678
Hydraulic 7580
Name: Thumb, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Thumb', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Thumb | |||
Hydraulic | 7580 | 38316.014644 | 33000.0 |
Manual | 9678 | 39451.081318 | 34000.0 |
None or Unspecified | 85074 | 35234.459047 | 28000.0 |
310366 | 29683.224368 | 22500.0 |
total_df['Pattern_Changer'].fillna('').value_counts()
310437
None or Unspecified 92924
Yes 9269
No 68
Name: Pattern_Changer, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Pattern_Changer', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Pattern_Changer | |||
No | 68 | 29981.617647 | 27500.0 |
Yes | 9269 | 52301.261085 | 52000.0 |
None or Unspecified | 92924 | 34230.870776 | 28000.0 |
310437 | 29683.170383 | 22500.0 |
total_df['Grouser_Type'].fillna('').value_counts()
310505
Double 86998
Triple 15193
Single 2
Name: Grouser_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Grouser_Type | |||
Single | 2 | 52000.000000 | 52000.0 |
Triple | 15193 | 42755.055947 | 37500.0 |
Double | 86998 | 34667.098784 | 28000.0 |
310505 | 29683.235742 | 22500.0 |
total_df['Backhoe_Mounting'].fillna('').value_counts()
331986
None or Unspecified 80692
Yes 20
Name: Backhoe_Mounting, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Backhoe_Mounting', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Backhoe_Mounting | |||
Yes | 20 | 16462.500000 | 15500.0 |
None or Unspecified | 80692 | 36486.290155 | 30000.0 |
331986 | 29934.882688 | 22500.0 |
total_df['Blade_Type'].fillna('').value_counts()
330823
PAT 39633
Straight 13461
None or Unspecified 11841
Semi U 8907
VPAT 3681
U 1888
Angle 1684
No 743
Landfill 26
Coal 11
Name: Blade_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Blade_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Blade_Type | |||
Coal | 11 | 50227.272727 | 48000.0 |
Landfill | 26 | 69942.307692 | 63750.0 |
No | 743 | 25569.078062 | 21000.0 |
Angle | 1684 | 32742.923990 | 25500.0 |
U | 1888 | 44536.493644 | 35000.0 |
VPAT | 3681 | 62566.143711 | 59000.0 |
Semi U | 8907 | 58495.607387 | 55000.0 |
None or Unspecified | 11841 | 31077.752149 | 25000.0 |
Straight | 13461 | 35548.990120 | 28000.0 |
PAT | 39633 | 30679.544117 | 27000.0 |
330823 | 29949.806359 | 22500.0 |
total_df['Travel_Controls'].fillna('').value_counts()
330821
None or Unspecified 71447
Differential Steer 5257
Finger Tip 2693
2 Pedal 1144
Lever 902
Pedal 423
1 Speed 11
Name: Travel_Controls, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Travel_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Travel_Controls | |||
1 Speed | 11 | 15672.727273 | 11500.0 |
Pedal | 423 | 24667.966903 | 22500.0 |
Lever | 902 | 33187.900222 | 26000.0 |
2 Pedal | 1144 | 25588.002622 | 21000.0 |
Finger Tip | 2693 | 57974.053101 | 55000.0 |
Differential Steer | 5257 | 68752.648849 | 67500.0 |
None or Unspecified | 71447 | 33407.344104 | 27500.0 |
330821 | 29950.385598 | 22500.0 |
total_df['Differential_Type'].fillna('').value_counts()
341134
Standard 70169
Limited Slip 1181
No Spin 212
Locking 2
Name: Differential_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Differential_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Differential_Type | |||
Locking | 2 | 66000.000000 | 66000.0 |
No Spin | 212 | 52889.386792 | 51475.0 |
Limited Slip | 1181 | 57566.553768 | 57000.0 |
Standard | 70169 | 37061.729952 | 32000.0 |
341134 | 29907.683667 | 22500.0 |
total_df['Steering_Controls'].fillna('').value_counts()
341176
Conventional 70774
Command Control 594
Four Wheel Standard 139
Wheel 14
No 1
Name: Steering_Controls, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Steering_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
Steering_Controls | |||
No | 1 | 17500.000000 | 17500.0 |
Wheel | 14 | 21517.857143 | 17500.0 |
Four Wheel Standard | 139 | 24658.273381 | 22000.0 |
Command Control | 594 | 74774.410774 | 75000.0 |
Conventional | 70774 | 37168.659098 | 32000.0 |
341176 | 29907.455419 | 22500.0 |
total_df['fiManufacturerID'].fillna('').value_counts()
26 169003
43 74527
25 42142
103 38928
121 25033
...
5 2
923 1
1518 1
112 1
525 1
Name: fiManufacturerID, Length: 104, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='fiManufacturerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
fiManufacturerID | |||
405 | 1070 | 10707.691589 | 9500.0 |
129 | 1117 | 14373.209490 | 10000.0 |
86 | 1178 | 34583.191851 | 30000.0 |
46 | 1344 | 17102.771577 | 15000.0 |
158 | 1378 | 30334.216255 | 26000.0 |
95 | 1628 | 34679.652948 | 31000.0 |
166 | 1906 | 23662.644281 | 20000.0 |
135 | 2553 | 18356.462593 | 14500.0 |
54 | 2585 | 18324.776402 | 15000.0 |
750 | 3305 | 12966.822995 | 11500.0 |
176 | 4224 | 43841.872633 | 40000.0 |
99 | 5087 | 33323.884411 | 28000.0 |
92 | 5260 | 21678.645437 | 18500.0 |
55 | 5902 | 13283.747035 | 11750.0 |
74 | 11291 | 33328.462935 | 27000.0 |
121 | 25033 | 10637.495067 | 10000.0 |
103 | 38928 | 33710.587988 | 27500.0 |
25 | 42142 | 21210.952589 | 18000.0 |
43 | 74527 | 27908.815087 | 23000.0 |
26 | 169003 | 40321.710425 | 33000.0 |
total_df['PrimarySizeBasis'].fillna('').value_counts()
Horsepower 180262
Weight - Metric Tons 105019
Standard Digging Depth - Ft 77848
Operating Capacity - Lbs 44226
5259
Model 78
Weight - Metric 4
Weight - Lbs 1
Cutting Width - Inches 1
Name: PrimarySizeBasis, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='PrimarySizeBasis', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count | mean | median | |
---|---|---|---|
SalePrice | SalePrice | SalePrice | |
PrimarySizeBasis | |||
Cutting Width - Inches | 1 | 23000.000000 | 23000.0 |
Weight - Lbs | 1 | 27500.000000 | 27500.0 |
Weight - Metric | 4 | 22562.500000 | 22750.0 |
Model | 78 | 25535.256410 | 17750.0 |
5259 | 15941.918616 | 14000.0 | |
Operating Capacity - Lbs | 44226 | 10612.493872 | 10000.0 |
Standard Digging Depth - Ft | 77848 | 21202.527078 | 21000.0 |
Weight - Metric Tons | 105019 | 35654.973148 | 29000.0 |
Horsepower | 180262 | 38455.691062 | 32000.0 |
total_df['PrimaryLower'].fillna('').value_counts()
14.0 62037
20.0 17943
130.0 17113
150.0 15322
85.0 15054
...
1.8 3
4.5 2
25000.0 1
2.7 1
300.0 1
Name: PrimaryLower, Length: 76, dtype: int64
**总结:**发现的规律如下:
从EXCEL表中可以明显看出部分特征是要么一起出现,要么一起空,这些可以在创建衍生特征时做成交叉特征。
部分特征虽然是object字段,但是能排出顺序,比如“Low、Medium、High”等,和‘xx inch’等,后面处理成定序变量。
空值都单独用一列标记。
没有特殊规律的类别特征统一用Label Encoder编码。
与价格的透视表可以为分箱提供依据。
EDA到此堂堂完结!!!!!!!!!!!!!!!!!!!!!!!
def to_sin(n, i):
return round(np.sin((2*np.pi/n)*(i-1)+(2*np.pi/7)) + 1, 2)
total_df['SaleYear'] = total_df['SaleDate'].dt.year
total_df['SaleMonth'] = total_df['SaleDate'].dt.month
total_df['SaleDay'] = total_df['SaleDate'].dt.day
total_df['SaleDayOfWeek'] = total_df['SaleDate'].dt.dayofweek
total_df['SaleDayOfYear'] = total_df['SaleDate'].dt.dayofyear
# 删除原来的SaleDate特征
total_df.drop('SaleDate', axis=1, inplace=True)
# 尝试余弦化拉近1月和12月的距离
total_df['SaleDayOfWeek_sin'] = total_df['SaleDayOfWeek'].map(lambda x: to_sin(7, x)).value_counts()
total_df['SaleMonth_sin'] = total_df['SaleMonth'].map(lambda x: to_sin(12, x)).value_counts()
def combine_features(df, col_list):
temp = df[col_list[0]].astype(str)
for col in col_list[1:]:
temp += df[col].astype(str)
return temp
total_df['Stick__Turbocharged'] = combine_features(total_df, ['Stick', 'Turbocharged'])
total_df['Stick__Turbocharged'].value_counts()
nannan 331602
StandardNone or Unspecified 47981
ExtendedNone or Unspecified 29130
ExtendedYes 2112
StandardYes 1873
Name: Stick__Turbocharged, dtype: int64
total_df['Blade__Blade__Enclosure__Engine'] = combine_features(total_df, ['Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower'])
total_df['Blade__Blade__Enclosure__Engine'].value_counts()
nannannannan 386715
None or UnspecifiedNone or UnspecifiedNone or UnspecifiedNo 8572
None or Unspecified14'None or UnspecifiedNo 7307
None or Unspecified12'None or UnspecifiedNo 4466
None or Unspecified14'Low ProfileNo 1088
None or Unspecified16'None or UnspecifiedNo 818
None or UnspecifiedNone or UnspecifiedLow ProfileNo 435
None or Unspecified12'Low ProfileNo 416
None or Unspecified14'Low ProfileVariable 377
None or Unspecified14'High ProfileNo 348
None or Unspecified14'None or UnspecifiedVariable 343
None or Unspecified13'None or UnspecifiedNo 313
None or UnspecifiedNone or UnspecifiedNone or UnspecifiedVariable 190
Yes14'None or UnspecifiedNo 126
None or Unspecified14'High ProfileVariable 122
None or UnspecifiedNone or UnspecifiedHigh ProfileNo 112
None or Unspecified<12'None or UnspecifiedNo 99
None or Unspecified12'High ProfileNo 96
YesNone or UnspecifiedNone or UnspecifiedNo 86
Yes12'None or UnspecifiedNo 84
None or UnspecifiedNone or UnspecifiedLow ProfileVariable 74
None or Unspecified16'Low ProfileNo 58
Yes14'Low ProfileNo 55
Yes14'Low ProfileVariable 50
Yes12'Low ProfileNo 47
None or Unspecified12'High ProfileVariable 43
None or Unspecified16'High ProfileNo 34
None or UnspecifiedNone or UnspecifiedHigh ProfileVariable 27
None or Unspecified12'Low ProfileVariable 21
Yes14'High ProfileVariable 21
Yes14'None or UnspecifiedVariable 18
None or Unspecified12'None or UnspecifiedVariable 17
None or Unspecified13'Low ProfileNo 16
Yes16'Low ProfileNo 15
Yes14'High ProfileNo 12
Yes16'None or UnspecifiedNo 10
YesNone or UnspecifiedLow ProfileNo 9
YesNone or UnspecifiedNone or UnspecifiedVariable 9
Yes16'High ProfileNo 6
Yes12'High ProfileNo 6
YesNone or UnspecifiedLow ProfileVariable 4
None or Unspecified16'High ProfileVariable 4
None or Unspecified16'None or UnspecifiedVariable 4
None or Unspecified16'Low ProfileVariable 4
Yes13'None or UnspecifiedNo 4
Yes16'Low ProfileVariable 3
Yes12'Low ProfileVariable 3
YesNone or UnspecifiedHigh ProfileNo 3
Yes12'High ProfileVariable 2
Yes16'High ProfileVariable 2
Yes16'None or UnspecifiedVariable 2
None or Unspecified13'None or UnspecifiedVariable 1
None or Unspecified13'High ProfileNo 1
Name: Blade__Blade__Enclosure__Engine, dtype: int64
total_df['Pushblock__Scarifier__TipControl'] = combine_features(total_df, ['Pushblock', 'Scarifier', 'Tip_Control'])
total_df['Pushblock__Scarifier__TipControl'].value_counts()
nannannan 386704
None or UnspecifiedYesNone or Unspecified 6742
None or UnspecifiedNone or UnspecifiedNone or Unspecified 6228
None or UnspecifiedYesSideshift & Tip 3088
YesNone or UnspecifiedNone or Unspecified 2560
None or UnspecifiedNone or UnspecifiedSideshift & Tip 2401
YesYesNone or Unspecified 1302
YesNone or UnspecifiedSideshift & Tip 1182
None or UnspecifiedYesTip 1103
YesYesSideshift & Tip 493
None or UnspecifiedNone or UnspecifiedTip 455
YesYesTip 226
YesNone or UnspecifiedTip 203
nanYesnan 7
nanNone or Unspecifiednan 4
Name: Pushblock__Scarifier__TipControl, dtype: int64
total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'] = combine_features(total_df, ['Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow'])
total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'].value_counts()
nannannan 367724
None or UnspecifiedNone or UnspecifiedStandard 39040
YesNone or UnspecifiedStandard 2289
None or UnspecifiedYesStandard 2149
YesYesStandard 773
None or UnspecifiedNone or UnspecifiedHigh Flow 364
YesNone or UnspecifiedHigh Flow 122
None or Unspecifiednannan 91
None or UnspecifiedYesHigh Flow 57
YesYesHigh Flow 54
None or UnspecifiedYesNone or Unspecified 22
Yesnannan 8
None or UnspecifiedNone or UnspecifiedNone or Unspecified 4
YesNone or UnspecifiedNone or Unspecified 1
Name: CouplerSystem__GrouserTracks__HydraulicsFlow, dtype: int64
col_list = ('Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type').split('\t')
total_df['TUSTPG'] = combine_features(total_df, col_list)
total_df['TUSTPG'].value_counts()
nannannannannannan 309643
SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble 38336
RubberNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble 10031
SteelNone or UnspecifiedNone or UnspecifiedManualNone or UnspecifiedDouble 3900
SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedTriple 3214
...
Steel32 inch8' 2"ManualNone or UnspecifiedDouble 1
Steel30 inch12' 10"ManualNone or UnspecifiedTriple 1
Steel20 inch8' 4"ManualNone or UnspecifiedTriple 1
Steel32 inch8' 4"None or UnspecifiedNone or UnspecifiedTriple 1
Rubber30 inchNone or UnspecifiedHydraulicNone or UnspecifiedDouble 1
Name: TUSTPG, Length: 1065, dtype: int64
col_list = ('Backhoe_Mounting Blade_Type Travel_Controls').split('\t')
total_df['BBT'] = combine_features(total_df, col_list)
total_df['BBT'].value_counts()
nannannan 330802
None or UnspecifiedPATNone or Unspecified 37249
None or UnspecifiedStraightNone or Unspecified 11990
None or UnspecifiedNone or UnspecifiedNone or Unspecified 11106
None or UnspecifiedSemi UNone or Unspecified 6582
None or UnspecifiedSemi UDifferential Steer 2122
None or UnspecifiedVPATFinger Tip 1857
None or UnspecifiedUNone or Unspecified 1816
None or UnspecifiedAngleNone or Unspecified 1560
None or UnspecifiedVPATNone or Unspecified 1092
None or UnspecifiedPATDifferential Steer 1011
None or UnspecifiedStraightDifferential Steer 913
nanNo2 Pedal 728
None or UnspecifiedVPATDifferential Steer 708
None or UnspecifiedPATFinger Tip 584
nanStraight2 Pedal 416
None or UnspecifiedPATLever 412
None or UnspecifiedPATPedal 370
None or UnspecifiedNone or UnspecifiedDifferential Steer 327
None or UnspecifiedNone or UnspecifiedLever 285
None or UnspecifiedAngleDifferential Steer 109
None or UnspecifiedSemi UFinger Tip 97
None or UnspecifiedSemi ULever 96
None or UnspecifiedNone or UnspecifiedFinger Tip 77
None or UnspecifiedStraightFinger Tip 76
None or UnspecifiedStraightLever 56
None or UnspecifiedUDifferential Steer 54
None or UnspecifiedNone or UnspecifiedPedal 33
None or UnspecifiedVPATLever 24
nanNonan 15
None or UnspecifiedLandfillNone or Unspecified 14
None or UnspecifiedULever 14
YesNone or UnspecifiedNone or Unspecified 13
None or UnspecifiedAngleLever 13
None or UnspecifiedLandfillDifferential Steer 12
nannan1 Speed 11
nannanNone or Unspecified 10
None or UnspecifiedSemi UPedal 10
None or UnspecifiedCoalNone or Unspecified 8
YesPATNone or Unspecified 7
None or UnspecifiedStraightPedal 7
nanStraightnan 3
None or UnspecifiedUPedal 2
None or UnspecifiedCoalLever 2
None or UnspecifiedAnglePedal 1
nanUnan 1
None or UnspecifiedUFinger Tip 1
None or UnspecifiedAngleFinger Tip 1
None or UnspecifiedCoalDifferential Steer 1
Name: BBT, dtype: int64
col_list = ('Differential_Type Steering_Controls').split('\t')
total_df['DS'] = combine_features(total_df, col_list)
total_df['DS'].value_counts()
nannan 341120
StandardConventional 69411
Limited SlipConventional 1154
StandardCommand Control 564
No SpinConventional 209
StandardFour Wheel Standard 139
Standardnan 54
Limited SlipCommand Control 27
nanWheel 14
No SpinCommand Control 3
Lockingnan 2
StandardNo 1
Name: DS, dtype: int64
col_list = ('PrimarySizeBasis PrimaryLower').split('\t')
total_df['PP'] = combine_features(total_df, col_list)
total_df['PP'].value_counts()
Standard Digging Depth - Ft14.0 57395
Horsepower20.0 17941
Horsepower130.0 17113
Horsepower150.0 15299
Horsepower85.0 15054
...
Weight - Metric Tons20.0 2
Weight - Lbs25000.0 1
Cutting Width - Inches0.0 1
Weight - Metric Tons2.7 1
Weight - Metric Tons300.0 1
Name: PP, Length: 90, dtype: int64
def is_hybrid(x):
try:
return type(eval(x)) == str
except:
return True
total_df['fiBaseModel_is_hybrid'] = total_df['fiBaseModel'].map(is_hybrid)
result = total_df['ProductGroupDesc'].str.split(' ')
total_df['ProductGroupDesc_father'] = [i[-1] for i in result]
total_df['ProductGroupDesc_father'].value_counts()
Loaders 126412
Excavators 104230
Tractors 82582
Loader 73216
Graders 26258
Name: ProductGroupDesc_father, dtype: int64
def to_dummies(df, col_name):
name_dict = {value: col_name + '_is_' + str(value) for value in df[col_name].unique()}
return pd.get_dummies(df[col_name]).rename(columns=name_dict)
# col_names = ['CouplerSystem__GrouserTracks__HydraulicsFlow']
# for col in col_names:
# total_df = pd.concat([total_df, to_dummies(total_df, col)], axis=1)
# total_df.drop(columns=col_names, inplace=True)
感觉没有明显提升,暂时不做了。
usageband_dict = {
'Low': 1,
'Medium': 2,
'High': 3
}
total_df['UsageBand'] = total_df['UsageBand'].map(usageband_dict).fillna(0)
total_df['UsageBand'].value_counts()
0.0 339028
2.0 35832
1.0 25311
3.0 12527
Name: UsageBand, dtype: int64
total_df['ProductSize_is_missing'] = total_df['ProductSize'].isna()
total_df['ProductSize'].value_counts()
Medium 64342
Large / Medium 51297
Small 27057
Mini 25721
Large 21396
Compact 6280
Name: ProductSize, dtype: int64
ProductSize_dict = {
'Compact': 1,
'Mini': 2,
'Small': 3,
'Medium': 4,
'Large / Medium': 5,
'Large': 6,
}
total_df['ProductSize'] = total_df['ProductSize'].map(ProductSize_dict).fillna(0)
total_df['ProductSize'].value_counts()
0.0 216605
4.0 64342
5.0 51297
3.0 27057
2.0 25721
6.0 21396
1.0 6280
Name: ProductSize, dtype: int64
total_df['Undercarriage_Pad_Width_is_missing'] = total_df['Undercarriage_Pad_Width'].isna()
total_df['Undercarriage_Pad_Width'].value_counts()
None or Unspecified 82444
32 inch 5287
28 inch 3152
24 inch 2998
20 inch 2664
30 inch 1602
36 inch 1544
18 inch 1439
34 inch 540
16 inch 481
31 inch 191
27 inch 144
22 inch 135
26 inch 98
33 inch 94
14 inch 51
15 inch 33
25 inch 17
31.5 inch 2
Name: Undercarriage_Pad_Width, dtype: int64
total_df['Undercarriage_Pad_Width'] = total_df['Undercarriage_Pad_Width'].fillna('0 inch').map(lambda x: 1 if x == 'None or Unspecified' else eval(x[:-4]))
total_df['Undercarriage_Pad_Width'].value_counts()
0.0 309782
1.0 82444
32.0 5287
28.0 3152
24.0 2998
20.0 2664
30.0 1602
36.0 1544
18.0 1439
34.0 540
16.0 481
31.0 191
27.0 144
22.0 135
26.0 98
33.0 94
14.0 51
15.0 33
25.0 17
31.5 2
Name: Undercarriage_Pad_Width, dtype: int64
total_df['Stick_Length_is_missing'] = total_df['Stick_Length'].isna()
total_df['Stick_Length'].value_counts()
None or Unspecified 81539
9' 6" 5832
10' 6" 3519
11' 0" 1601
9' 10" 1463
9' 8" 1462
9' 7" 1423
12' 10" 1087
10' 2" 1004
8' 6" 908
8' 2" 614
10' 10" 414
12' 8" 322
11' 10" 307
8' 4" 274
8' 10" 104
12' 4" 103
9' 5" 101
15' 9" 87
6' 3" 51
13' 7" 11
14' 1" 7
13' 10" 7
13' 9" 7
19' 8" 5
7' 10" 3
15' 4" 3
24' 3" 2
9' 2" 1
Name: Stick_Length, dtype: int64
result = []
for i in total_df['Stick_Length'].fillna(0):
if i == 0:
pass
elif i == 'None or Unspecified':
i = 1
else:
a = i.find("'")
b = i.find('"')
i = round(int(i[:a]) + int(i[a+2:b])/12, 2)
result.append(i)
total_df['Stick_Length'] = result
total_df['Stick_Length'].value_counts()
0.00 310437
1.00 81539
9.50 5832
10.50 3519
11.00 1601
9.83 1463
9.67 1462
9.58 1423
12.83 1087
10.17 1004
8.50 908
8.17 614
10.83 414
12.67 322
11.83 307
8.33 274
8.83 104
12.33 103
9.42 101
15.75 87
6.25 51
13.58 11
14.08 7
13.83 7
13.75 7
19.67 5
7.83 3
15.33 3
24.25 2
9.17 1
Name: Stick_Length, dtype: int64
le = LabelEncoder()
for col in total_df.columns:
# 标记空值(事实证明用处不大。。。,只有一列标记派上了用场)
if total_df[col].isna().sum() > 1000:
total_df[col + '_is_missing'] = total_df[col].isna()
# 编码
if total_df[col].dtype == 'object':
total_df[col] = le.fit_transform(total_df[col])
# 异常值173
total_df['datasource'].map(lambda x: 172 if x == 173 else x)
0 132
1 132
2 132
3 132
4 132
...
412693 149
412694 149
412695 149
412696 149
412697 149
Name: datasource, Length: 412698, dtype: int64
# 这一列的空值用众数填充
total_df['auctioneerID'].fillna(total_df['auctioneerID'].mode()[0], inplace=True)
total_df['MachineHoursCurrentMeter'].fillna(0, inplace=True)
total_df['PrimaryLower'].fillna(0, inplace=True)
total_df.info()
Int64Index: 412698 entries, 0 to 412697
Columns: 109 entries, SalePrice to SaleMonth_sin_is_missing
dtypes: bool(40), float64(10), int32(50), int64(9)
memory usage: 157.4 MB
def drop_dup_cols(df):
cols = df.columns[60:]
n = len(cols)
for i in range(n-1):
if len(df[df[cols[i]] != df[cols[i+1]]]) < 1000:
df.drop(columns=[cols[i]], inplace=True)
drop_dup_cols(total_df)
# 以DataFrame的形式展现特征重要性
def show_feature_imp(model, df):
return pd.DataFrame({df.columns[i]: model.feature_importances_[i] for i in range(len(df.columns))}, index=['imp']).T.sort_values(by='imp', ascending=False)
# 展示两个特征重要性表的共同倒数n%的特征
def show_low_features(df1, df2, ratio):
n = int(len(df1) * ratio)
df1_features = df1.tail(n).index
df2_features = df2.tail(n).index
return n, {feature: [df1.loc[feature].values[0], df2.loc[feature].values[0]] for feature in df1_features if feature in df2_features}
# 以字典的形式显示分数,可显示的分数有:R平方、MAE、MSE、RMSLE
def show_scores(y_test, y_pred):
scores_dict = {}
scores_dict['R2'] = r2_score(y_test, y_pred)
scores_dict['MAE'] = mean_absolute_error(y_test, y_pred)
scores_dict['MSE'] = mean_squared_error(y_test, y_pred)
scores_dict['RMSLE'] = np.sqrt(mean_squared_log_error(y_test, y_pred))
return scores_dict
# 被后续的逐步回归筛出来的特征
drop_list = ['SalePrice', 'MachineID', 'PrimaryLower_is_missing',
'Drive_System_is_missing', 'Forks_is_missing', 'ProductSize_is_missing',
'Ride_Control_is_missing', 'Stick_is_missing', 'Stick_Length_is_missing', 'Transmission_is_missing',
'Turbocharged_is_missing', 'Engine_Horsepower_is_missing', 'Hydraulics_is_missing',
'Pad_Type_is_missing', 'Ripper_is_missing', 'Tip_Control_is_missing', 'Tire_Size_is_missing',
'Coupler_is_missing', 'Hydraulics_Flow_is_missing', 'Grouser_Type_is_missing',
'Backhoe_Mounting_is_missing', 'Travel_Controls_is_missing', 'Steering_Controls_is_missing',
'Pushblock_is_missing', 'SaleDayOfWeek', 'SaleMonth', 'Differential_Type',
'Stick__Turbocharged', 'SaleDayOfYear', 'MachineHoursCurrentMeter_is_missing',
'Scarifier', 'Backhoe_Mounting', 'Forks', 'Track_Type', 'Engine_Horsepower',
'Blade_Extension', 'Coupler_System', 'SaleMonth_sin', 'SaleDayOfWeek_sin', 'SaleMonth_sin_is_missing',
'fiBaseModel_is_hybrid', 'Hydraulics_Flow', 'fiModelDescriptor', 'SaleDay', 'Hydraulics', 'Pushblock', 'Transmission',
'PrimarySizeBasis', 'Stick','DS', 'Tip_Control', 'Grouser_Tracks', 'fiSecondaryDesc', 'fiModelSeries'
]
# 划分训练集、验证集
X = total_df.drop(drop_list, axis=1)
y = total_df['SalePrice']
X_train = X[X['SaleYear']<2012]
X_test = X[X['SaleYear']>=2012]
y_train = y[X['SaleYear']<2012]
y_test = y[X['SaleYear']>=2012]
raise KeyError
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [1044], in ()
----> 1 raise KeyError
KeyError:
|
# rfr = RandomForestRegressor(
# n_estimators = 100,
# max_depth = 23,
# min_samples_leaf = 2,
# min_samples_split = 4,
# n_jobs = -1,
# random_state = 10
# )
# rfr.fit(X_train, y_train)
# show_scores(y_test, rfr.predict(X_test))
# rfr_imp = show_feature_imp(rfr, X_train)
# rfr_imp.head(20)
# rfr_grid = {
# 'n_estimators': [100, 150, 200],
# 'max_depth': [15, 20, 25],
# 'max_features': [0.5, 0.6, 0.7],
# 'min_samples_leaf': [2],
# 'min_samples_split': [4],
# 'max_samples': [2000],
# 'n_jobs': [-1]
# }
# gscv = GridSearchCV(RandomForestRegressor(), param_grid=rfr_grid, cv=5, verbose=True)
# gscv.fit(X_train, y_train)
xgb_grid = {
'n_estimators': [200],
'learning_rate': [0.0835],
'reg_lambda': [0.48, 0.5, 0.52]
}
gscv = GridSearchCV(XGBRegressor(), param_grid=xgb_grid)
gscv.fit(X_train, y_train)
gscv.best_params_
xgb = XGBRegressor(
n_estimators = 250, # 这个很重要
max_depth = 11, # 这个很重要
learning_rate = 0.11, # 这个很重要
reg_lambda = 0.2, # 这个比较重要
n_jobs=-1,
random_state = 13
)
xgb.fit(X_train, y_train)
show_scores(y_test, xgb.predict(X_test))
{'R2': 0.884087856079633,
'MAE': 5745.138324664967,
'MSE': 79616854.10269143,
'RMSLE': 0.23743323282056006}
总结:
调参的时候自己要先清楚哪些参数对模型影响大,最好用gscv来两次,第一次先摸清大概范围,第二次再精确到小数点后1-3位。
逐步回归筛选特征(注意,后续提到的逐步回归都以本文自定义的“青春版”逐步回归函数为准)
后向逐步回归:按照给定的特征列表(按照之前fit一遍的特征重要性排升序)遍历。每丢一个特征计算一次RMSLE,如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。
前向逐步回归:按照给定的特征列表(按照之前fit一遍的特征重要性排序)遍历。每丢一个特征计算一次RMSLE,如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。
如果希望更稳重,不要如此轻易就丢弃一个特征,可选的函数改进方式是改为二重循环:每一轮都将每个特征遍历一遍,计算前后的得分(或者误差)差值gain,剔除gain最小/大的特征,然后再进入下一轮,直到达到最大迭代次数或gain达到阈值。
def step_wise_reg(model, X_train, X_test, y_train, y_test, suspicion_list, method='backward', tol=0, show_info=True, dilution_ratio=1):
X_train = X_train[::dilution_ratio]
y_train = y_train[::dilution_ratio]
if method == 'backward':
drop_list = []
elif method == 'forward':
drop_list = list(suspicion_list)
else:
raise KeyError
model.fit(X_train.drop(columns=drop_list), y_train)
old_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))
for last_feature in suspicion_list:
if method == 'backward':
drop_list.append(last_feature)
else:
drop_list.remove(last_feature)
model.fit(X_train.drop(columns=drop_list), y_train)
new_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))
if show_info:
print(old_score, new_score, drop_list)
if old_score - new_score < tol:
if method == 'backward':
drop_list.remove(last_feature)
else:
drop_list.append(last_feature)
else:
old_score = new_score
return old_score, new_score, drop_list
old_score, new_score, add_drop_list1 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='backward', suspicion_list=show_feature_imp(xgb, X_train).index[::-1])
add_drop_list1
old_score, new_score, add_drop_list2 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='forward', tol=0.0001,suspicion_list=show_feature_imp(xgb, X_train).index[1:])
add_drop_list2
总结:
逐步回归十分好用,过一遍特征,误差就降下去了,推测特征被筛出去的原因主要是:
和其他特征的信息有一定相关性,即存在冗余信息。简单计算一下,就算全部是二分类特征,2的20次方也能表示出100万个不同的样本了,因此对于此次只有40万条数据的数据集来说,50+的特征绝对是存在冗余的,会干扰模型判断。
本身是个好特征,但没有经过恰当的分箱/编码处理,导致蕴含的信息不能传达给模型,反而起到了负提升。这就只有根据业务逻辑反复尝试了,一边尝试分箱,一边要回来调参。
本身是好特征,预处理也到位了,但模型进行逐步回归时没有同步调参,导致误差反而上升。尤其是前向逐步回归,从1个特征增加到50个特征,跨度如此之大,模型需要的参数必然是不同的。
以上2、3两点就造成了误筛。所以使用逐步回归时要注意:被筛出的特征也不一定就是没用的特征,还得自己判断。
show_scores(y_test, xgb.predict(X_test))
{'R2': 0.884087856079633,
'MAE': 5745.138324664967,
'MSE': 79616854.10269143,
'RMSLE': 0.23743323282056006}
xgb_imp = show_feature_imp(xgb, X_train)
xgb_imp
imp | |
---|---|
CouplerSystem__GrouserTracks__HydraulicsFlow | 0.397497 |
ProductSize | 0.249447 |
PP | 0.079305 |
Ride_Control | 0.042499 |
fiManufacturerID | 0.037125 |
YearMade | 0.031770 |
PrimaryLower | 0.027179 |
ProductGroupDesc_father | 0.015217 |
SaleYear | 0.012649 |
fiModelDesc | 0.012630 |
Blade_Width | 0.010005 |
Pushblock__Scarifier__TipControl | 0.009107 |
Drive_System | 0.006723 |
Ripper | 0.006685 |
Enclosure | 0.005481 |
Blade__Blade__Enclosure__Engine | 0.004808 |
Pad_Type | 0.004427 |
fiProductClassDesc | 0.004194 |
ProductGroupDesc | 0.003965 |
Travel_Controls | 0.003888 |
Tire_Size | 0.003268 |
Pattern_Changer | 0.003167 |
ModelID | 0.002938 |
UsageBand | 0.002606 |
BBT | 0.002432 |
Steering_Controls | 0.002027 |
Turbocharged | 0.001827 |
fiBaseModel | 0.001778 |
Blade_Type | 0.001742 |
Undercarriage_Pad_Width | 0.001496 |
Stick_Length | 0.001339 |
TUSTPG | 0.001334 |
Thumb | 0.001193 |
Coupler | 0.001181 |
Grouser_Type | 0.001138 |
MachineHoursCurrentMeter | 0.001100 |
state | 0.001093 |
auctioneerID_is_missing | 0.001045 |
Enclosure_Type | 0.001007 |
auctioneerID | 0.000875 |
datasource | 0.000812 |