【Python机器学习】回归模型:推土机售价预测

文章目录

  • 使用机器学习预测推土机的售价
  • 零、导入模块
  • 一、EDA
    • 1.1 查看基本信息
    • 1.2 特征类型转换
    • 1.3 联表+特征初筛
      • 1.3.1 删除包含重复信息的特征
      • 1.3.2 fiBaseModel
      • 1.3.3 fiSecondaryDesc
      • 1.3.4 fiModelSeries
      • 1.3.5 fiModelDescriptor
      • 1.3.6 ProductGroupDesc
      • 1.3.7 MfgYear
      • 1.3.8 fiManufacturerID、PrimarySizeBasis、PrimaryLower、PrimaryUpper
    • 1.4 逐个查看特征
      • 1.4.1 datasource
      • 1.4.2 auctioneerID
      • 1.4.3 YearMade
      • 1.4.4 MachineHoursCurrentMeter
      • 1.4.5 UsageBand
      • 1.4.6 fiBaseModel
      • 1.4.7 fiSecondaryDesc
      • 1.4.8 fiModelSeries
      • 1.4.9 fiModelDescriptor
      • 1.4.10 ProductSize
      • 1.4.11 state
      • 1.4.12 ProductGroupDesc
      • 1.4.13 Drive_System
      • 1.4.14 Enclosure
      • 1.4.15 Forks
      • 1.4.16 Pad_Type
      • 1.4.17 Ride_Control
      • 1.4.18 Stick
      • 1.4.19 Transmission
      • 1.4.20 Turbocharged
      • 1.4.21 Blade_Extension
      • 1.4.22 Blade_Width
      • 1.4.23 Enclosure_Type
      • 1.4.24 Engine_Horsepower
      • 1.4.24 Hydraulics
      • 1.4.25 Pushblock
      • 1.4.26 Ripper
      • 1.4.27 Scarifier
      • 1.4.28 Tip_Control
      • 1.4.29 Tire_Size
      • 1.4.30 Coupler
      • 1.4.31 Coupler_System
      • 1.4.32 Grouser_Tracks
      • 1.4.33 Hydraulics_Flow
      • 1.4.34 Track_Type
      • 1.4.35 Undercarriage_Pad_Width
      • 1.4.36 Stick_Length
      • 1.4.37 Thumb
      • 1.4.38 Pattern_Changer
      • 1.4.39 Grouser_Type
      • 1.4.40 Backhoe_Mounting
      • 1.4.41 Blade_Type
      • 1.4.42 Travel_Controls
      • 1.4.43 Differential_Type
      • 1.4.44 Steering_Controls
      • 1.4.45 fiManufacturerDesc
      • 1.4.46 PrimarySizeBasis
      • 1.4.47 PrimaryLower
  • 2. 数据清洗+数据预处理
    • 2.1 创建衍生变量
      • 2.1.1 SaleDate
      • 2.1.2 Stick、Turbocharged
      • 2.1.3 Blade_Extension、Blade_Width、Enclosure_Type、Engine_Horsepower
      • 2.1.4 Pushblock、Scarifier、Tip_Control
      • 2.1.5 Coupler_System、Grouser_Tracks、Hydraulics_Flow
      • 2.1.6 Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type
      • 2.1.7 Backhoe_Mounting Blade_Type Travel_Controls
      • 2.1.8 Differential_Type Steering_Controls
      • 2.1.9 PrimarySizeBasis PrimaryLower
      • 2.1.10 fiBaseModel
      • 2.1.11 ProductGroupDesc
    • 2.2 独热编码
    • 2.3 标签编码
      • 2.3.1 UsageBand
      • 2.3.2 ProductSize
      • 2.3.3 Undercarriage_Pad_Width
      • 2.3.4 Stick_Length
      • 2.3.2 其余随机标签编码
    • 2.4 空值和异常值
    • 2.5 删除重复列
  • 3. 建模
    • 3.1 RF
    • 3.2 XGB
  • 4. 特征筛选
  • 5. 最终成果

使用机器学习预测推土机的售价

1. 定义问题

考虑到推土机的特性,利用过去的数据,我们能多大程度上预测它未来的价格?

2. 数据来源

kaggle:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview

The data for this competition is split into three parts:

  • Train.csv is the training set, which contains data through the end of 2011.
  • Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
  • Test.csv is the test set, which won’t be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

3. 评价标准

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

更多信息:https://www.kaggle.com/competitions/bluebook-for-bulldozers/overview/evaluation

注意:多数回归模型的评价标准都是减小误差。比如这次的目标就是最小化RMSLE。

4. 使用的特征

特征过多,请自行进入kaggle项目主页查看。或点击如下谷歌表格链接:https://docs.google.com/spreadsheets/d/1EIbdGa4S_46USXgg0OHX5jgTc8ld9fTHwPyi_VOV1as/edit#gid=0

零、导入模块

# EDA
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
sns.set()
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
%config InlineBackend.figure_config = 'svg'
warnings.filterwarnings("ignore")

# 数据预处理
from sklearn.preprocessing import LabelEncoder

# sklearn模型
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# 模型评估
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, mean_squared_error, r2_score

一、EDA

bulldozer_df = pd.read_csv('bluebook-for-bulldozers/TrainAndValid.csv',
                           low_memory = False)
appendix_df = pd.read_csv('bluebook-for-bulldozers/Machine_Appendix.csv',
                           low_memory = False)

1.1 查看基本信息

# 查看各字段类型
bulldozer_df.info()

RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object 
 9   saledate                  412698 non-null  object 
 10  fiModelDesc               412698 non-null  object 
 11  fiBaseModel               412698 non-null  object 
 12  fiSecondaryDesc           271971 non-null  object 
 13  fiModelSeries             58667 non-null   object 
 14  fiModelDescriptor         74816 non-null   object 
 15  ProductSize               196093 non-null  object 
 16  fiProductClassDesc        412698 non-null  object 
 17  state                     412698 non-null  object 
 18  ProductGroup              412698 non-null  object 
 19  ProductGroupDesc          412698 non-null  object 
 20  Drive_System              107087 non-null  object 
 21  Enclosure                 412364 non-null  object 
 22  Forks                     197715 non-null  object 
 23  Pad_Type                  81096 non-null   object 
 24  Ride_Control              152728 non-null  object 
 25  Stick                     81096 non-null   object 
 26  Transmission              188007 non-null  object 
 27  Turbocharged              81096 non-null   object 
 28  Blade_Extension           25983 non-null   object 
 29  Blade_Width               25983 non-null   object 
 30  Enclosure_Type            25983 non-null   object 
 31  Engine_Horsepower         25983 non-null   object 
 32  Hydraulics                330133 non-null  object 
 33  Pushblock                 25983 non-null   object 
 34  Ripper                    106945 non-null  object 
 35  Scarifier                 25994 non-null   object 
 36  Tip_Control               25983 non-null   object 
 37  Tire_Size                 97638 non-null   object 
 38  Coupler                   220679 non-null  object 
 39  Coupler_System            44974 non-null   object 
 40  Grouser_Tracks            44875 non-null   object 
 41  Hydraulics_Flow           44875 non-null   object 
 42  Track_Type                102193 non-null  object 
 43  Undercarriage_Pad_Width   102916 non-null  object 
 44  Stick_Length              102261 non-null  object 
 45  Thumb                     102332 non-null  object 
 46  Pattern_Changer           102261 non-null  object 
 47  Grouser_Type              102193 non-null  object 
 48  Backhoe_Mounting          80712 non-null   object 
 49  Blade_Type                81875 non-null   object 
 50  Travel_Controls           81877 non-null   object 
 51  Differential_Type         71564 non-null   object 
 52  Steering_Controls         71522 non-null   object 
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB
# 查看缺失值
bulldozer_df.isna().sum()
SalesID                          0
SalePrice                        0
MachineID                        0
ModelID                          0
datasource                       0
auctioneerID                 20136
YearMade                         0
MachineHoursCurrentMeter    265194
UsageBand                   339028
saledate                         0
fiModelDesc                      0
fiBaseModel                      0
fiSecondaryDesc             140727
fiModelSeries               354031
fiModelDescriptor           337882
ProductSize                 216605
fiProductClassDesc               0
state                            0
ProductGroup                     0
ProductGroupDesc                 0
Drive_System                305611
Enclosure                      334
Forks                       214983
Pad_Type                    331602
Ride_Control                259970
Stick                       331602
Transmission                224691
Turbocharged                331602
Blade_Extension             386715
Blade_Width                 386715
Enclosure_Type              386715
Engine_Horsepower           386715
Hydraulics                   82565
Pushblock                   386715
Ripper                      305753
Scarifier                   386704
Tip_Control                 386715
Tire_Size                   315060
Coupler                     192019
Coupler_System              367724
Grouser_Tracks              367823
Hydraulics_Flow             367823
Track_Type                  310505
Undercarriage_Pad_Width     309782
Stick_Length                310437
Thumb                       310366
Pattern_Changer             310437
Grouser_Type                310505
Backhoe_Mounting            331986
Blade_Type                  330823
Travel_Controls             330821
Differential_Type           341134
Steering_Controls           341176
dtype: int64
# 查看标签分布
bulldozer_df['SalePrice'].hist()

【Python机器学习】回归模型:推土机售价预测_第1张图片

1.2 特征类型转换

# 改为帕斯卡命名
bulldozer_df.rename(columns={'saledate': 'SaleDate'}, inplace=True)
bulldozer_df['SaleDate'] = pd.to_datetime(bulldozer_df['SaleDate'])
# 按时间查看售价,只看最近几年的,没有明显的上升趋势,只有上下波动
plt.figure(figsize=(20,15))
pd.pivot_table(bulldozer_df[200000::1000], index='SaleDate', values='SalePrice').plot()
plt.show()

【Python机器学习】回归模型:推土机售价预测_第2张图片

2005年前后半年和2008年整个一年销量都比较差,除此之外看不出太多信息

# 将数据集按时间排序
bulldozer_df.sort_values(by='SaleDate', inplace=True)
bulldozer_df['SaleDate'].head(20)
205615   1989-01-17
274835   1989-01-31
141296   1989-01-31
212552   1989-01-31
62755    1989-01-31
54653    1989-01-31
81383    1989-01-31
204924   1989-01-31
135376   1989-01-31
113390   1989-01-31
113394   1989-01-31
116419   1989-01-31
32138    1989-01-31
127610   1989-01-31
76171    1989-01-31
127000   1989-01-31
128130   1989-01-31
127626   1989-01-31
55455    1989-01-31
55454    1989-01-31
Name: SaleDate, dtype: datetime64[ns]

1.3 联表+特征初筛

这个数据集比较特殊,还有个appendix表,里面是一些推土机的配件信息,而且这个信息和train表有重复特征,重复特征里面还有匹配不上的情况,现在先联表上来看看

# 制作数据集副本,这是为了方便对数据集做了什么操作后,仍然可以获取原始数据,而不用从头读数据
bd_df = bulldozer_df.copy()
app_df = appendix_df.copy()
# SalesID列丢弃
bd_df.drop('SalesID', axis=1, inplace=True)
# 定义一个查看出入的函数
def check_difference(df1, df2, target_col, on_col):
    temp_df = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    return temp_df[(temp_df[target_col+'_x'] != temp_df[target_col+'_y'])]
# 定义一个合并时互补的函数,冲突时保留df1的数据
def combine(df1, df2, target_col, on_col):
    temp_df0 = pd.merge(df1[[on_col, target_col]], df2[[on_col, target_col]], how='left', on=on_col)
    temp_df0.fillna('', inplace=True)
    temp_df1=temp_df0[(temp_df0[target_col+'_x']== '') & (temp_df0[target_col+'_y']!='')]
    temp_df0[target_col+'_x'].loc[temp_df1.index] = temp_df1[target_col+'_y']
    df1[target_col] = temp_df0[target_col+'_x']
    df2.drop(columns=[target_col], inplace=True)
    return df1

1.3.1 删除包含重复信息的特征

ProductGroup是ProductGroupDesc的首字母缩写版,选择保留后者

fiManufacturerID和fiManufacturerDesc包含的信息一样,选择保留前者

bd_df.drop(columns=['ProductGroup', 'fiProductClassDesc'], inplace=True)
app_df.drop(columns=['ModelID', 'fiModelDesc', 'ProductGroup', 'fiManufacturerDesc'], inplace=True)

1.3.2 fiBaseModel

# 查看枚举值,后续需要做分箱合并处理
bd_df['fiBaseModel'].value_counts()
580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64
# 查看出入部分有无空值
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID').isna().sum()
MachineID        0
fiBaseModel_x    0
fiBaseModel_y    0
dtype: int64
# 查看出入部分
check_difference(bd_df, app_df, 'fiBaseModel', 'MachineID')
MachineID fiBaseModel_x fiBaseModel_y
71 1523610 WA150 HD465
98 1059447 WA300 PC100
123 1303779 415 862
128 1340389 MS240 MS120
179 1208516 D31 PC100
... ... ... ...
412474 2308891 1845 75
412484 2287735 T135 T133
412509 2292146 450 465
412655 1846321 TB135 TB125
412695 1918416 337 530

12452 rows × 3 columns

# 选择保留bd_df表的数据
app_df.drop(columns=['fiBaseModel'], inplace=True)

1.3.3 fiSecondaryDesc

# 查看枚举值,后续需要做分箱合并处理
bd_df['fiSecondaryDesc'].value_counts()
C         44431
B         40165
G         37915
H         24729
E         21532
          ...  
BLGPPS        1
MSR           1
LC7A          1
CL            1
BH            1
Name: fiSecondaryDesc, Length: 177, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')
MachineID fiSecondaryDesc_x fiSecondaryDesc_y
0 1126363 NaN NaN
1 1194089 NaN NaN
3 1327630 NaN NaN
6 1082797 NaN NaN
7 1527216 NaN NaN
... ... ... ...
412690 1823846 NaN NaN
412691 1278794 NaN NaN
412692 1792049 NaN NaN
412694 1919104 NaN NaN
412695 1918416 G NaN

147009 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiSecondaryDesc', 'MachineID')

1.3.4 fiModelSeries

# 查看枚举值,后续需要做分箱合并处理
bd_df['fiModelSeries'].value_counts()
II      13770
LC       9175
III      5351
-1       4646
-2       4033
        ...  
LL          1
6F          1
-2LC        1
-5A         1
VII         1
Name: fiModelSeries, Length: 123, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelSeries', 'MachineID')
MachineID fiModelSeries_x fiModelSeries_y
0 1126363 NaN NaN
1 1194089 NaN NaN
2 1473654 NaN NaN
3 1327630 NaN NaN
4 1336053 NaN NaN
... ... ... ...
412693 1915521 NaN NaN
412694 1919104 NaN NaN
412695 1918416 NaN NaN
412696 509560 NaN NaN
412697 1869284 NaN NaN

362366 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelSeries', 'MachineID')

1.3.5 fiModelDescriptor

# 查看枚举值
bd_df['fiModelDescriptor'].value_counts()
L            16464
LGP          16143
LC           13295
XL            6700
6             2944
             ...  
K5               1
HighLift         1
High Lift        1
III              1
SL               1
Name: fiModelDescriptor, Length: 140, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'fiModelDescriptor', 'MachineID')
MachineID fiModelDescriptor_x fiModelDescriptor_y
0 1126363 NaN NaN
1 1194089 NaN NaN
2 1473654 NaN NaN
3 1327630 NaN NaN
4 1336053 NaN NaN
... ... ... ...
412693 1915521 NaN NaN
412694 1919104 NaN NaN
412695 1918416 NaN NaN
412696 509560 NaN NaN
412697 1869284 NaN NaN

342069 rows × 3 columns

# 合并互补
bd_df = combine(bd_df, app_df, 'fiModelDescriptor', 'MachineID')

1.3.6 ProductGroupDesc

# 查看枚举值
bd_df['ProductGroupDesc'].value_counts()
Track Excavators       104230
Track Type Tractors     82582
Backhoe Loaders         81401
Wheel Loader            73216
Skid Steer Loaders      45011
Motor Graders           26258
Name: ProductGroupDesc, dtype: int64
# 枚举值太多了,而且bd_df里的数据并无空值,这里就选择不互补了,直接保留bd_df的数据
app_df['ProductGroupDesc'].value_counts()
Track Excavators                 89094
Backhoe Loaders                  74074
Track Type Tractors              67362
Wheel Loader                     62426
Skid Steer Loaders               42121
Motor Graders                    22602
Track Loaders                      158
Articulated Trucks                 109
Ag Tractors                        103
Wheel Tractor Scraper              102
Off Highway Trucks                  81
Multi Terrain Loaders               66
Wheel Excavator                     66
Forklift                            53
Skidders                            46
Wheel Feller Buncher                31
Forestry Log Loaders                25
Pipelayers                          11
Telehandler                          9
Wheel Dozer                          7
Knuckleboom Loaders                  6
Track Feller Bunchers                6
Vibratory Double Drum Asphalt        4
Vibratory Single Drum Asphalt        4
Work Tool                            3
Pneumatic Tired Compactor            3
Compactors                           3
Vibratory Single Drum Pad            3
Tandem Roller Static                 2
Harvesters                           2
Forwarders                           2
Engine, Industrial OEM               2
Track Harvesters                     1
Delimber Forestry                    1
Asphalt/Concrete Pavers              1
Vibratory Single Drum Smooth         1
Crane/Dragline                       1
Forestry Excavators                  1
Cold Planers                         1
Name: ProductGroupDesc, dtype: int64
app_df.drop(columns=['ProductGroupDesc'], inplace=True)

1.3.7 MfgYear

# 两边同一个特征名字不一样,先改名
app_df.rename(columns={'MfgYear': 'YearMade'}, inplace=True)
# 查看枚举值,发现异常值1000
bd_df['YearMade'].value_counts()
1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64
# 查看不一致部分
check_difference(bd_df, app_df, 'YearMade', 'MachineID')
MachineID YearMade_x YearMade_y
52 1531656 1975 1987.0
58 1078132 1967 1966.0
71 1523610 1986 2007.0
98 1059447 1984 1996.0
122 1525180 1984 1985.0
... ... ... ...
412651 1631729 1000 1990.0
412654 1912106 1000 1996.0
412655 1846321 1000 2005.0
412667 1897535 1998 1999.0
412696 509560 1993 1990.0

36406 rows × 3 columns

这种不一致通常是因为bd_df里面有重复的MachineID,即同一种推土机被卖了多次,而app_df里只有唯一MachineID所导致,决定保留bd_df的数据,不动它

app_df.drop('YearMade', axis=1, inplace=True)

1.3.8 fiManufacturerID、PrimarySizeBasis、PrimaryLower、PrimaryUpper

这三个特征原训练集上都没有,直接联过去

# total_df = bd_df.copy()
total_df = pd.merge(bd_df, app_df, how='left', on='MachineID')
# 动力上限和下限留一个就足够区分了,取下限
total_df.drop(columns=['PrimaryUpper'], inplace=True)
total_df.info()

Int64Index: 412698 entries, 0 to 412697
Data columns (total 54 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   SalePrice                 412698 non-null  float64       
 1   MachineID                 412698 non-null  int64         
 2   ModelID                   412698 non-null  int64         
 3   datasource                412698 non-null  int64         
 4   auctioneerID              392562 non-null  float64       
 5   YearMade                  412698 non-null  int64         
 6   MachineHoursCurrentMeter  147504 non-null  float64       
 7   UsageBand                 73670 non-null   object        
 8   SaleDate                  412698 non-null  datetime64[ns]
 9   fiModelDesc               412698 non-null  object        
 10  fiBaseModel               412698 non-null  object        
 11  fiSecondaryDesc           412698 non-null  object        
 12  fiModelSeries             412698 non-null  object        
 13  fiModelDescriptor         412698 non-null  object        
 14  ProductSize               196093 non-null  object        
 15  state                     412698 non-null  object        
 16  ProductGroupDesc          412698 non-null  object        
 17  Drive_System              107087 non-null  object        
 18  Enclosure                 412364 non-null  object        
 19  Forks                     197715 non-null  object        
 20  Pad_Type                  81096 non-null   object        
 21  Ride_Control              152728 non-null  object        
 22  Stick                     81096 non-null   object        
 23  Transmission              188007 non-null  object        
 24  Turbocharged              81096 non-null   object        
 25  Blade_Extension           25983 non-null   object        
 26  Blade_Width               25983 non-null   object        
 27  Enclosure_Type            25983 non-null   object        
 28  Engine_Horsepower         25983 non-null   object        
 29  Hydraulics                330133 non-null  object        
 30  Pushblock                 25983 non-null   object        
 31  Ripper                    106945 non-null  object        
 32  Scarifier                 25994 non-null   object        
 33  Tip_Control               25983 non-null   object        
 34  Tire_Size                 97638 non-null   object        
 35  Coupler                   220679 non-null  object        
 36  Coupler_System            44974 non-null   object        
 37  Grouser_Tracks            44875 non-null   object        
 38  Hydraulics_Flow           44875 non-null   object        
 39  Track_Type                102193 non-null  object        
 40  Undercarriage_Pad_Width   102916 non-null  object        
 41  Stick_Length              102261 non-null  object        
 42  Thumb                     102332 non-null  object        
 43  Pattern_Changer           102261 non-null  object        
 44  Grouser_Type              102193 non-null  object        
 45  Backhoe_Mounting          80712 non-null   object        
 46  Blade_Type                81875 non-null   object        
 47  Travel_Controls           81877 non-null   object        
 48  Differential_Type         71564 non-null   object        
 49  Steering_Controls         71522 non-null   object        
 50  fiProductClassDesc        412698 non-null  object        
 51  fiManufacturerID          412698 non-null  int64         
 52  PrimarySizeBasis          407439 non-null  object        
 53  PrimaryLower              407439 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(5), object(44)
memory usage: 173.2+ MB
raise KeyError
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Input In [903], in ()
----> 1 raise KeyError


KeyError: 

1.4 逐个查看特征

1.4.1 datasource

# 173怀疑是172写错成173
total_df['datasource'].value_counts()
132    260776
136     75491
149     33325
121     25191
172     17914
173         1
Name: datasource, dtype: int64

1.4.2 auctioneerID

# 后续要合并
total_df['auctioneerID'].value_counts()
1.0     192773
2.0      57441
3.0      30288
4.0      20877
99.0     12042
6.0      11950
7.0       7847
8.0       7419
5.0       7002
10.0      5876
9.0       4764
11.0      3823
12.0      3610
13.0      3068
18.0      2359
14.0      2277
20.0      2238
19.0      2074
16.0      1807
15.0      1742
21.0      1601
22.0      1429
24.0      1357
23.0      1322
17.0      1275
27.0      1150
25.0       959
28.0       860
26.0       796
0.0        536
Name: auctioneerID, dtype: int64
# 查看auctioneerID和价格的关系
temp = pd.pivot_table(total_df, index='auctioneerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('median', 'SalePrice'))
count mean median
SalePrice SalePrice SalePrice
auctioneerID
22.0 1429 18230.020994 14000.0
18.0 2359 18839.799491 14500.0
21.0 1601 19645.315428 15000.0
25.0 959 22478.519291 16000.0
9.0 4764 22515.003149 16775.0
14.0 2277 20804.797980 17000.0
17.0 1275 22974.941176 18000.0
12.0 3610 24762.936288 18250.0
28.0 860 28312.674419 18750.0
20.0 2238 24737.811439 19250.0
99.0 12042 26958.806112 19500.0
13.0 3068 27017.441330 20500.0
16.0 1807 26261.427781 20500.0
27.0 1150 27606.739130 22500.0
4.0 20877 29825.726876 23000.0
23.0 1322 30613.729198 23000.0
2.0 57441 29023.051670 23000.0
10.0 5876 29561.546971 23000.0
0.0 536 29979.850746 23000.0
15.0 1742 29986.337543 23500.0
5.0 7002 29150.217795 23500.0
24.0 1357 33041.156964 24500.0
1.0 192773 32684.870075 25000.0
8.0 7419 32477.978838 25000.0
11.0 3823 32707.088674 25500.0
3.0 30288 33596.962592 26000.0
6.0 11950 34708.070628 27000.0
26.0 796 36157.412060 28000.0
7.0 7847 36288.450236 28500.0
19.0 2074 42715.766635 34000.0
# 画图查看
temp.sort_values(by=('median', 'SalePrice'))[('median', 'SalePrice')].plot(kind='bar')
plt.show()

【Python机器学习】回归模型:推土机售价预测_第3张图片

1.4.3 YearMade

total_df['YearMade'] = total_df['YearMade'].astype(int)
total_df['YearMade'].value_counts()
1000    39391
2005    22096
1998    21751
2004    20914
1999    19274
        ...  
2012        1
1949        1
1942        1
2013        1
1937        1
Name: YearMade, Length: 73, dtype: int64
plt.figure(figsize=(14,10))
sns.countplot(total_df['YearMade'])
plt.xticks(rotation=90)
plt.show()

【Python机器学习】回归模型:推土机售价预测_第4张图片

第一台拖拉机1904年才发明出来,1904年前的属于异常数据。同时该数据集截至年份是2012年,大于2012的属于异常。异常值后续增加一个新的衍生变量YearMade_is_error区分。

生产时间大于销售时间的,暂时认为是提前订货,不处理。

1.4.4 MachineHoursCurrentMeter

# 查看枚举值
total_df['MachineHoursCurrentMeter'].value_counts()
0.0        73834
2000.0       124
1000.0       117
24.0         115
1500.0       101
           ...  
10834.0        1
3499.0         1
26270.0        1
26901.0        1
17920.0        1
Name: MachineHoursCurrentMeter, Length: 15633, dtype: int64

1.4.5 UsageBand

# 查看枚举值
total_df['UsageBand'].value_counts()
Medium    35832
Low       25311
High      12527
Name: UsageBand, dtype: int64

1.4.6 fiBaseModel

total_df['fiBaseModel'].value_counts()
580      20179
310      17886
D6       13527
416      12900
D5        9636
         ...  
56           1
B4230        1
IS30         1
MM555        1
WLK15        1
Name: fiBaseModel, Length: 1961, dtype: int64

1.4.7 fiSecondaryDesc

total_df['fiSecondaryDesc'].value_counts().head(10)
     136420
C     44658
B     40446
G     38139
H     24759
E     21944
D     20132
F      9454
K      8089
A      5968
Name: fiSecondaryDesc, dtype: int64
temp = pd.pivot_table(total_df, index='fiSecondaryDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
fiSecondaryDesc
XT 2177 29736.816720 23000.0
LE 2510 31586.065737 23000.0
R 3577 31164.435840 24000.0
N 3844 31028.674558 23500.0
SUPER L 3920 31237.168367 24000.0
P 4475 32841.134078 26000.0
J 4655 31069.470892 24000.0
LC 4862 31265.127931 24000.0
M 5398 32225.976288 24500.0
L 5669 31634.053625 24500.0
A 5968 31266.457775 24000.0
K 8089 30815.391396 23000.0
F 9454 31958.080389 25000.0
D 20132 30766.167644 24000.0
E 21944 31089.640357 24000.0
H 24759 31279.811059 24000.0
G 38139 30902.040200 23500.0
B 40446 31238.508530 24000.0
C 44658 31024.103699 24000.0
136420 31426.038213 24000.0

1.4.8 fiModelSeries

total_df['fiModelSeries'].value_counts().head(10)
       350453
II      14039
LC       9609
III      5392
-1       5142
-2       4350
-6       3538
-3       2783
-5       2664
-12      1447
Name: fiModelSeries, dtype: int64

1.4.9 fiModelDescriptor

total_df['fiModelDescriptor'].value_counts().head(10)
       333040
L       16676
LGP     16541
LC      15666
XL       6704
6        3238
LT       2681
5        2455
3        2149
CR       1798
Name: fiModelDescriptor, dtype: int64
temp = pd.pivot_table(total_df, index='fiModelDescriptor', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
fiModelDescriptor
K 349 29030.372493 22000.0
8 351 33369.800570 27000.0
E 359 34427.576602 25500.0
SR 383 28176.827676 20000.0
ZTS 442 28579.751131 21250.0
Z 525 27814.190476 21000.0
2 627 32085.047847 26000.0
SSR 684 33503.508772 27000.0
H 1092 30992.971612 24000.0
7 1115 30983.373991 24000.0
CR 1798 28777.157953 22000.0
3 2149 31477.175430 24000.0
5 2455 30897.668024 24000.0
LT 2681 28216.729206 21000.0
6 3238 30573.305127 24000.0
XL 6704 30564.901700 23000.0
LC 15666 30387.232989 23000.0
LGP 16541 31327.267033 24000.0
L 16676 30739.471036 24000.0
333040 31340.137891 24000.0

1.4.10 ProductSize

total_df['ProductSize'].fillna('').value_counts()
                  216605
Medium             64342
Large / Medium     51297
Small              27057
Mini               25721
Large              21396
Compact             6280
Name: ProductSize, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='ProductSize', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
ProductSize
Compact 6280 17498.578822 15000.0
Large 21396 42023.433315 34000.0
Mini 25721 15194.647720 13500.0
Small 27057 32511.164061 29000.0
Large / Medium 51297 47828.923231 44000.0
Medium 64342 45703.873473 41000.0
216605 24047.383408 19000.0

1.4.11 state

# 查看数据分布
total_df['state'].fillna('').value_counts()
Florida           67320
Texas             53110
California        29761
Washington        16222
Georgia           14633
Maryland          13322
Mississippi       13240
Ohio              12369
Illinois          11540
Colorado          11529
New Jersey        11156
North Carolina    10636
Tennessee         10298
Alabama           10292
Pennsylvania      10234
South Carolina     9951
Arizona            9364
New York           8639
Connecticut        8276
Minnesota          7885
Missouri           7178
Nevada             6932
Louisiana          6627
Kentucky           5351
Maine              5096
Indiana            4124
Arkansas           3933
New Mexico         3631
Utah               3046
Unspecified        2801
Wisconsin          2745
New Hampshire      2738
Virginia           2353
Idaho              2025
Oregon             1911
Michigan           1831
Wyoming            1672
Montana            1336
Iowa               1336
Oklahoma           1326
Nebraska            866
West Virginia       840
Kansas              667
Delaware            510
North Dakota        480
Alaska              430
Massachusetts       347
Vermont             300
South Dakota        244
Hawaii              118
Rhode Island         83
Puerto Rico          42
Washington DC         2
Name: state, dtype: int64
# 查看州和价格的关系,有关系
temp = pd.pivot_table(total_df, index='state', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
state
Minnesota 7885 27199.055168 19500.0
Connecticut 8276 29008.808603 23500.0
New York 8639 25582.237527 20000.0
Arizona 9364 31562.857753 23000.0
South Carolina 9951 29848.894583 22000.0
Pennsylvania 10234 25463.002736 19000.0
Alabama 10292 35438.541489 28000.0
Tennessee 10298 31845.357351 24000.0
North Carolina 10636 32161.773693 25000.0
New Jersey 11156 30982.188867 24500.0
Colorado 11529 31777.748287 25000.0
Illinois 11540 29091.581456 22500.0
Ohio 12369 28228.100493 22000.0
Mississippi 13240 32574.592145 25000.0
Maryland 13322 28621.682180 22000.0
Georgia 14633 32265.550468 24000.0
Washington 16222 27690.731722 21500.0
California 29761 29815.202715 22500.0
Texas 53110 32977.190341 25000.0
Florida 67320 34387.512775 27000.0
# 进一步按价格排序查看
temp.sort_values(by=[('median', 'SalePrice'), ('mean', 'SalePrice')])
count mean median
SalePrice SalePrice SalePrice
state
Indiana 4124 24400.357662 19000.0
Pennsylvania 10234 25463.002736 19000.0
Maine 5096 26176.942700 19500.0
Minnesota 7885 27199.055168 19500.0
New York 8639 25582.237527 20000.0
Kansas 667 28093.403298 20000.0
Puerto Rico 42 26011.904762 20250.0
Wisconsin 2745 27656.247723 21000.0
Idaho 2025 29263.234568 21000.0
Washington 16222 27690.731722 21500.0
Virginia 2353 28798.645984 21500.0
Ohio 12369 28228.100493 22000.0
Maryland 13322 28621.682180 22000.0
Michigan 1831 29003.923648 22000.0
Missouri 7178 29046.000697 22000.0
Kentucky 5351 29815.380303 22000.0
South Carolina 9951 29848.894583 22000.0
Vermont 300 27285.333333 22250.0
Arkansas 3933 28906.893974 22500.0
Illinois 11540 29091.581456 22500.0
California 29761 29815.202715 22500.0
Washington DC 2 22750.000000 22750.0
Massachusetts 347 28382.420749 23000.0
New Hampshire 2738 28928.743608 23000.0
Oregon 1911 30277.590267 23000.0
Delaware 510 31160.098039 23000.0
Arizona 9364 31562.857753 23000.0
Connecticut 8276 29008.808603 23500.0
Louisiana 6627 30201.893768 23500.0
Tennessee 10298 31845.357351 24000.0
Iowa 1336 31927.544910 24000.0
Oklahoma 1326 32258.710407 24000.0
Georgia 14633 32265.550468 24000.0
Montana 1336 32616.654192 24000.0
Nebraska 866 32612.066975 24250.0
New Jersey 11156 30982.188867 24500.0
Colorado 11529 31777.748287 25000.0
North Carolina 10636 32161.773693 25000.0
Mississippi 13240 32574.592145 25000.0
Wyoming 1672 32604.425837 25000.0
Texas 53110 32977.190341 25000.0
Hawaii 118 28879.237288 26000.0
Alaska 430 33281.976744 26000.0
New Mexico 3631 33632.649408 26000.0
Utah 3046 34190.547932 27000.0
Florida 67320 34387.512775 27000.0
Nevada 6932 36332.097519 27000.0
Unspecified 2801 34857.711532 28000.0
Alabama 10292 35438.541489 28000.0
North Dakota 480 39083.750000 29750.0
West Virginia 840 40258.750000 33000.0
Rhode Island 83 37622.289157 34000.0
South Dakota 244 43907.377049 35000.0

1.4.12 ProductGroupDesc

# 查看数据分布
total_df['ProductGroupDesc'].fillna('').value_counts()
Track Excavators       104230
Track Type Tractors     82582
Backhoe Loaders         81401
Wheel Loader            73216
Skid Steer Loaders      45011
Motor Graders           26258
Name: ProductGroupDesc, dtype: int64
# 查看和价格的关系
temp = pd.pivot_table(total_df, index='ProductGroupDesc', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
ProductGroupDesc
Motor Graders 26258 47561.957422 40000.0
Skid Steer Loaders 45011 10583.944014 10000.0
Wheel Loader 73216 37259.292422 32000.0
Backhoe Loaders 81401 20951.582327 20500.0
Track Type Tractors 82582 36280.132780 29500.0
Track Excavators 104230 35763.457018 29000.0

1.4.13 Drive_System

total_df['Drive_System'].fillna('').value_counts()
                    305611
Two Wheel Drive      47546
Four Wheel Drive     33551
No                   25166
All Wheel Drive        824
Name: Drive_System, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Drive_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Drive_System
All Wheel Drive 824 60455.279126 59000.0
No 25166 47342.721450 40000.0
Four Wheel Drive 33551 24534.149414 24000.0
Two Wheel Drive 47546 18418.026879 17500.0
305611 32532.703693 25000.0

1.4.14 Enclosure

total_df['Enclosure'].fillna('').value_counts()
OROPS                  177971
EROPS                  141769
EROPS w AC              92601
                          334
EROPS AC                   18
NO ROPS                     3
None or Unspecified         2
Name: Enclosure, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Enclosure', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Enclosure
None or Unspecified 2 16500.000000 16500.0
NO ROPS 3 44333.333333 42500.0
EROPS AC 18 23500.000000 20500.0
334 27689.446108 25000.0
EROPS w AC 92601 51671.169728 47000.0
EROPS 141769 28687.993278 23000.0
OROPS 177971 22592.082739 18000.0

1.4.15 Forks

total_df['Forks'].fillna('').value_counts()
                       214983
None or Unspecified    183061
Yes                     14654
Name: Forks, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Forks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Forks
Yes 14654 36761.354715 31500.0
None or Unspecified 183061 23593.414184 18500.0
214983 37327.174954 30000.0

1.4.16 Pad_Type

total_df['Pad_Type'].fillna('').value_counts()
                       331602
None or Unspecified     72395
Reversible               5950
Street                   2725
Grouser                    26
Name: Pad_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Pad_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Pad_Type
Grouser 26 30289.423077 30500.0
Street 2725 24995.064220 25000.0
Reversible 5950 27344.033613 27500.0
None or Unspecified 72395 20267.120354 20000.0
331602 33725.998897 26000.0

1.4.17 Ride_Control

total_df['Ride_Control'].fillna('').value_counts()
                       259970
No                      79389
None or Unspecified     64693
Yes                      8646
Name: Ride_Control, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Ride_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Ride_Control
Yes 8646 55278.269142 53000.0
None or Unspecified 64693 34834.650024 29000.0
No 79389 20765.720100 20000.0
259970 32705.232362 25000.0

1.4.18 Stick

total_df['Stick'].fillna('').value_counts()
            331602
Standard     49854
Extended     31242
Name: Stick, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Stick', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Stick
Extended 31242 23257.320914 23000.0
Standard 49854 19501.525113 19000.0
331602 33725.998897 26000.0

1.4.19 Transmission

total_df['Transmission'].fillna('').value_counts()
                       224691
Standard               143915
None or Unspecified     23889
Powershift              11991
Powershuttle             4286
Hydrostatic              3342
Direct Drive              422
Autoshift                 118
AutoShift                  44
Name: Transmission, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Transmission', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Transmission
AutoShift 44 102261.363636 95000.0
Autoshift 118 33141.949153 33250.0
Direct Drive 422 18675.829384 11550.0
Hydrostatic 3342 34705.814183 32000.0
Powershuttle 4286 19264.930938 18000.0
Powershift 11991 38574.352931 29500.0
None or Unspecified 23889 47354.812968 40000.0
Standard 143915 28364.132218 23000.0
224691 31117.253842 23000.0

1.4.20 Turbocharged

total_df['Turbocharged'].fillna('').value_counts()
                       331602
None or Unspecified     77111
Yes                      3985
Name: Turbocharged, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Turbocharged', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Turbocharged
Yes 3985 24144.300376 24000.0
None or Unspecified 77111 20783.276264 20000.0
331602 33725.998897 26000.0

1.4.21 Blade_Extension

total_df['Blade_Extension'].fillna('').value_counts()
                       386715
None or Unspecified     25406
Yes                       577
Name: Blade_Extension, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Blade_Extension', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Blade_Extension
Yes 577 66587.694974 65000.0
None or Unspecified 25406 47337.075022 40000.0
386715 30103.244279 23500.0

1.4.22 Blade_Width

total_df['Blade_Width'].fillna('').value_counts()
                       386715
14'                      9867
None or Unspecified      9521
12'                      5201
16'                       960
13'                       335
<12'                       99
Name: Blade_Width, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Blade_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Blade_Width
<12' 99 27764.646465 21000.0
13' 335 27125.522388 18500.0
16' 960 59882.119792 52750.0
12' 5201 36000.852144 29000.0
None or Unspecified 9521 41899.460141 32000.0
14' 9867 59347.223168 55000.0
386715 30103.244279 23500.0

1.4.23 Enclosure_Type

total_df['Enclosure_Type'].fillna('').value_counts()
                       386715
None or Unspecified     22469
Low Profile              2675
High Profile              839
Name: Enclosure_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Enclosure_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Enclosure_Type
High Profile 839 81897.437426 80000.0
Low Profile 2675 81444.658318 80000.0
None or Unspecified 22469 42480.324759 35000.0
386715 30103.244279 23500.0

1.4.24 Engine_Horsepower

total_df['Engine_Horsepower'].fillna('').value_counts()
            386715
No           24642
Variable      1341
Name: Engine_Horsepower, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Engine_Horsepower', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Engine_Horsepower
Variable 1341 90245.152871 88000.0
No 24642 45452.807321 38000.0
386715 30103.244279 23500.0

1.4.24 Hydraulics

total_df['Hydraulics'].fillna('').value_counts()
2 Valve                145317
Standard               106515
                        82565
Auxiliary               43224
Base + 1 Function       25511
3 Valve                  5807
4 Valve                  3077
Base + 3 Function         311
Base + 2 Function         132
Base + 5 Function          94
Base + 4 Function          81
Base + 6 Function          54
None or Unspecified        10
Name: Hydraulics, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Hydraulics
None or Unspecified 10 27400.000000 26250.0
Base + 6 Function 54 68333.333333 69500.0
Base + 4 Function 81 95253.086420 92500.0
Base + 5 Function 94 75601.063830 71000.0
Base + 2 Function 132 89424.242424 89500.0
Base + 3 Function 311 84141.800643 79000.0
4 Valve 3077 56652.551186 51000.0
3 Valve 5807 44479.309454 40000.0
Base + 1 Function 25511 46637.433970 39000.0
Auxiliary 43224 25076.746784 16000.0
82565 21028.900297 20500.0
Standard 106515 29391.410139 20500.0
2 Valve 145317 36145.196393 30000.0

1.4.25 Pushblock

total_df['Pushblock'].fillna('').value_counts()
                       386715
None or Unspecified     20017
Yes                      5966
Name: Pushblock, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Pushblock', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Pushblock
Yes 5966 68305.127724 65000.0
None or Unspecified 20017 41642.525653 33000.0
386715 30103.244279 23500.0

1.4.26 Ripper

total_df['Ripper'].fillna('').value_counts()
                       305753
None or Unspecified     85405
Yes                      8185
Multi Shank              8071
Single Shank             5284
Name: Ripper, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Ripper', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Ripper
Single Shank 5284 50139.366389 42000.0
Multi Shank 8071 48952.397472 41000.0
Yes 8185 63677.683079 60000.0
None or Unspecified 85405 35253.466486 28500.0
305753 28422.902101 22000.0

1.4.27 Scarifier

total_df['Scarifier'].fillna('').value_counts()
                       386704
None or Unspecified     13033
Yes                     12961
Name: Scarifier, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Scarifier', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Scarifier
Yes 12961 49523.266415 40000.0
None or Unspecified 13033 45996.798281 40000.0
386704 30103.375220 23500.0

1.4.28 Tip_Control

total_df['Tip_Control'].fillna('').value_counts()
                       386715
None or Unspecified     16832
Sideshift & Tip          7164
Tip                      1987
Name: Tip_Control, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Tip_Control', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Tip_Control
Tip 1987 62269.265727 57000.0
Sideshift & Tip 7164 46739.039084 41000.0
None or Unspecified 16832 46488.790459 37500.0
386715 30103.244279 23500.0

1.4.29 Tire_Size

total_df['Tire_Size'].fillna('').value_counts()
                       315060
None or Unspecified     47823
20.5                    15773
14"                      9111
23.5                     8760
26.5                     4635
17.5                     3971
29.5                     2767
17.5"                    1815
13"                       776
20.5"                     737
15.5                      610
15.5"                     463
23.5"                     309
7.0"                       56
23.1"                      20
10"                         9
10 inch                     3
Name: Tire_Size, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Tire_Size', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Tire_Size
10 inch 3 16333.333333 15500.0
10" 9 31166.666667 25000.0
23.1" 20 14837.500000 12125.0
7.0" 56 27625.000000 25500.0
23.5" 309 56487.944984 51000.0
15.5" 463 47401.727862 40000.0
15.5 610 17153.442623 15000.0
20.5" 737 74012.632293 68000.0
13" 776 21085.373711 16000.0
17.5" 1815 67478.650138 65000.0
29.5 2767 46650.813155 43000.0
17.5 3971 25344.856459 22500.0
26.5 4635 50903.011866 50000.0
23.5 8760 47236.347032 44000.0
14" 9111 50506.226869 44000.0
20.5 15773 41099.197997 38000.0
None or Unspecified 47823 35357.008218 27000.0
315060 28433.535937 22000.0

1.4.30 Coupler

total_df['Coupler'].fillna('').value_counts()
                       192019
None or Unspecified    190449
Manual                  23918
Hydraulic                6312
Name: Coupler, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Coupler', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Coupler
Hydraulic 6312 46839.729087 43000.0
Manual 23918 33493.829877 27000.0
None or Unspecified 190449 30392.948847 22500.0
192019 31233.255205 24000.0

1.4.31 Coupler_System

total_df['Coupler_System'].fillna('').value_counts()
                       367724
None or Unspecified     41727
Yes                      3247
Name: Coupler_System, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Coupler_System', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Coupler_System
Yes 3247 11593.701879 11500.0
None or Unspecified 41727 10504.809931 10000.0
367724 33738.521242 26000.0

1.4.32 Grouser_Tracks

total_df['Grouser_Tracks'].fillna('').value_counts()
                       367823
None or Unspecified     41820
Yes                      3055
Name: Grouser_Tracks, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Tracks', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Grouser_Tracks
Yes 3055 12871.268412 13000.0
None or Unspecified 41820 10410.951196 9750.0
367823 33732.896625 26000.0

1.4.33 Hydraulics_Flow

total_df['Hydraulics_Flow'].fillna('').value_counts()
                       367823
Standard                44251
High Flow                 597
None or Unspecified        27
Name: Hydraulics_Flow, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Hydraulics_Flow', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Hydraulics_Flow
None or Unspecified 27 15053.703704 15000.0
High Flow 597 13183.165829 13000.0
Standard 44251 10540.573185 10000.0
367823 33732.896625 26000.0

1.4.34 Track_Type

total_df['Track_Type'].fillna('').value_counts()
          310505
Steel      87463
Rubber     14730
Name: Track_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Track_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Track_Type
Rubber 14730 15823.905906 13500.0
Steel 87463 39245.894733 34000.0
310505 29683.235742 22500.0

1.4.35 Undercarriage_Pad_Width

total_df['Undercarriage_Pad_Width'].fillna('').value_counts()
                       309782
None or Unspecified     82444
32 inch                  5287
28 inch                  3152
24 inch                  2998
20 inch                  2664
30 inch                  1602
36 inch                  1544
18 inch                  1439
34 inch                   540
16 inch                   481
31 inch                   191
27 inch                   144
22 inch                   135
26 inch                    98
33 inch                    94
14 inch                    51
15 inch                    33
25 inch                    17
31.5 inch                   2
Name: Undercarriage_Pad_Width, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Undercarriage_Pad_Width', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Undercarriage_Pad_Width
31.5 inch 2 108000.000000 108000.0
25 inch 17 22591.176471 20000.0
15 inch 33 16803.030303 15000.0
14 inch 51 16869.607843 14500.0
33 inch 94 51848.936170 47500.0
26 inch 98 30831.632653 28750.0
22 inch 135 33398.518519 26500.0
27 inch 144 33826.041667 28625.0
31 inch 191 48378.795812 44500.0
16 inch 481 20531.702703 17500.0
34 inch 540 61289.722222 60000.0
18 inch 1439 20302.710215 18000.0
36 inch 1544 46214.297280 41000.0
30 inch 1602 36943.008739 30000.0
20 inch 2664 29552.346096 27500.0
24 inch 2998 37054.419613 31500.0
28 inch 3152 37732.447652 33500.0
32 inch 5287 48901.324002 45500.0
None or Unspecified 82444 35017.391514 28000.0
309782 29688.370739 22500.0

1.4.36 Stick_Length

total_df['Stick_Length'].fillna('').value_counts()
                       310437
None or Unspecified     81539
9' 6"                    5832
10' 6"                   3519
11' 0"                   1601
9' 10"                   1463
9' 8"                    1462
9' 7"                    1423
12' 10"                  1087
10' 2"                   1004
8' 6"                     908
8' 2"                     614
10' 10"                   414
12' 8"                    322
11' 10"                   307
8' 4"                     274
8' 10"                    104
12' 4"                    103
9' 5"                     101
15' 9"                     87
6' 3"                      51
13' 7"                     11
14' 1"                      7
13' 10"                     7
13' 9"                      7
19' 8"                      5
7' 10"                      3
15' 4"                      3
24' 3"                      2
9' 2"                       1
Name: Stick_Length, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Stick_Length', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Stick_Length
15' 9" 87 57068.965517 55000.0
9' 5" 101 46074.257426 46000.0
12' 4" 103 47505.339806 40000.0
8' 10" 104 43937.500000 42750.0
8' 4" 274 33362.226277 28000.0
11' 10" 307 48237.785016 42500.0
12' 8" 322 58167.701863 56000.0
10' 10" 414 60149.275362 57000.0
8' 2" 614 33933.306189 31000.0
8' 6" 908 36311.949339 31000.0
10' 2" 1004 45855.229084 44000.0
12' 10" 1087 66730.174793 66000.0
9' 7" 1423 56526.528461 55000.0
9' 8" 1462 49643.296854 47000.0
9' 10" 1463 39891.148325 36500.0
11' 0" 1601 48882.151780 45000.0
10' 6" 3519 53677.642796 50000.0
9' 6" 5832 46655.847051 42500.0
None or Unspecified 81539 32556.659697 26000.0
310437 29683.170383 22500.0

1.4.37 Thumb

total_df['Thumb'].fillna('').value_counts()
                       310366
None or Unspecified     85074
Manual                   9678
Hydraulic                7580
Name: Thumb, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Thumb', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Thumb
Hydraulic 7580 38316.014644 33000.0
Manual 9678 39451.081318 34000.0
None or Unspecified 85074 35234.459047 28000.0
310366 29683.224368 22500.0

1.4.38 Pattern_Changer

total_df['Pattern_Changer'].fillna('').value_counts()
                       310437
None or Unspecified     92924
Yes                      9269
No                         68
Name: Pattern_Changer, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Pattern_Changer', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Pattern_Changer
No 68 29981.617647 27500.0
Yes 9269 52301.261085 52000.0
None or Unspecified 92924 34230.870776 28000.0
310437 29683.170383 22500.0

1.4.39 Grouser_Type

total_df['Grouser_Type'].fillna('').value_counts()
          310505
Double     86998
Triple     15193
Single         2
Name: Grouser_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Grouser_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Grouser_Type
Single 2 52000.000000 52000.0
Triple 15193 42755.055947 37500.0
Double 86998 34667.098784 28000.0
310505 29683.235742 22500.0

1.4.40 Backhoe_Mounting

total_df['Backhoe_Mounting'].fillna('').value_counts()
                       331986
None or Unspecified     80692
Yes                        20
Name: Backhoe_Mounting, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Backhoe_Mounting', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Backhoe_Mounting
Yes 20 16462.500000 15500.0
None or Unspecified 80692 36486.290155 30000.0
331986 29934.882688 22500.0

1.4.41 Blade_Type

total_df['Blade_Type'].fillna('').value_counts()
                       330823
PAT                     39633
Straight                13461
None or Unspecified     11841
Semi U                   8907
VPAT                     3681
U                        1888
Angle                    1684
No                        743
Landfill                   26
Coal                       11
Name: Blade_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Blade_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Blade_Type
Coal 11 50227.272727 48000.0
Landfill 26 69942.307692 63750.0
No 743 25569.078062 21000.0
Angle 1684 32742.923990 25500.0
U 1888 44536.493644 35000.0
VPAT 3681 62566.143711 59000.0
Semi U 8907 58495.607387 55000.0
None or Unspecified 11841 31077.752149 25000.0
Straight 13461 35548.990120 28000.0
PAT 39633 30679.544117 27000.0
330823 29949.806359 22500.0

1.4.42 Travel_Controls

total_df['Travel_Controls'].fillna('').value_counts()
                       330821
None or Unspecified     71447
Differential Steer       5257
Finger Tip               2693
2 Pedal                  1144
Lever                     902
Pedal                     423
1 Speed                    11
Name: Travel_Controls, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Travel_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Travel_Controls
1 Speed 11 15672.727273 11500.0
Pedal 423 24667.966903 22500.0
Lever 902 33187.900222 26000.0
2 Pedal 1144 25588.002622 21000.0
Finger Tip 2693 57974.053101 55000.0
Differential Steer 5257 68752.648849 67500.0
None or Unspecified 71447 33407.344104 27500.0
330821 29950.385598 22500.0

1.4.43 Differential_Type

total_df['Differential_Type'].fillna('').value_counts()
                341134
Standard         70169
Limited Slip      1181
No Spin            212
Locking              2
Name: Differential_Type, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Differential_Type', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Differential_Type
Locking 2 66000.000000 66000.0
No Spin 212 52889.386792 51475.0
Limited Slip 1181 57566.553768 57000.0
Standard 70169 37061.729952 32000.0
341134 29907.683667 22500.0

1.4.44 Steering_Controls

total_df['Steering_Controls'].fillna('').value_counts()
                       341176
Conventional            70774
Command Control           594
Four Wheel Standard       139
Wheel                      14
No                          1
Name: Steering_Controls, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='Steering_Controls', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
Steering_Controls
No 1 17500.000000 17500.0
Wheel 14 21517.857143 17500.0
Four Wheel Standard 139 24658.273381 22000.0
Command Control 594 74774.410774 75000.0
Conventional 70774 37168.659098 32000.0
341176 29907.455419 22500.0

1.4.45 fiManufacturerDesc

total_df['fiManufacturerID'].fillna('').value_counts()
26      169003
43       74527
25       42142
103      38928
121      25033
         ...  
5            2
923          1
1518         1
112          1
525          1
Name: fiManufacturerID, Length: 104, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='fiManufacturerID', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
fiManufacturerID
405 1070 10707.691589 9500.0
129 1117 14373.209490 10000.0
86 1178 34583.191851 30000.0
46 1344 17102.771577 15000.0
158 1378 30334.216255 26000.0
95 1628 34679.652948 31000.0
166 1906 23662.644281 20000.0
135 2553 18356.462593 14500.0
54 2585 18324.776402 15000.0
750 3305 12966.822995 11500.0
176 4224 43841.872633 40000.0
99 5087 33323.884411 28000.0
92 5260 21678.645437 18500.0
55 5902 13283.747035 11750.0
74 11291 33328.462935 27000.0
121 25033 10637.495067 10000.0
103 38928 33710.587988 27500.0
25 42142 21210.952589 18000.0
43 74527 27908.815087 23000.0
26 169003 40321.710425 33000.0

1.4.46 PrimarySizeBasis

total_df['PrimarySizeBasis'].fillna('').value_counts()
Horsepower                     180262
Weight - Metric Tons           105019
Standard Digging Depth - Ft     77848
Operating Capacity - Lbs        44226
                                 5259
Model                              78
Weight - Metric                     4
Weight - Lbs                        1
Cutting Width - Inches              1
Name: PrimarySizeBasis, dtype: int64
temp = pd.pivot_table(total_df.fillna(''), index='PrimarySizeBasis', values='SalePrice', aggfunc=['count', 'mean', 'median'])
temp.sort_values(by=('count', 'SalePrice')).tail(20)
count mean median
SalePrice SalePrice SalePrice
PrimarySizeBasis
Cutting Width - Inches 1 23000.000000 23000.0
Weight - Lbs 1 27500.000000 27500.0
Weight - Metric 4 22562.500000 22750.0
Model 78 25535.256410 17750.0
5259 15941.918616 14000.0
Operating Capacity - Lbs 44226 10612.493872 10000.0
Standard Digging Depth - Ft 77848 21202.527078 21000.0
Weight - Metric Tons 105019 35654.973148 29000.0
Horsepower 180262 38455.691062 32000.0

1.4.47 PrimaryLower

total_df['PrimaryLower'].fillna('').value_counts()
14.0       62037
20.0       17943
130.0      17113
150.0      15322
85.0       15054
           ...  
1.8            3
4.5            2
25000.0        1
2.7            1
300.0          1
Name: PrimaryLower, Length: 76, dtype: int64

**总结:**发现的规律如下:

  1. 从EXCEL表中可以明显看出部分特征是要么一起出现,要么一起空,这些可以在创建衍生特征时做成交叉特征。

  2. 部分特征虽然是object字段,但是能排出顺序,比如“Low、Medium、High”等,和‘xx inch’等,后面处理成定序变量。

  3. 空值都单独用一列标记。

  4. 没有特殊规律的类别特征统一用Label Encoder编码。

  5. 与价格的透视表可以为分箱提供依据。

EDA到此堂堂完结!!!!!!!!!!!!!!!!!!!!!!!

2. 数据清洗+数据预处理

2.1 创建衍生变量

2.1.1 SaleDate

def to_sin(n, i):
    return round(np.sin((2*np.pi/n)*(i-1)+(2*np.pi/7)) + 1, 2)
total_df['SaleYear'] = total_df['SaleDate'].dt.year
total_df['SaleMonth'] = total_df['SaleDate'].dt.month
total_df['SaleDay'] = total_df['SaleDate'].dt.day
total_df['SaleDayOfWeek'] = total_df['SaleDate'].dt.dayofweek
total_df['SaleDayOfYear'] = total_df['SaleDate'].dt.dayofyear
# 删除原来的SaleDate特征
total_df.drop('SaleDate', axis=1, inplace=True)
# 尝试余弦化拉近1月和12月的距离
total_df['SaleDayOfWeek_sin'] = total_df['SaleDayOfWeek'].map(lambda x: to_sin(7, x)).value_counts()
total_df['SaleMonth_sin'] = total_df['SaleMonth'].map(lambda x: to_sin(12, x)).value_counts()

2.1.2 Stick、Turbocharged

def combine_features(df, col_list):
    temp = df[col_list[0]].astype(str)
    for col in col_list[1:]:
        temp += df[col].astype(str)
    return temp
total_df['Stick__Turbocharged'] = combine_features(total_df, ['Stick', 'Turbocharged'])
total_df['Stick__Turbocharged'].value_counts()
nannan                         331602
StandardNone or Unspecified     47981
ExtendedNone or Unspecified     29130
ExtendedYes                      2112
StandardYes                      1873
Name: Stick__Turbocharged, dtype: int64

2.1.3 Blade_Extension、Blade_Width、Enclosure_Type、Engine_Horsepower

total_df['Blade__Blade__Enclosure__Engine'] = combine_features(total_df, ['Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower'])
total_df['Blade__Blade__Enclosure__Engine'].value_counts()
nannannannan                                                         386715
None or UnspecifiedNone or UnspecifiedNone or UnspecifiedNo            8572
None or Unspecified14'None or UnspecifiedNo                            7307
None or Unspecified12'None or UnspecifiedNo                            4466
None or Unspecified14'Low ProfileNo                                    1088
None or Unspecified16'None or UnspecifiedNo                             818
None or UnspecifiedNone or UnspecifiedLow ProfileNo                     435
None or Unspecified12'Low ProfileNo                                     416
None or Unspecified14'Low ProfileVariable                               377
None or Unspecified14'High ProfileNo                                    348
None or Unspecified14'None or UnspecifiedVariable                       343
None or Unspecified13'None or UnspecifiedNo                             313
None or UnspecifiedNone or UnspecifiedNone or UnspecifiedVariable       190
Yes14'None or UnspecifiedNo                                             126
None or Unspecified14'High ProfileVariable                              122
None or UnspecifiedNone or UnspecifiedHigh ProfileNo                    112
None or Unspecified<12'None or UnspecifiedNo                             99
None or Unspecified12'High ProfileNo                                     96
YesNone or UnspecifiedNone or UnspecifiedNo                              86
Yes12'None or UnspecifiedNo                                              84
None or UnspecifiedNone or UnspecifiedLow ProfileVariable                74
None or Unspecified16'Low ProfileNo                                      58
Yes14'Low ProfileNo                                                      55
Yes14'Low ProfileVariable                                                50
Yes12'Low ProfileNo                                                      47
None or Unspecified12'High ProfileVariable                               43
None or Unspecified16'High ProfileNo                                     34
None or UnspecifiedNone or UnspecifiedHigh ProfileVariable               27
None or Unspecified12'Low ProfileVariable                                21
Yes14'High ProfileVariable                                               21
Yes14'None or UnspecifiedVariable                                        18
None or Unspecified12'None or UnspecifiedVariable                        17
None or Unspecified13'Low ProfileNo                                      16
Yes16'Low ProfileNo                                                      15
Yes14'High ProfileNo                                                     12
Yes16'None or UnspecifiedNo                                              10
YesNone or UnspecifiedLow ProfileNo                                       9
YesNone or UnspecifiedNone or UnspecifiedVariable                         9
Yes16'High ProfileNo                                                      6
Yes12'High ProfileNo                                                      6
YesNone or UnspecifiedLow ProfileVariable                                 4
None or Unspecified16'High ProfileVariable                                4
None or Unspecified16'None or UnspecifiedVariable                         4
None or Unspecified16'Low ProfileVariable                                 4
Yes13'None or UnspecifiedNo                                               4
Yes16'Low ProfileVariable                                                 3
Yes12'Low ProfileVariable                                                 3
YesNone or UnspecifiedHigh ProfileNo                                      3
Yes12'High ProfileVariable                                                2
Yes16'High ProfileVariable                                                2
Yes16'None or UnspecifiedVariable                                         2
None or Unspecified13'None or UnspecifiedVariable                         1
None or Unspecified13'High ProfileNo                                      1
Name: Blade__Blade__Enclosure__Engine, dtype: int64

2.1.4 Pushblock、Scarifier、Tip_Control

total_df['Pushblock__Scarifier__TipControl'] = combine_features(total_df, ['Pushblock', 'Scarifier', 'Tip_Control'])
total_df['Pushblock__Scarifier__TipControl'].value_counts()
nannannan                                                    386704
None or UnspecifiedYesNone or Unspecified                      6742
None or UnspecifiedNone or UnspecifiedNone or Unspecified      6228
None or UnspecifiedYesSideshift & Tip                          3088
YesNone or UnspecifiedNone or Unspecified                      2560
None or UnspecifiedNone or UnspecifiedSideshift & Tip          2401
YesYesNone or Unspecified                                      1302
YesNone or UnspecifiedSideshift & Tip                          1182
None or UnspecifiedYesTip                                      1103
YesYesSideshift & Tip                                           493
None or UnspecifiedNone or UnspecifiedTip                       455
YesYesTip                                                       226
YesNone or UnspecifiedTip                                       203
nanYesnan                                                         7
nanNone or Unspecifiednan                                         4
Name: Pushblock__Scarifier__TipControl, dtype: int64

2.1.5 Coupler_System、Grouser_Tracks、Hydraulics_Flow

total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'] = combine_features(total_df, ['Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow'])
total_df['CouplerSystem__GrouserTracks__HydraulicsFlow'].value_counts()
nannannan                                                    367724
None or UnspecifiedNone or UnspecifiedStandard                39040
YesNone or UnspecifiedStandard                                 2289
None or UnspecifiedYesStandard                                 2149
YesYesStandard                                                  773
None or UnspecifiedNone or UnspecifiedHigh Flow                 364
YesNone or UnspecifiedHigh Flow                                 122
None or Unspecifiednannan                                        91
None or UnspecifiedYesHigh Flow                                  57
YesYesHigh Flow                                                  54
None or UnspecifiedYesNone or Unspecified                        22
Yesnannan                                                         8
None or UnspecifiedNone or UnspecifiedNone or Unspecified         4
YesNone or UnspecifiedNone or Unspecified                         1
Name: CouplerSystem__GrouserTracks__HydraulicsFlow, dtype: int64

2.1.6 Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type

col_list = ('Track_Type	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type').split('\t')
total_df['TUSTPG'] = combine_features(total_df, col_list)
total_df['TUSTPG'].value_counts()
nannannannannannan                                                                          309643
SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble      38336
RubberNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedDouble     10031
SteelNone or UnspecifiedNone or UnspecifiedManualNone or UnspecifiedDouble                    3900
SteelNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedNone or UnspecifiedTriple       3214
                                                                                             ...  
Steel32 inch8' 2"ManualNone or UnspecifiedDouble                                                 1
Steel30 inch12' 10"ManualNone or UnspecifiedTriple                                               1
Steel20 inch8' 4"ManualNone or UnspecifiedTriple                                                 1
Steel32 inch8' 4"None or UnspecifiedNone or UnspecifiedTriple                                    1
Rubber30 inchNone or UnspecifiedHydraulicNone or UnspecifiedDouble                               1
Name: TUSTPG, Length: 1065, dtype: int64

2.1.7 Backhoe_Mounting Blade_Type Travel_Controls

col_list = ('Backhoe_Mounting	Blade_Type	Travel_Controls').split('\t')
total_df['BBT'] = combine_features(total_df, col_list)
total_df['BBT'].value_counts()
nannannan                                                    330802
None or UnspecifiedPATNone or Unspecified                     37249
None or UnspecifiedStraightNone or Unspecified                11990
None or UnspecifiedNone or UnspecifiedNone or Unspecified     11106
None or UnspecifiedSemi UNone or Unspecified                   6582
None or UnspecifiedSemi UDifferential Steer                    2122
None or UnspecifiedVPATFinger Tip                              1857
None or UnspecifiedUNone or Unspecified                        1816
None or UnspecifiedAngleNone or Unspecified                    1560
None or UnspecifiedVPATNone or Unspecified                     1092
None or UnspecifiedPATDifferential Steer                       1011
None or UnspecifiedStraightDifferential Steer                   913
nanNo2 Pedal                                                    728
None or UnspecifiedVPATDifferential Steer                       708
None or UnspecifiedPATFinger Tip                                584
nanStraight2 Pedal                                              416
None or UnspecifiedPATLever                                     412
None or UnspecifiedPATPedal                                     370
None or UnspecifiedNone or UnspecifiedDifferential Steer        327
None or UnspecifiedNone or UnspecifiedLever                     285
None or UnspecifiedAngleDifferential Steer                      109
None or UnspecifiedSemi UFinger Tip                              97
None or UnspecifiedSemi ULever                                   96
None or UnspecifiedNone or UnspecifiedFinger Tip                 77
None or UnspecifiedStraightFinger Tip                            76
None or UnspecifiedStraightLever                                 56
None or UnspecifiedUDifferential Steer                           54
None or UnspecifiedNone or UnspecifiedPedal                      33
None or UnspecifiedVPATLever                                     24
nanNonan                                                         15
None or UnspecifiedLandfillNone or Unspecified                   14
None or UnspecifiedULever                                        14
YesNone or UnspecifiedNone or Unspecified                        13
None or UnspecifiedAngleLever                                    13
None or UnspecifiedLandfillDifferential Steer                    12
nannan1 Speed                                                    11
nannanNone or Unspecified                                        10
None or UnspecifiedSemi UPedal                                   10
None or UnspecifiedCoalNone or Unspecified                        8
YesPATNone or Unspecified                                         7
None or UnspecifiedStraightPedal                                  7
nanStraightnan                                                    3
None or UnspecifiedUPedal                                         2
None or UnspecifiedCoalLever                                      2
None or UnspecifiedAnglePedal                                     1
nanUnan                                                           1
None or UnspecifiedUFinger Tip                                    1
None or UnspecifiedAngleFinger Tip                                1
None or UnspecifiedCoalDifferential Steer                         1
Name: BBT, dtype: int64

2.1.8 Differential_Type Steering_Controls

col_list = ('Differential_Type	Steering_Controls').split('\t')
total_df['DS'] = combine_features(total_df, col_list)
total_df['DS'].value_counts()
nannan                         341120
StandardConventional            69411
Limited SlipConventional         1154
StandardCommand Control           564
No SpinConventional               209
StandardFour Wheel Standard       139
Standardnan                        54
Limited SlipCommand Control        27
nanWheel                           14
No SpinCommand Control              3
Lockingnan                          2
StandardNo                          1
Name: DS, dtype: int64

2.1.9 PrimarySizeBasis PrimaryLower

col_list = ('PrimarySizeBasis	PrimaryLower').split('\t')
total_df['PP'] = combine_features(total_df, col_list)
total_df['PP'].value_counts()
Standard Digging Depth - Ft14.0    57395
Horsepower20.0                     17941
Horsepower130.0                    17113
Horsepower150.0                    15299
Horsepower85.0                     15054
                                   ...  
Weight - Metric Tons20.0               2
Weight - Lbs25000.0                    1
Cutting Width - Inches0.0              1
Weight - Metric Tons2.7                1
Weight - Metric Tons300.0              1
Name: PP, Length: 90, dtype: int64

2.1.10 fiBaseModel

def is_hybrid(x):
    try:
        return type(eval(x)) == str
    except:
        return True
total_df['fiBaseModel_is_hybrid'] = total_df['fiBaseModel'].map(is_hybrid)

2.1.11 ProductGroupDesc

result = total_df['ProductGroupDesc'].str.split(' ')
total_df['ProductGroupDesc_father'] = [i[-1] for i in result]    
total_df['ProductGroupDesc_father'].value_counts()
Loaders       126412
Excavators    104230
Tractors       82582
Loader         73216
Graders        26258
Name: ProductGroupDesc_father, dtype: int64

2.2 独热编码

def to_dummies(df, col_name):
    name_dict = {value: col_name + '_is_' + str(value) for value in df[col_name].unique()}
    return pd.get_dummies(df[col_name]).rename(columns=name_dict)
# col_names = ['CouplerSystem__GrouserTracks__HydraulicsFlow']

# for col in col_names:
#     total_df = pd.concat([total_df, to_dummies(total_df, col)], axis=1)

# total_df.drop(columns=col_names, inplace=True)

感觉没有明显提升,暂时不做了。

2.3 标签编码

2.3.1 UsageBand

usageband_dict = {
    'Low': 1,
    'Medium': 2,
    'High': 3
}
total_df['UsageBand'] = total_df['UsageBand'].map(usageband_dict).fillna(0)
total_df['UsageBand'].value_counts()
0.0    339028
2.0     35832
1.0     25311
3.0     12527
Name: UsageBand, dtype: int64

2.3.2 ProductSize

total_df['ProductSize_is_missing'] = total_df['ProductSize'].isna()
total_df['ProductSize'].value_counts()
Medium            64342
Large / Medium    51297
Small             27057
Mini              25721
Large             21396
Compact            6280
Name: ProductSize, dtype: int64
ProductSize_dict = {
    'Compact': 1,
    'Mini': 2,
    'Small': 3,
    'Medium': 4,
    'Large / Medium': 5,
    'Large': 6,
}
total_df['ProductSize'] = total_df['ProductSize'].map(ProductSize_dict).fillna(0)
total_df['ProductSize'].value_counts()
0.0    216605
4.0     64342
5.0     51297
3.0     27057
2.0     25721
6.0     21396
1.0      6280
Name: ProductSize, dtype: int64

2.3.3 Undercarriage_Pad_Width

total_df['Undercarriage_Pad_Width_is_missing'] = total_df['Undercarriage_Pad_Width'].isna()
total_df['Undercarriage_Pad_Width'].value_counts()
None or Unspecified    82444
32 inch                 5287
28 inch                 3152
24 inch                 2998
20 inch                 2664
30 inch                 1602
36 inch                 1544
18 inch                 1439
34 inch                  540
16 inch                  481
31 inch                  191
27 inch                  144
22 inch                  135
26 inch                   98
33 inch                   94
14 inch                   51
15 inch                   33
25 inch                   17
31.5 inch                  2
Name: Undercarriage_Pad_Width, dtype: int64
total_df['Undercarriage_Pad_Width'] = total_df['Undercarriage_Pad_Width'].fillna('0 inch').map(lambda x: 1 if x == 'None or Unspecified' else eval(x[:-4]))
total_df['Undercarriage_Pad_Width'].value_counts()
0.0     309782
1.0      82444
32.0      5287
28.0      3152
24.0      2998
20.0      2664
30.0      1602
36.0      1544
18.0      1439
34.0       540
16.0       481
31.0       191
27.0       144
22.0       135
26.0        98
33.0        94
14.0        51
15.0        33
25.0        17
31.5         2
Name: Undercarriage_Pad_Width, dtype: int64

2.3.4 Stick_Length

total_df['Stick_Length_is_missing'] = total_df['Stick_Length'].isna()
total_df['Stick_Length'].value_counts()
None or Unspecified    81539
9' 6"                   5832
10' 6"                  3519
11' 0"                  1601
9' 10"                  1463
9' 8"                   1462
9' 7"                   1423
12' 10"                 1087
10' 2"                  1004
8' 6"                    908
8' 2"                    614
10' 10"                  414
12' 8"                   322
11' 10"                  307
8' 4"                    274
8' 10"                   104
12' 4"                   103
9' 5"                    101
15' 9"                    87
6' 3"                     51
13' 7"                    11
14' 1"                     7
13' 10"                    7
13' 9"                     7
19' 8"                     5
7' 10"                     3
15' 4"                     3
24' 3"                     2
9' 2"                      1
Name: Stick_Length, dtype: int64
result = []

for i in total_df['Stick_Length'].fillna(0):
    if  i == 0:
        pass
    elif i == 'None or Unspecified':
        i = 1
    else:
        a = i.find("'")
        b = i.find('"')
        i = round(int(i[:a]) + int(i[a+2:b])/12, 2)
    result.append(i)
total_df['Stick_Length'] = result
total_df['Stick_Length'].value_counts()
0.00     310437
1.00      81539
9.50       5832
10.50      3519
11.00      1601
9.83       1463
9.67       1462
9.58       1423
12.83      1087
10.17      1004
8.50        908
8.17        614
10.83       414
12.67       322
11.83       307
8.33        274
8.83        104
12.33       103
9.42        101
15.75        87
6.25         51
13.58        11
14.08         7
13.83         7
13.75         7
19.67         5
7.83          3
15.33         3
24.25         2
9.17          1
Name: Stick_Length, dtype: int64

2.3.2 其余随机标签编码

le = LabelEncoder()
for col in total_df.columns:
    # 标记空值(事实证明用处不大。。。,只有一列标记派上了用场)
    if total_df[col].isna().sum() > 1000:
        total_df[col + '_is_missing'] = total_df[col].isna()
    # 编码
    if total_df[col].dtype == 'object':
        total_df[col] = le.fit_transform(total_df[col])

2.4 空值和异常值

# 异常值173
total_df['datasource'].map(lambda x: 172 if x == 173 else x)
0         132
1         132
2         132
3         132
4         132
         ... 
412693    149
412694    149
412695    149
412696    149
412697    149
Name: datasource, Length: 412698, dtype: int64
# 这一列的空值用众数填充
total_df['auctioneerID'].fillna(total_df['auctioneerID'].mode()[0], inplace=True)
total_df['MachineHoursCurrentMeter'].fillna(0, inplace=True)
total_df['PrimaryLower'].fillna(0, inplace=True)
total_df.info()

Int64Index: 412698 entries, 0 to 412697
Columns: 109 entries, SalePrice to SaleMonth_sin_is_missing
dtypes: bool(40), float64(10), int32(50), int64(9)
memory usage: 157.4 MB

2.5 删除重复列

def drop_dup_cols(df):
    cols = df.columns[60:]
    n = len(cols)
    for i in range(n-1):
        if len(df[df[cols[i]] != df[cols[i+1]]]) < 1000:
            df.drop(columns=[cols[i]], inplace=True)
drop_dup_cols(total_df)

3. 建模

# 以DataFrame的形式展现特征重要性
def show_feature_imp(model, df):
    return pd.DataFrame({df.columns[i]: model.feature_importances_[i] for i in range(len(df.columns))}, index=['imp']).T.sort_values(by='imp', ascending=False)
# 展示两个特征重要性表的共同倒数n%的特征
def show_low_features(df1, df2, ratio):
    n = int(len(df1) * ratio)
    df1_features = df1.tail(n).index
    df2_features = df2.tail(n).index
    return n, {feature: [df1.loc[feature].values[0], df2.loc[feature].values[0]] for feature in df1_features if feature in df2_features}
# 以字典的形式显示分数,可显示的分数有:R平方、MAE、MSE、RMSLE
def show_scores(y_test, y_pred):
    scores_dict = {}
    scores_dict['R2'] = r2_score(y_test, y_pred)
    scores_dict['MAE'] = mean_absolute_error(y_test, y_pred)
    scores_dict['MSE'] = mean_squared_error(y_test, y_pred)
    scores_dict['RMSLE'] = np.sqrt(mean_squared_log_error(y_test, y_pred))
    return scores_dict
# 被后续的逐步回归筛出来的特征
drop_list = ['SalePrice', 'MachineID', 'PrimaryLower_is_missing',  
             'Drive_System_is_missing', 'Forks_is_missing', 'ProductSize_is_missing',
             'Ride_Control_is_missing', 'Stick_is_missing', 'Stick_Length_is_missing', 'Transmission_is_missing', 
             'Turbocharged_is_missing', 'Engine_Horsepower_is_missing', 'Hydraulics_is_missing', 
             'Pad_Type_is_missing', 'Ripper_is_missing', 'Tip_Control_is_missing', 'Tire_Size_is_missing', 
             'Coupler_is_missing', 'Hydraulics_Flow_is_missing', 'Grouser_Type_is_missing', 
             'Backhoe_Mounting_is_missing', 'Travel_Controls_is_missing', 'Steering_Controls_is_missing', 
             'Pushblock_is_missing', 'SaleDayOfWeek', 'SaleMonth', 'Differential_Type', 
             'Stick__Turbocharged', 'SaleDayOfYear', 'MachineHoursCurrentMeter_is_missing', 
             'Scarifier', 'Backhoe_Mounting', 'Forks', 'Track_Type', 'Engine_Horsepower', 
             'Blade_Extension', 'Coupler_System', 'SaleMonth_sin', 'SaleDayOfWeek_sin', 'SaleMonth_sin_is_missing',
             'fiBaseModel_is_hybrid', 'Hydraulics_Flow', 'fiModelDescriptor', 'SaleDay', 'Hydraulics', 'Pushblock', 'Transmission',
             'PrimarySizeBasis', 'Stick','DS', 'Tip_Control', 'Grouser_Tracks', 'fiSecondaryDesc', 'fiModelSeries'
            ]
# 划分训练集、验证集
X = total_df.drop(drop_list, axis=1)
y = total_df['SalePrice']

X_train = X[X['SaleYear']<2012]
X_test = X[X['SaleYear']>=2012]
y_train = y[X['SaleYear']<2012]
y_test = y[X['SaleYear']>=2012]
raise KeyError
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Input In [1044], in ()
----> 1 raise KeyError


KeyError: 

3.1 RF

# rfr = RandomForestRegressor(
#     n_estimators = 100,
#     max_depth = 23,
#     min_samples_leaf = 2,
#     min_samples_split = 4,
#     n_jobs = -1,
#     random_state = 10
# )
# rfr.fit(X_train, y_train)
# show_scores(y_test, rfr.predict(X_test))
# rfr_imp = show_feature_imp(rfr, X_train)
# rfr_imp.head(20)
# rfr_grid = {
#     'n_estimators': [100, 150, 200],
#     'max_depth': [15, 20, 25],
#     'max_features': [0.5, 0.6, 0.7],
#     'min_samples_leaf': [2],
#     'min_samples_split': [4],
#     'max_samples': [2000],
#     'n_jobs': [-1]  
# }
# gscv = GridSearchCV(RandomForestRegressor(), param_grid=rfr_grid, cv=5, verbose=True)
# gscv.fit(X_train, y_train)

3.2 XGB

xgb_grid = {
    'n_estimators': [200],
    'learning_rate': [0.0835],
    'reg_lambda': [0.48, 0.5, 0.52]
}
gscv = GridSearchCV(XGBRegressor(), param_grid=xgb_grid)
gscv.fit(X_train, y_train)
gscv.best_params_
xgb = XGBRegressor(
    n_estimators = 250, # 这个很重要
    max_depth = 11, # 这个很重要
    learning_rate = 0.11, # 这个很重要
    reg_lambda = 0.2, # 这个比较重要
    n_jobs=-1,
    random_state = 13
)
xgb.fit(X_train, y_train)
show_scores(y_test, xgb.predict(X_test))
{'R2': 0.884087856079633,
 'MAE': 5745.138324664967,
 'MSE': 79616854.10269143,
 'RMSLE': 0.23743323282056006}

总结:

调参的时候自己要先清楚哪些参数对模型影响大,最好用gscv来两次,第一次先摸清大概范围,第二次再精确到小数点后1-3位。

4. 特征筛选

逐步回归筛选特征(注意,后续提到的逐步回归都以本文自定义的“青春版”逐步回归函数为准)

  1. 后向逐步回归:按照给定的特征列表(按照之前fit一遍的特征重要性排升序)遍历。每丢一个特征计算一次RMSLE,如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。

  2. 前向逐步回归:按照给定的特征列表(按照之前fit一遍的特征重要性排序)遍历。每丢一个特征计算一次RMSLE,如果分降了就保持丢弃这个特征。可以设定模型、遍历次数、丢弃特征的RMSLE下降阈值。

  3. 如果希望更稳重,不要如此轻易就丢弃一个特征,可选的函数改进方式是改为二重循环:每一轮都将每个特征遍历一遍,计算前后的得分(或者误差)差值gain,剔除gain最小/大的特征,然后再进入下一轮,直到达到最大迭代次数或gain达到阈值。

def step_wise_reg(model, X_train, X_test, y_train, y_test, suspicion_list, method='backward', tol=0, show_info=True, dilution_ratio=1):
    X_train = X_train[::dilution_ratio]
    y_train = y_train[::dilution_ratio]
    
    if method == 'backward':
        drop_list = [] 
    elif method == 'forward':
        drop_list = list(suspicion_list)
    else:
        raise KeyError
    
    
    model.fit(X_train.drop(columns=drop_list), y_train)
    old_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))

    for last_feature in suspicion_list:
        
        if method == 'backward':
            drop_list.append(last_feature)
        else:
            drop_list.remove(last_feature)

        model.fit(X_train.drop(columns=drop_list), y_train)
        new_score = np.sqrt(mean_squared_log_error(y_test, pd.Series(model.predict(X_test.drop(columns=drop_list))).map(lambda x: 0 if x < 0 else x)))

        if show_info:
            print(old_score, new_score, drop_list)
        
        
        if old_score - new_score < tol:
            if method == 'backward':
                drop_list.remove(last_feature)   
            else:
                drop_list.append(last_feature)
        else:
            old_score = new_score

    return old_score, new_score, drop_list
old_score, new_score, add_drop_list1 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='backward', suspicion_list=show_feature_imp(xgb, X_train).index[::-1])
add_drop_list1
old_score, new_score, add_drop_list2 = step_wise_reg(xgb, X_train, X_test, y_train, y_test, dilution_ratio=1, method='forward', tol=0.0001,suspicion_list=show_feature_imp(xgb, X_train).index[1:])
add_drop_list2

总结:

逐步回归十分好用,过一遍特征,误差就降下去了,推测特征被筛出去的原因主要是:

  1. 和其他特征的信息有一定相关性,即存在冗余信息。简单计算一下,就算全部是二分类特征,2的20次方也能表示出100万个不同的样本了,因此对于此次只有40万条数据的数据集来说,50+的特征绝对是存在冗余的,会干扰模型判断。

  2. 本身是个好特征,但没有经过恰当的分箱/编码处理,导致蕴含的信息不能传达给模型,反而起到了负提升。这就只有根据业务逻辑反复尝试了,一边尝试分箱,一边要回来调参。

  3. 本身是好特征,预处理也到位了,但模型进行逐步回归时没有同步调参,导致误差反而上升。尤其是前向逐步回归,从1个特征增加到50个特征,跨度如此之大,模型需要的参数必然是不同的。

以上2、3两点就造成了误筛。所以使用逐步回归时要注意:被筛出的特征也不一定就是没用的特征,还得自己判断。

5. 最终成果

show_scores(y_test, xgb.predict(X_test))
{'R2': 0.884087856079633,
 'MAE': 5745.138324664967,
 'MSE': 79616854.10269143,
 'RMSLE': 0.23743323282056006}
xgb_imp = show_feature_imp(xgb, X_train)
xgb_imp
imp
CouplerSystem__GrouserTracks__HydraulicsFlow 0.397497
ProductSize 0.249447
PP 0.079305
Ride_Control 0.042499
fiManufacturerID 0.037125
YearMade 0.031770
PrimaryLower 0.027179
ProductGroupDesc_father 0.015217
SaleYear 0.012649
fiModelDesc 0.012630
Blade_Width 0.010005
Pushblock__Scarifier__TipControl 0.009107
Drive_System 0.006723
Ripper 0.006685
Enclosure 0.005481
Blade__Blade__Enclosure__Engine 0.004808
Pad_Type 0.004427
fiProductClassDesc 0.004194
ProductGroupDesc 0.003965
Travel_Controls 0.003888
Tire_Size 0.003268
Pattern_Changer 0.003167
ModelID 0.002938
UsageBand 0.002606
BBT 0.002432
Steering_Controls 0.002027
Turbocharged 0.001827
fiBaseModel 0.001778
Blade_Type 0.001742
Undercarriage_Pad_Width 0.001496
Stick_Length 0.001339
TUSTPG 0.001334
Thumb 0.001193
Coupler 0.001181
Grouser_Type 0.001138
MachineHoursCurrentMeter 0.001100
state 0.001093
auctioneerID_is_missing 0.001045
Enclosure_Type 0.001007
auctioneerID 0.000875
datasource 0.000812

【Python机器学习】回归模型:推土机售价预测_第5张图片

你可能感兴趣的:(机器学习,机器学习,python,回归,XGBOOST,逐步回归)