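ML / Regression: a detailed guide to predicting used-car transaction prices by regression, using data preprocessing (handling missing values / outliers / special values, long-tail-to-normal transformation, log transform of the target, bar / box / violin plot visualization, feature construction, feature selection)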
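Contents
Used-car transaction price prediction
Competition background
Field descriptions
Predicting used-car transaction prices with LightGBM after data preprocessing
# 1. Define the dataset
# 1.1 Load the training and test sets
# 1.2 Take a quick look at the data
# 1.3 Separate features and label
# 1.4 Merge the training and test sets (tagging each row's source) so that later steps (feature processing, feature construction, etc.) are applied to both
# 1.5 Split features by type
# B1.7 Correct column data types
# B1.8 Recount after the correction
# T1.1 List the sub-categories of each [categorical] feature
# T1.2 Count the diversity of each [categorical] feature
# 2. Feature engineering / dataset preprocessing
# 2.1 Missing-value analysis and handling
# 2.1.1 Missing-value statistics
# T1 Bar chart of the non-null sample count for every feature
# T2 Bar chart of the null ratio for features that have missing values
# 2.1.2 Missing-value filling
# T1 Filling missing values for the two main data types
# 2.2 Outlier analysis and handling
# T2 Removing outlier samples based on the 3-Sigma standard deviation, with before/after box plots
# 2.3 Special-value analysis and handling
# T1 Replacing a special character in a given column
# 2.4 Special-column analysis and handling
# 2.4.1 Finding columns with severely imbalanced / skewed distributions
# 2.5 Variable-distribution analysis and handling
# 2.5.1 Computing and visualizing skewness (skew) and kurtosis (kurt) for all variables
# 2.5.2 Transforming long-tailed [numeric] features toward a normal distribution
# 2.6 Target-variable analysis and handling
# 2.6.1 Inspecting the distribution of the target variable
# 2.6.2 Computing skew and kurt of the target variable
# 2.6.3 Log transform of the target variable
# 2.7 [Categorical] feature analysis
# 2.7.1 Richness statistics for each feature, with visualization
# 2.7.2 Bar / box / violin plots of each feature against the target variable
# 2.8 [Numeric] feature analysis and handling
# 2.8.1 Visualizing the distributions of [numeric] features
# 2.8.2 Correlation analysis of [numeric] features
# T1 PCC (Pearson correlation coefficient) heat map between [numeric] features
# T3 Scatter plots between [numeric] features
# 2.9 Feature construction
# 2.10 Data normalization
# 2.11 Define the model input features
# 2.11.1 Dropping features
# 2.11.2 Feature selection
# T2 Wrapper methods
# T3 Embedded methods (most commonly used)
# 2.12 Export the model-ready dataset
3. Model training and validation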
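Official page: 零基础入门数据挖掘 - 二手车交易价格预测 (Introduction to Data Mining from Scratch: Used-Car Transaction Price Prediction), learning competition, problem statement and data, Tianchi Competition, Alibaba Cloud Tianchi
The competition is set in the used-car market and asks participants to predict the transaction price of used cars.
The data comes from used-car transaction records on a trading platform. The full dataset holds more than 400,000 records with 31 columns, 15 of which are anonymous variables. To keep the competition fair, 150,000 records are drawn as the training set, 50,000 as test set A, and 50,000 as test set B, and fields such as name, model, brand, and regionCode are anonymized.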
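Field | Description |
SaleID | Transaction ID, unique code |
name | Car listing name, anonymized (car code) |
regDate | Registration date, e.g. 20160101 means January 1, 2016 |
model | Model code, anonymized |
brand | Car brand, anonymized |
bodyType | Body type: limousine: 0, microcar: 1, van: 2, bus: 3, convertible: 4, coupe: 5, business vehicle: 6, mixer truck: 7 |
fuelType | Fuel type: gasoline: 0, diesel: 1, LPG: 2, natural gas: 3, hybrid: 4, other: 5, electric: 6 |
gearbox | Gearbox: manual: 0, automatic: 1 |
power | Engine power, range [0, 600] |
kilometer | Distance driven, in units of 10,000 km |
notRepairedDamage | Car has unrepaired damage: yes: 0, no: 1 |
regionCode | Region code, anonymized |
seller | Seller: individual: 0, non-individual: 1 |
offerType | Offer type: offer: 0, request: 1 |
creatDate | Listing date, i.e. when the car went on sale |
price | Used-car transaction price (prediction target) |
v-series features | Anonymous features, 15 in total, v0-14 (columns v_0 to v_14) |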
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
0 | 736 | 20040402 | 30 | 6 | 1 | 0 | 0 | 60 | 12.5 | 0 | 1046 | 0 | 0 | 20160404 | 1850 | 43.35779631 | 3.966344166 | 0.050257094 | 2.159744094 | 1.143786187 | 0.235675907 | 0.101988241 | 0.129548661 | 0.022816367 | 0.097461829 | -2.881803239 | 2.804096771 | -2.420820793 | 0.795291943 | 0.9147625 |
1 | 2262 | 20030301 | 40 | 1 | 2 | 0 | 0 | 0 | 15 | - | 4366 | 0 | 0 | 20160309 | 3600 | 45.30527302 | 5.236111898 | 0.137925324 | 1.38065746 | -1.422164921 | 0.264777256 | 0.121003594 | 0.135730707 | 0.026597448 | 0.020581663 | -4.900481882 | 2.096337644 | -1.030482837 | -1.722673775 | 0.245522411 |
2 | 14874 | 20040403 | 115 | 15 | 1 | 0 | 0 | 163 | 12.5 | 0 | 2806 | 0 | 0 | 20160402 | 6222 | 45.97835906 | 4.823792215 | 1.319524152 | -0.998467274 | -0.996911035 | 0.251410148 | 0.114912277 | 0.165147493 | 0.062172837 | 0.027074824 | -4.84674926 | 1.803558941 | 1.565329625 | -0.832687327 | -0.229962856 |
3 | 71865 | 19960908 | 109 | 10 | 0 | 0 | 1 | 193 | 15 | 0 | 434 | 0 | 0 | 20160312 | 2400 | 45.6874782 | 4.492574134 | -0.050615843 | 0.883599671 | -2.228078725 | 0.274293171 | 0.110300085 | 0.121963746 | 0.033394547 | 0 | -4.509598824 | 1.285939744 | -0.501867908 | -2.438352737 | -0.478699379 |
4 | 111080 | 20120103 | 110 | 5 | 1 | 0 | 0 | 68 | 5 | 0 | 6977 | 0 | 0 | 20160313 | 5200 | 44.38351084 | 2.031433258 | 0.572168948 | -1.571239028 | 2.246088325 | 0.228035622 | 0.073205054 | 0.091880479 | 0.078819385 | 0.121534241 | -1.896240279 | 0.910783134 | 0.931109559 | 2.83451782 | 1.923481963 |
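Steps 1.1 to 1.4 of the outline (load, separate the label, merge with a source tag) can be sketched roughly as follows. This is only an illustration: the file path in the comment is hypothetical, the space separator is an assumption about the competition files, and the single test row is made up so that the snippet runs without the real data.

```python
import io
import pandas as pd

# Rows in the same layout as the sample above (columns truncated for brevity).
# With the real files one would presumably call something like
#   train = pd.read_csv("used_car_train.csv", sep=" ")   # hypothetical path
train_txt = """SaleID name regDate model brand power kilometer notRepairedDamage price
0 736 20040402 30.0 6 60 12.5 0.0 1850
1 2262 20030301 40.0 1 0 15.0 - 3600
"""
# Made-up test row: the test set has the same columns minus price.
test_txt = """SaleID name regDate model brand power kilometer notRepairedDamage
150000 66932 20111212 222.0 4 313 15.0 0.0
"""

train = pd.read_csv(io.StringIO(train_txt), sep=" ")
test = pd.read_csv(io.StringIO(test_txt), sep=" ")

# Separate the label, then tag each frame with its origin before stacking them,
# so that later feature processing is applied to both parts at once.
y_train = train.pop("price")
train["source"] = "train"
test["source"] = "test"
all_data = pd.concat([train, test], ignore_index=True)
print(all_data)
```

The `source` column makes it possible to split `all_data` back into its two parts once preprocessing is done.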
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 150000 non-null int64
1 name 150000 non-null int64
2 regDate 150000 non-null int64
3 model 149999 non-null float64
4 brand 150000 non-null int64
5 bodyType 145494 non-null float64
6 fuelType 141320 non-null float64
7 gearbox 144019 non-null float64
8 power 150000 non-null int64
9 kilometer 150000 non-null float64
10 notRepairedDamage 150000 non-null object
11 regionCode 150000 non-null int64
12 seller 150000 non-null int64
13 offerType 150000 non-null int64
14 creatDate 150000 non-null int64
15 price 150000 non-null int64
16 v_0 150000 non-null float64
17 v_1 150000 non-null float64
18 v_2 150000 non-null float64
19 v_3 150000 non-null float64
20 v_4 150000 non-null float64
21 v_5 150000 non-null float64
22 v_6 150000 non-null float64
23 v_7 150000 non-null float64
24 v_8 150000 non-null float64
25 v_9 150000 non-null float64
26 v_10 150000 non-null float64
27 v_11 150000 non-null float64
28 v_12 150000 non-null float64
29 v_13 150000 non-null float64
30 v_14 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
used_car.info:
None
used_car.shape: (150000, 31) 31 150000
used_car.columns:
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
'v_13', 'v_14'],
dtype='object')
used_car.dtypes:
float64 20
int64 10
object 1
dtype: int64
used_car.head:
SaleID name regDate model ... v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 ... 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 ... 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 ... 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 ... 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 ... 0.910783 0.931110 2.834518 1.923482
149995 149995 163978 20000607 121.0 ... -2.983973 0.589167 -1.304370 -0.302592
149996 149996 184535 20091102 116.0 ... -2.774615 2.553994 0.924196 -0.272160
149997 149997 147587 20101003 60.0 ... -1.630677 2.290197 1.891922 0.414931
149998 149998 45907 20060312 34.0 ... -2.633719 1.414937 0.431981 -1.659014
149999 149999 177672 19990204 19.0 ... -3.179913 0.031724 -1.483350 -0.342674
[10 rows x 31 columns]
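The quick look above can be reproduced with a handful of pandas calls. The sketch below uses a small stand-in frame (an assumption, not the real data). Note that `DataFrame.info()` prints its summary and returns `None`, which is presumably why the output above shows `used_car.info:` followed by `None`.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the real used_car has 150000 rows and 31 columns.
used_car = pd.DataFrame({
    "SaleID": np.arange(12),
    "power": np.arange(12) * 10,
    "kilometer": np.linspace(0.5, 15.0, 12),
})

used_car.info()                                   # prints the summary, returns None
print("used_car.shape:", used_car.shape)
print("used_car.columns:\n", used_car.columns)
print("used_car.dtypes:\n", used_car.dtypes.value_counts())

# First five and last five rows stacked together (10 rows in total).
head_tail = pd.concat([used_car.head(), used_car.tail()])
print(head_tail)
```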
statistic | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
count | 150000 | 150000 | 150000 | 149999 | 150000 | 145494 | 141320 | 144019 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 | 150000 |
mean | 74999.5 | 68349.17287 | 20034170.51 | 47.12902086 | 8.052733333 | 1.792369445 | 0.375842061 | 0.224942542 | 119.3165467 | 12.59716 | 2583.077267 | 6.67E-06 | 0 | 20160330.79 | 5923.327333 | 44.40626753 | -0.044809123 | 0.080765058 | 0.078833423 | 0.017874615 | 0.248203528 | 0.044923004 | 0.124692461 | 0.058143855 | 0.061995895 | -0.001000239 | 0.009034543 | 0.004812595 | 0.000312612 | -0.000688231 |
std | 43301.41453 | 61103.87509 | 53649.87926 | 49.53603965 | 7.864956341 | 1.760639503 | 0.548676623 | 0.417545932 | 177.1684192 | 3.919575532 | 1885.363218 | 0.002581989 | 0 | 106.7328088 | 7501.998477 | 2.457547906 | 3.641893018 | 2.929617945 | 2.026514036 | 1.193661387 | 0.045803971 | 0.051742787 | 0.20140953 | 0.029185756 | 0.035691979 | 3.772386394 | 3.286071221 | 2.517477676 | 1.288987639 | 1.038685151 |
min | 0 | 0 | 19910001 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0 | 20150618 | 11 | 30.45197649 | -4.295588903 | -4.47067143 | -7.275036707 | -4.364565242 | 0 | 0 | 0 | 0 | 0 | -9.16819241 | -5.558206704 | -9.639552114 | -4.153898796 | -6.546555965 |
25% | 37499.75 | 11156 | 19990912 | 10 | 1 | 0 | 0 | 0 | 75 | 12.5 | 1018 | 0 | 0 | 20160313 | 1300 | 43.13579888 | -3.192349286 | -0.9706712 | -1.462580044 | -0.921191484 | 0.243615353 | 3.81E-05 | 0.062473533 | 0.035333687 | 0.033930177 | -3.72230288 | -1.951543007 | -1.871845761 | -1.057788984 | -0.437033668 |
50% | 74999.5 | 51638 | 20030912 | 30 | 6 | 1 | 0 | 0 | 110 | 15 | 2196 | 0 | 0 | 20160321 | 3250 | 44.61026572 | -3.052671416 | -0.38294689 | 0.099721985 | -0.075910429 | 0.257797966 | 0.000812059 | 0.095865898 | 0.057013598 | 0.058483667 | 1.624076331 | -0.358052697 | -0.130753318 | -0.036244604 | 0.141245993 |
75% | 112499.25 | 118841.25 | 20071109 | 66 | 13 | 3 | 1 | 0 | 150 | 15 | 3843 | 0 | 0 | 20160329 | 7700 | 46.0047209 | 4.000669795 | 0.241334852 | 1.565838202 | 0.868758435 | 0.265297259 | 0.102009298 | 0.125242945 | 0.079381571 | 0.087490548 | 2.844356776 | 1.255021657 | 1.776932949 | 0.942813083 | 0.680378075 |
max | 149999 | 196812 | 20151212 | 247 | 39 | 7 | 6 | 1 | 19312 | 15 | 8120 | 1 | 0 | 20160407 | 99999 | 52.30417826 | 7.320308375 | 19.0354965 | 9.854701534 | 6.82935164 | 0.291838113 | 0.151419596 | 1.404936375 | 0.160790985 | 0.222787488 | 12.35701062 | 18.81904247 | 13.84779152 | 11.14766861 | 8.658417877 |
float64 20 ['model', 'bodyType', 'fuelType', 'gearbox', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
int32 0 []
int64 10 ['SaleID', 'name', 'regDate', 'brand', 'power', 'regionCode', 'seller', 'offerType', 'creatDate', 'price']
object_category_bool 1 ['notRepairedDamage']
others 0 []
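A listing like the one above, with columns grouped by dtype, can be produced along these lines. The helper name `columns_by_dtype` is made up for this illustration, and the toy frame stands in for the real data.

```python
import pandas as pd

df = pd.DataFrame({
    "model": [30.0, 40.0, None],
    "brand": [6, 1, 15],
    "notRepairedDamage": ["0.0", "-", "0.0"],
})

def columns_by_dtype(frame):
    """Return {dtype name: [column names]} for every dtype present in the frame."""
    groups = {}
    for col, dtype in frame.dtypes.items():
        groups.setdefault(str(dtype), []).append(col)
    return groups

dtype_groups = columns_by_dtype(df)
for name, cols in dtype_groups.items():
    print(name, len(cols), cols)
```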
Columns restored to their correct data types:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 150000 non-null int64
1 name 150000 non-null int64
2 regDate 150000 non-null int64
3 model 149999 non-null object
4 brand 150000 non-null object
5 bodyType 145494 non-null object
6 fuelType 141320 non-null object
7 gearbox 144019 non-null object
8 power 150000 non-null int64
9 kilometer 150000 non-null float64
10 notRepairedDamage 150000 non-null object
11 regionCode 150000 non-null int64
12 seller 150000 non-null int64
13 offerType 150000 non-null int64
14 creatDate 150000 non-null int64
15 price 150000 non-null int64
16 v_0 150000 non-null float64
17 v_1 150000 non-null float64
18 v_2 150000 non-null float64
19 v_3 150000 non-null float64
20 v_4 150000 non-null float64
21 v_5 150000 non-null float64
22 v_6 150000 non-null float64
23 v_7 150000 non-null float64
24 v_8 150000 non-null float64
25 v_9 150000 non-null float64
26 v_10 150000 non-null float64
27 v_11 150000 non-null float64
28 v_12 150000 non-null float64
29 v_13 150000 non-null float64
30 v_14 150000 non-null float64
dtypes: float64(16), int64(9), object(6)
memory usage: 35.5+ MB
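The dtype correction shown above moves code-like columns (model, brand, bodyType, fuelType, gearbox) from numeric dtypes to `object`, so that they are treated as categories rather than quantities. A minimal sketch on a toy frame, assuming the list of categorical columns is chosen by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "model": [30.0, 40.0, None],
    "brand": [6, 1, 15],
    "gearbox": [0.0, 1.0, None],
    "power": [60, 0, 163],
})

# Columns that hold category codes rather than measured quantities (chosen by hand).
cat_cols = ["model", "brand", "gearbox"]
df = df.astype({col: "object" for col in cat_cols})

print(df.dtypes)
print(df.dtypes.value_counts())
```

Missing values stay missing after the cast, so the non-null counts do not change.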
model | counts | brand | counts | bodyType | counts | fuelType | counts | gearbox | counts | notRepairedDamage | counts |
0 | 11762 | 0 | 31480 | 0 | 41420 | 0 | 91656 | 0 | 111623 | 0.0 | 111361 |
19 | 9573 | 4 | 16737 | 1 | 35272 | 1 | 46991 | 1 | 32396 | - | 24324 |
4 | 8445 | 14 | 16089 | 2 | 30324 | 2 | 2212 | null | 5981 | 1.0 | 14315 |
1 | 6038 | 10 | 14249 | 3 | 13491 | 3 | 262 | | | null | 0 |
29 | 5186 | 1 | 13794 | 4 | 9609 | 4 | 118 | ||||
48 | 5052 | 6 | 10217 | 5 | 7607 | 5 | 45 | ||||
40 | 4502 | 9 | 7306 | 6 | 6482 | 6 | 36 | ||||
26 | 4496 | 5 | 4665 | 7 | 1289 | null | 8680 | ||||
8 | 4391 | 13 | 3817 | null | 4506 | ||||||
31 | 3827 | 11 | 2945 | ||||||||
13 | 3762 | 3 | 2461 | ||||||||
17 | 3121 | 7 | 2361 | ||||||||
65 | 2730 | 16 | 2223 | ||||||||
49 | 2608 | 8 | 2077 | ||||||||
46 | 2454 | 25 | 2064 | ||||||||
30 | 2342 | 27 | 2053 | ||||||||
44 | 2195 | 21 | 1547 | ||||||||
5 | 2063 | 15 | 1458 | ||||||||
10 | 2004 | 19 | 1388 | ||||||||
21 | 1872 | 20 | 1236 | ||||||||
73 | 1789 | 12 | 1109 | ||||||||
11 | 1775 | 22 | 1085 | ||||||||
23 | 1696 | 26 | 966 | ||||||||
22 | 1524 | 30 | 940 | ||||||||
69 | 1522 | 17 | 913 | ||||||||
63 | 1469 | 24 | 772 | ||||||||
7 | 1460 | 28 | 649 | ||||||||
16 | 1349 | 32 | 592 | ||||||||
88 | 1309 | 29 | 406 | ||||||||
66 | 1250 | 37 | 333 | ||||||||
60 | 1177 | 2 | 321 | ||||||||
67 | 1084 | 31 | 318 | ||||||||
41 | 1078 | 18 | 316 | ||||||||
104 | 1020 | 36 | 228 | ||||||||
87 | 965 | 34 | 227 | ||||||||
115 | 927 | 33 | 218 | ||||||||
3 | 920 | 23 | 186 | ||||||||
121 | 811 | 35 | 180 | ||||||||
32 | 705 | 38 | 65 | ||||||||
77 | 675 | 39 | 9 | ||||||||
98 | 662 | null | 0 | ||||||||
247 | 1 | ||||||||||
null | 1 |
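The sub-category counts in the table above, including the `null` rows, are the kind of result `value_counts(dropna=False)` produces. The following is a sketch on toy values, not the author's code:

```python
import pandas as pd

df = pd.DataFrame({
    "gearbox": [0.0, 0.0, 1.0, None, 0.0],
    "notRepairedDamage": ["0.0", "-", "1.0", "0.0", "0.0"],
})

# Per-column sub-category counts; dropna=False keeps a row for missing values.
sub_counts = {col: df[col].value_counts(dropna=False) for col in df.columns}
for col, counts in sub_counts.items():
    print(col, counts.to_dict())

# Diversity: number of distinct non-null values per column.
diversity = df.nunique()
print(diversity)
```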
{'fuelType': 0.057866666666666663, 'gearbox': 0.03987333333333333, 'bodyType': 0.03004, 'model': 6.666666666666667e-06}
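The dictionary above holds the fraction of nulls per column; for fuelType, 8680 / 150000 ≈ 0.0579. A minimal way to compute such a dictionary, shown on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "fuelType": [0.0, None, 1.0, None],
    "gearbox": [0.0, 1.0, None, 0.0],
    "power": [60, 0, 163, 193],
})

# Fraction of nulls per column; keep only columns that have any, largest first.
ratio = df.isnull().mean()
missing_ratio = ratio[ratio > 0].sort_values(ascending=False).to_dict()
print(missing_ratio)  # {'fuelType': 0.5, 'gearbox': 0.25}
```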
-------------------before fillna: SaleID 0 name 0 regDate 0 model 1 brand 0 bodyType 4506 fuelType 8680 gearbox 5981 power 0 kilometer 0 notRepairedDamage 0 regionCode 0 seller 0 offerType 0 creatDate 0 price 0 v_0 0 v_1 0 v_2 0 v_3 0 v_4 0 v_5 0 v_6 0 v_7 0 v_8 0 v_9 0 v_10 0 v_11 0 v_12 0 v_13 0 v_14 0 dtype: int64 |
-------------------after fillna: SaleID 0 name 0 regDate 0 model 0 brand 0 bodyType 0 fuelType 0 gearbox 0 power 0 kilometer 0 notRepairedDamage 0 regionCode 0 seller 0 offerType 0 creatDate 0 price 0 v_0 0 v_1 0 v_2 0 v_3 0 v_4 0 v_5 0 v_6 0 v_7 0 v_8 0 v_9 0 v_10 0 v_11 0 v_12 0 v_13 0 v_14 0 dtype: int64 |
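Filling by data type can be sketched as below. The `"missing"` level for categorical columns matches the encoder dictionary shown later in this article. The article does not show what was used for numeric columns, so the median here is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "bodyType": pd.Series([1.0, None, 2.0], dtype="object"),
    "power": [60.0, None, 163.0],
})
print("before fillna:", df.isnull().sum().to_dict())

cat_cols = ["bodyType"]   # category-type columns
num_cols = ["power"]      # numeric columns

# Categories get an explicit "missing" level.
for col in cat_cols:
    df[col] = df[col].fillna("missing")
# Numbers get the column median (assumed strategy).
for col in num_cols:
    df[col] = df[col].fillna(df[col].median())

print("after fillna:", df.isnull().sum().to_dict())
```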
3-Sigma,Delete number is: 963
Now column number is: 149037
outliers_low: Description of data less than the lower bound is:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: power, dtype: float64
outliers_up: Description of data larger than the upper bound is:
count 963.000000
mean 846.836968
std 1929.418081
min 376.000000
25% 400.000000
50% 436.000000
75% 514.000000
max 19312.000000
Name: power, dtype: float64
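A note on the bounds: the describe table gives power a mean of about 119.3 and a standard deviation of about 177.2, so mean + 3·std would be roughly 651, yet the smallest removed value above is 376. With the quartiles Q1 = 75 and Q3 = 150, a box-plot rule Q3 + 3·IQR gives exactly 375, which matches the output. The sketch below therefore implements the IQR variant with scale 3. That choice is inferred from the numbers, not confirmed by the source, and the function name is made up.

```python
import pandas as pd

def remove_box_outliers(frame, col, scale=3.0):
    """Drop rows whose `col` lies outside [Q1 - scale*IQR, Q3 + scale*IQR]."""
    q1, q3 = frame[col].quantile(0.25), frame[col].quantile(0.75)
    iqr = q3 - q1
    low, up = q1 - scale * iqr, q3 + scale * iqr
    outside = (frame[col] < low) | (frame[col] > up)
    print("Delete number is:", int(outside.sum()))
    kept = frame[~outside].reset_index(drop=True)
    print("Now row number is:", len(kept))
    return kept, (low, up)

# Toy power values chosen so that Q1 = 75 and Q3 = 150, as in the real column.
toy = pd.DataFrame({"power": [60, 75, 75, 100, 110, 120, 150, 376, 19312]})
cleaned, (low, up) = remove_box_outliers(toy, "power")
print(low, up)
```

On this toy column the bounds come out as -150 and 375, so only 376 and 19312 are dropped.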
df_train:
0.0 135685
1.0 14315
Name: notRepairedDamage, dtype: int64
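The counts above are consistent with the `-` placeholder being folded into 0.0 (111361 + 24324 = 135685). A minimal sketch of that replacement on toy values:

```python
import pandas as pd

s = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

# Treat the "-" placeholder as 0.0 (the majority value), then restore a numeric dtype.
s = s.replace("-", "0.0").astype(float)
print(s.value_counts())
```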
seller
0 149999
1 1
Name: seller, dtype: int64
offerType
0 150000
Name: offerType, dtype: int64
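seller has a single row with value 1 and offerType has only one value, so neither column can help a model. One way to find such columns automatically is to look at the share of the most frequent value. The helper name and the 0.95 threshold below are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "seller": [0] * 99 + [1],
    "offerType": [0] * 100,
    "gearbox": [0] * 70 + [1] * 30,
})

def skewed_columns(frame, threshold=0.95):
    """Columns whose most frequent value covers more than `threshold` of the rows."""
    top_share = {c: frame[c].value_counts(normalize=True).iloc[0] for c in frame.columns}
    return [c for c, share in top_share.items() if share > threshold]

to_drop = skewed_columns(df)
print(to_drop)                 # ['seller', 'offerType']
df = df.drop(columns=to_drop)
```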
price Skewness: 3.3464867626369608
price Kurtosis: 18.995183355632562
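A skewness above 3 and a kurtosis near 19 indicate a long right tail, which is the usual motivation for log-transforming the target. The sketch below uses made-up prices and `np.log1p` / `np.expm1`. Whether the author used `log1p` or a plain `log` is not shown, so treat that choice as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed prices: many cheap cars and a few very expensive ones.
price = pd.Series([1300, 1850, 2400, 3250, 3600, 5200, 6222, 7700, 30000, 99999])

print("price Skewness:", price.skew())
print("price Kurtosis:", price.kurt())

# log1p compresses the long right tail; expm1 maps predictions back to prices.
log_price = np.log1p(price)
print("log price Skewness:", log_price.skew())
restored = np.expm1(log_price)
```

A model trained on `log_price` predicts in log space, so its outputs need `np.expm1` before they are compared with real prices.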
corr sort_values
price 1.000000
v_12 0.692823
v_8 0.685798
v_0 0.628397
regDate 0.611959
power 0.219834
v_5 0.164317
v_2 0.085322
v_6 0.068970
v_1 0.060914
v_14 0.035911
regionCode 0.014036
creatDate 0.002955
name 0.002030
SaleID -0.001043
seller -0.002004
v_13 -0.013993
brand -0.043799
v_7 -0.053024
v_4 -0.147085
v_9 -0.206205
v_10 -0.246175
v_11 -0.275320
kilometer -0.440519
v_3 -0.730946
offerType NaN
Name: price, dtype: float64
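A ranking like the one above is what `DataFrame.corr()` (Pearson by default) gives when the price column is sorted. The sketch below uses synthetic data, so the numbers are not the article's. It also shows why offerType is `NaN`: a constant column has zero variance, so its correlation is undefined.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "kilometer": rng.uniform(0.5, 15, n),
    "v_12": rng.normal(size=n),
})
df["offerType"] = 0   # constant column, as in the real data
df["price"] = 5000 + 800 * df["v_12"] - 200 * df["kilometer"] + rng.normal(0, 100, n)

# Pearson correlation of every column with price, strongest positive first.
corr_to_price = df.corr()["price"].sort_values(ascending=False)
print(corr_to_price)
```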
Int64Index: 150000 entries, 0 to 149999
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 150000 non-null float64
1 name 150000 non-null float64
2 regDate 150000 non-null float64
3 model 150000 non-null int32
4 brand 150000 non-null float64
5 bodyType 150000 non-null int32
6 fuelType 150000 non-null int32
7 gearbox 150000 non-null int32
8 power 150000 non-null float64
9 kilometer 150000 non-null float64
10 notRepairedDamage 150000 non-null int32
11 regionCode 150000 non-null float64
12 seller 150000 non-null float64
13 offerType 150000 non-null float64
14 creatDate 150000 non-null float64
15 price 150000 non-null int64
16 v_0 150000 non-null float64
17 v_1 150000 non-null float64
18 v_2 150000 non-null float64
19 v_3 150000 non-null float64
20 v_4 150000 non-null float64
21 v_5 150000 non-null float64
22 v_6 150000 non-null float64
23 v_7 150000 non-null float64
24 v_8 150000 non-null float64
25 v_9 150000 non-null float64
26 v_10 150000 non-null float64
27 v_11 150000 non-null float64
28 v_12 150000 non-null float64
29 v_13 150000 non-null float64
30 v_14 150000 non-null float64
31 city 150000 non-null int32
32 used_time 150000 non-null float64
33 brand_amount 150000 non-null float64
34 price_max_GBYbrand 150000 non-null float64
35 price_median_GBYbrand 150000 non-null float64
36 price_min_GBYbrand 150000 non-null float64
37 price_sum_GBYbrand 150000 non-null float64
38 price_std_GBYbrand 150000 non-null float64
39 price_average_GBYbrand 150000 non-null float64
40 power_bin 150000 non-null float64
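The ten new columns (city, used_time, the brand statistics, power_bin) can be built roughly as follows. The exact rules are not shown in the article, so everything here is an assumption inferred from the column names and from the encoder dictionary further down: city as the leading digit of regionCode, used_time in days, and 30 power buckets of width 10.

```python
import pandas as pd

df = pd.DataFrame({
    "regDate": [20040402, 20030301, 19910001],   # the last date has month "00"
    "creatDate": [20160404, 20160309, 20160312],
    "regionCode": [1046, 4366, 434],
    "brand": [6, 1, 6],
    "power": [60, 0, 163],
    "price": [1850, 3600, 6222],
})

# used_time: days between registration and listing; unparseable dates become NaN.
reg = pd.to_datetime(df["regDate"].astype(str), format="%Y%m%d", errors="coerce")
created = pd.to_datetime(df["creatDate"].astype(str), format="%Y%m%d", errors="coerce")
df["used_time"] = (created - reg).dt.days

# city: what is left of regionCode after dropping its last three digits (assumed rule).
df["city"] = df["regionCode"].astype(str).str[:-3].replace("", "missing")

# Per-brand price statistics, merged back onto every row of that brand.
stats = df.groupby("brand")["price"].agg(
    brand_amount="count",
    price_max_GBYbrand="max",
    price_median_GBYbrand="median",
    price_min_GBYbrand="min",
    price_sum_GBYbrand="sum",
    price_std_GBYbrand="std",
    price_average_GBYbrand="mean",
).reset_index()
df = df.merge(stats, on="brand", how="left")

# power_bin: buckets (0, 10], (10, 20], ..., (290, 300]; values outside become NaN.
df["power_bin"] = pd.cut(df["power"], bins=list(range(0, 301, 10)), labels=False)
print(df.T)
```

In a real pipeline the brand statistics should be computed from training rows only, since test rows have no price.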
catcols2LabelEncoder: 7 ['model', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'city', 'power_bin']
LEDict
{'model': {'0.0': 0, '1.0': 1, '10.0': 2, '100.0': 3, '101.0': 4, …… '93.0': 241, '94.0': 242, '95.0': 243, '96.0': 244, '97.0': 245, '98.0': 246, '99.0': 247, 'missing': 248},
'bodyType': {'0.0': 0, '1.0': 1, '2.0': 2, '3.0': 3, '4.0': 4, '5.0': 5, '6.0': 6, '7.0': 7, 'missing': 8},
'fuelType': {'0.0': 0, '1.0': 1, '2.0': 2, '3.0': 3, '4.0': 4, '5.0': 5, '6.0': 6, 'missing': 7},
'gearbox': {'0.0': 0, '1.0': 1, 'missing': 2},
'notRepairedDamage': {'0.0': 0, '1.0': 1},
'city': {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5, '7': 6, '8': 7, 'missing': 8},
'power_bin': {'0.0': 0, '1.0': 1, '10.0': 2, '11.0': 3, '12.0': 4, '13.0': 5, '14.0': 6, '15.0': 7, '16.0': 8, '17.0': 9, '18.0': 10, '19.0': 11, '2.0': 12, '20.0': 13, '21.0': 14, '22.0': 15, '23.0': 16, '24.0': 17, '25.0': 18, '26.0': 19, '27.0': 20, '28.0': 21, '29.0': 22, '3.0': 23, '4.0': 24, '5.0': 25, '6.0': 26, '7.0': 27, '8.0': 28, '9.0': 29, 'missing': 30}}
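In the mappings above, `'10.0'` sorts before `'2.0'`, which means the labels were ordered as strings rather than as numbers. The plain-pandas sketch below reproduces that behavior. It is a stand-in, not necessarily the encoder the author used, and the helper name `label_encode` is made up.

```python
import pandas as pd

def label_encode(frame, cols):
    """Encode each column in `cols` as integers; return the frame and the mappings."""
    out, mappings = frame.copy(), {}
    for col in cols:
        as_text = out[col].astype(str)
        # Sorting the string labels gives the lexicographic order seen above.
        labels = sorted(as_text.unique())
        mappings[col] = {label: code for code, label in enumerate(labels)}
        out[col] = as_text.map(mappings[col]).astype("int32")
    return out, mappings

df = pd.DataFrame({
    "gearbox": [0.0, 1.0, "missing", 0.0],
    "power_bin": [2.0, 10.0, 1.0, "missing"],
})
encoded, LEDict = label_encode(df, ["gearbox", "power_bin"])
print(LEDict)
print(encoded)
```

Because the codes follow string order, they carry no numeric meaning and should be treated as category identifiers.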
after Encoder None
SaleID name ... price_average_GBYbrand power_bin
0 0.000000 0.003740 ... 0.073848 0
1 0.000007 0.011493 ... 0.234956 4
2 0.000013 0.075575 ... 0.251439 3
3 0.000020 0.365145 ... 0.212120 3
4 0.000027 0.564396 ... 0.065144 0
... ... ... ... ... ...
149995 0.999973 0.833171 ... 0.212120 3
149996 0.999980 0.937621 ... 0.100505 2
149997 0.999987 0.749888 ... 0.100505 1
149998 0.999993 0.233253 ... 0.212120 3
149999 1.000000 0.902750 ... 0.135830 3
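The values above look like min-max scaling: SaleID runs from 0 to 1 in steps of 1/149999 ≈ 0.000007. A minimal sketch of that scaling in plain pandas follows; the helper name is made up, and sending constant columns to 0 is a choice made here.

```python
import pandas as pd

df = pd.DataFrame({
    "SaleID": [0, 1, 2],
    "kilometer": [12.5, 15.0, 0.5],
})

def min_max_scale(frame, cols):
    """Rescale each column in `cols` to [0, 1]; constant columns become 0."""
    out = frame.copy()
    for col in cols:
        lo, hi = out[col].min(), out[col].max()
        span = hi - lo
        out[col] = (out[col] - lo) / span if span != 0 else 0.0
    return out

scaled = min_max_scale(df, ["SaleID", "kilometer"])
print(scaled)
```

With the real range of kilometer (0.5 to 15), a value of 12.5 maps to (12.5 - 0.5) / 14.5 ≈ 0.8276.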
k_featurenames ('bodyType', 'gearbox', 'kilometer', 'v_0', 'v_3', 'v_7', 'v_14', 'used_time', 'price_average_GBYbrand', 'power_bin')
LiR_MSE: 15993321.471365392
LiR_R2: 0.7057326262665655
intercept: -480467.6143789641
coef: [('v_5', 547248.1399627327), ('v_6', 517106.21250813385), ('v_7', 497333.878927629), ('v_10', 365570.90980079107), ('v_11', 171543.6146836947), ('v_8', 164227.00112090845), ('v_9', 128578.71403340848), ('power', 48863.6068485829), ('v_4', 43508.82539409367), ('v_14', 19828.850095900943), ('price_average_GBYbrand', 10572.754737316918), ('brand_amount', 6968.85289671065), ('price_median_GBYbrand', 6595.631072990875), ('price_max_GBYbrand', 2237.7971368071658), ('price_std_GBYbrand', 956.376637996673), ('gearbox', 679.4055026736423), ('used_time', 387.4132818355945), ('power_bin', 291.5175148434141), ('bodyType', 217.02045635721151), ('model', -2.4899364779927495), ('city', -10.258028861593232), ('notRepairedDamage', -20.486887939604173), ('fuelType', -24.736780561186862), ('price_min_GBYbrand', -3762.1215956763376), ('kilometer', -4299.815762643461), ('price_sum_GBYbrand', -6953.314648619096), ('v_0', -67643.70870061051), ('v_2', -142475.32076890446), ('v_13', -148508.8116222008), ('v_3', -276643.4143410439), ('v_12', -303764.0882419921), ('v_1', -379287.1351181704)]
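The MSE, R2, intercept, and sorted coefficient list above are the standard outputs of a linear-regression baseline. The sketch below reproduces the same quantities with NumPy least squares on synthetic data. It stands in for whatever estimator produced the numbers above, and the feature names `f0` to `f2` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 3))                 # three scaled features (synthetic)
y = 2000 + X @ np.array([5000.0, -3000.0, 800.0]) + rng.normal(0, 50, 500)

# Hold out the last 20% of rows for validation.
split = int(0.8 * len(X))
X_tr, X_va, y_tr, y_va = X[:split], X[split:], y[:split], y[split:]

# Ordinary least squares with an explicit intercept column.
A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
intercept, coef = beta[0], beta[1:]

pred = intercept + X_va @ coef
mse = np.mean((y_va - pred) ** 2)
r2 = 1 - np.sum((y_va - pred) ** 2) / np.sum((y_va - y_va.mean()) ** 2)

# Coefficients sorted from largest to smallest, as in the listing above.
ranked = sorted(zip(["f0", "f1", "f2"], coef), key=lambda t: t[1], reverse=True)
print("LiR_MSE:", mse)
print("LiR_R2:", r2)
print("intercept:", intercept)
print("coef:", ranked)
```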
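# Take a small sample of the data and, for a single feature, compare the distribution of the model's predictions with that of the true labels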
model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | city | used_time | brand_amount | price_max_GBYbrand | price_median_GBYbrand | price_min_GBYbrand | price_sum_GBYbrand | price_std_GBYbrand | price_average_GBYbrand | power_bin |
172 | 0.153846154 | 1 | 0 | 0 | 0.003106877 | 0.827586207 | 0 | 1850 | 0.590595856 | 0.711260858 | 0.192329457 | 0.550783711 | 0.49208436 | 0.807556985 | 0.673547173 | 0.092209629 | 0.141900787 | 0.437465451 | 0.292047846 | 0.343037207 | 0.307345583 | 0.323443384 | 0.49071564 | 0 | 0.470440114 | 0.324362111 | 0.587029733 | 0.029269972 | 0.002063983 | 0.211594546 | 0.186944095 | 0.073847955 | 0 |
183 | 0.025641026 | 2 | 0 | 0 | 0 | 1 | 0 | 3600 | 0.679716245 | 0.820573785 | 0.196059042 | 0.505302185 | 0.262857081 | 0.907274422 | 0.799127704 | 0.09660986 | 0.165416289 | 0.092382489 | 0.19826575 | 0.314003614 | 0.366540781 | 0.158887319 | 0.446701089 | 3 | 0.511167068 | 0.438022306 | 0.998980422 | 0.191081267 | 0.004127967 | 0.734020942 | 0.399306567 | 0.234956097 | 4 |
19 | 0.384615385 | 1 | 0 | 0 | 0.008440348 | 0.827586207 | 0 | 6222 | 0.710517994 | 0.785077631 | 0.246326649 | 0.366413622 | 0.300846812 | 0.861471263 | 0.758899638 | 0.117548023 | 0.386668675 | 0.12152758 | 0.200762016 | 0.301993289 | 0.477060408 | 0.217050409 | 0.415429397 | 1 | 0.470111671 | 0.046042388 | 0.433578101 | 0.259986226 | 0.091847265 | 0.082280125 | 0.22063358 | 0.251439007 | 3 |
12 | 0.256410256 | 0 | 0 | 1 | 0.009993786 | 1 | 0 | 2400 | 0.69720671 | 0.756563426 | 0.188038118 | 0.476284942 | 0.190861388 | 0.939881251 | 0.728439962 | 0.086810868 | 0.207689175 | 0 | 0.216425071 | 0.280759589 | 0.389047154 | 0.112115708 | 0.399070505 | 8 | 0.770418218 | 0.452480061 | 0.979412764 | 0.153236915 | 0.004127967 | 0.692603009 | 0.382034156 | 0.2121201 | 3 |
14 | 0.128205128 | 1 | 0 | 0 | 0.003521127 | 0.310344828 | 0 | 5200 | 0.637534583 | 0.544686477 | 0.214532645 | 0.332976348 | 0.590557679 | 0.781377111 | 0.483458255 | 0.06539832 | 0.490197784 | 0.545516459 | 0.337834311 | 0.265369968 | 0.450057777 | 0.456712468 | 0.557057054 | 5 | 0.157981169 | 0.147945728 | 0.294544743 | 0.046487603 | 0.009287926 | 0.088308958 | 0.126353182 | 0.065143906 | 0 |
More updates to come……