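Tip: This is Task 3 (feature engineering) of the beginner's introduction to data mining. It walks through common feature engineering and analysis methods. Feedback and discussion are welcome.
Competition: 零基础入门数据挖掘 - 二手车交易价格预测 (Introduction to Data Mining for Beginners: Used Car Price Prediction)
URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX
Analyze the features further and process the data accordingly.
Complete the feature engineering analysis, summarize the data with charts or text, and check in.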
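The common feature engineering steps covered below include outlier handling, constructing new features from existing columns, and per-brand statistics.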
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter
%matplotlib inline
path = './data/'
train = pd.read_csv(path+'train.csv', sep=' ')
test = pd.read_csv(path+'testA.csv', sep=' ')
print(train.shape)
print(test.shape)
(150000, 31)
(50000, 30)
train.head()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | 0.0 | 1046 | 0 | 0 | 20160404 | 1850 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | 4366 | 0 | 0 | 20160309 | 3600 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | 0.0 | 2806 | 0 | 0 | 20160402 | 6222 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | 0.0 | 434 | 0 | 0 | 20160312 | 2400 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | 0.0 | 6977 | 0 | 0 | 20160313 | 5200 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
train.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
'v_13', 'v_14'],
dtype='object')
test.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'seller', 'offerType', 'creatDate', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4',
'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13',
'v_14'],
dtype='object')
# A reusable outlier-handling helper that can be called on any numeric column.
def outliers_proc(data, col_name, scale=3):
    """
    Remove outliers from one column using the box-plot (IQR) rule; scale=3 by default.
    :param data: pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiple of the IQR beyond the quartiles at which a value counts as an outlier
    :return: copy of data with the outlier rows dropped and the index reset
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers with the box-plot rule.
        :param data_ser: pandas Series
        :param box_scale: multiple of the IQR used for the bounds
        :return: (rule_low, rule_up) boolean masks and (val_low, val_up) bounds
        """
        # box_scale times the box height (the interquartile range)
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)  # boolean mask that can be used to filter rows
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()  # work on a copy of the data
    data_series = data_n[col_name]  # select the requested column
    rule, value = box_plot_outliers(data_series, box_scale=scale)  # outlier masks and bounds
    # Positions of the outliers; these equal the index labels only for a default RangeIndex
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)  # drop the outlier rows
    data_n.reset_index(drop=True, inplace=True)  # rows were removed, so reset the index
    print("Now column number is: {}".format(data_n.shape[0]))  # note: this is the remaining row count

    index_low = np.arange(data_series.shape[0])[rule[0]]  # positions of the low outliers
    outliers = data_series.iloc[index_low]  # values below the lower bound
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())  # summary of the low outliers

    index_up = np.arange(data_series.shape[0])[rule[1]]  # positions of the high outliers
    outliers = data_series.iloc[index_up]  # values above the upper bound
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())  # summary of the high outliers

    # Box plots before (left) and after (right) removing the outliers
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n
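To make the box-plot rule concrete, here is a minimal sketch on a toy Series. The helper name `iqr_bounds` and the sample values are illustrative assumptions, not part of the original notebook. With `scale=3`, a value is flagged only when it lies more than three box heights beyond the quartiles.

```python
import pandas as pd


def iqr_bounds(ser, scale=3):
    """Return (low, high) bounds: the quartiles widened by scale * IQR."""
    q1, q3 = ser.quantile(0.25), ser.quantile(0.75)
    whisker = scale * (q3 - q1)
    return q1 - whisker, q3 + whisker


# Hypothetical power readings; the last one mimics an extreme value
power = pd.Series([60, 75, 110, 150, 163, 193, 19312])

low, high = iqr_bounds(power, scale=3)
mask = (power < low) | (power > high)          # True marks an outlier
cleaned = power[~mask].reset_index(drop=True)  # keep the inliers only

print("bounds:", low, high)
print("kept:", cleaned.tolist())
```

A boolean mask like this filters rows directly, so it does not rely on the index being a default `RangeIndex`.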
pd.set_option('display.max_columns', 100 )
train.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 150000.000000 | 150000.000000 | 1.500000e+05 | 149999.000000 | 150000.000000 | 145494.000000 | 141320.000000 | 144019.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.0 | 1.500000e+05 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 |
mean | 74999.500000 | 68349.172873 | 2.003417e+07 | 47.129021 | 8.052733 | 1.792369 | 0.375842 | 0.224943 | 119.316547 | 12.597160 | 2583.077267 | 0.000007 | 0.0 | 2.016033e+07 | 5923.327333 | 44.406268 | -0.044809 | 0.080765 | 0.078833 | 0.017875 | 0.248204 | 0.044923 | 0.124692 | 0.058144 | 0.061996 | -0.001000 | 0.009035 | 0.004813 | 0.000313 | -0.000688 |
std | 43301.414527 | 61103.875095 | 5.364988e+04 | 49.536040 | 7.864956 | 1.760640 | 0.548677 | 0.417546 | 177.168419 | 3.919576 | 1885.363218 | 0.002582 | 0.0 | 1.067328e+02 | 7501.998477 | 2.457548 | 3.641893 | 2.929618 | 2.026514 | 1.193661 | 0.045804 | 0.051743 | 0.201410 | 0.029186 | 0.035692 | 3.772386 | 3.286071 | 2.517478 | 1.288988 | 1.038685 |
min | 0.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 0.000000 | 0.0 | 2.015062e+07 | 11.000000 | 30.451976 | -4.295589 | -4.470671 | -7.275037 | -4.364565 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.168192 | -5.558207 | -9.639552 | -4.153899 | -6.546556 |
25% | 37499.750000 | 11156.000000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | 1018.000000 | 0.000000 | 0.0 | 2.016031e+07 | 1300.000000 | 43.135799 | -3.192349 | -0.970671 | -1.462580 | -0.921191 | 0.243615 | 0.000038 | 0.062474 | 0.035334 | 0.033930 | -3.722303 | -1.951543 | -1.871846 | -1.057789 | -0.437034 |
50% | 74999.500000 | 51638.000000 | 2.003091e+07 | 30.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 110.000000 | 15.000000 | 2196.000000 | 0.000000 | 0.0 | 2.016032e+07 | 3250.000000 | 44.610266 | -3.052671 | -0.382947 | 0.099722 | -0.075910 | 0.257798 | 0.000812 | 0.095866 | 0.057014 | 0.058484 | 1.624076 | -0.358053 | -0.130753 | -0.036245 | 0.141246 |
75% | 112499.250000 | 118841.250000 | 2.007111e+07 | 66.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | 3843.000000 | 0.000000 | 0.0 | 2.016033e+07 | 7700.000000 | 46.004721 | 4.000670 | 0.241335 | 1.565838 | 0.868758 | 0.265297 | 0.102009 | 0.125243 | 0.079382 | 0.087491 | 2.844357 | 1.255022 | 1.776933 | 0.942813 | 0.680378 |
max | 149999.000000 | 196812.000000 | 2.015121e+07 | 247.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 19312.000000 | 15.000000 | 8120.000000 | 1.000000 | 0.0 | 2.016041e+07 | 99999.000000 | 52.304178 | 7.320308 | 19.035496 | 9.854702 | 6.829352 | 0.291838 | 0.151420 | 1.404936 | 0.160791 | 0.222787 | 12.357011 | 18.819042 | 13.847792 | 11.147669 | 8.658418 |
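# We can drop some anomalous rows, taking power as an example.
# Whether to drop them is your own call.
# Note that rows in the test set must never be dropped; that would only be fooling ourselves.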
fig, ax = plt.subplots(1,1, figsize=(10, 7))
#train.describe()
#sns.boxplot(y=data['power'],data=train, palette="Set1", ax=ax)
train = outliers_proc(train, 'power', scale=3)
pd.set_option('display.max_columns', 100 )
train.describe()
Delete number is: 963
Now column number is: 149037
Description of data less than the lower bound is:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: power, dtype: float64
Description of data larger than the upper bound is:
count 963.000000
mean 846.836968
std 1929.418081
min 376.000000
25% 400.000000
50% 436.000000
75% 514.000000
max 19312.000000
Name: power, dtype: float64
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 149037.000000 | 149037.000000 | 1.490370e+05 | 149036.000000 | 149037.000000 | 144543.000000 | 140405.000000 | 143083.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.0 | 1.490370e+05 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 | 149037.000000 |
mean | 75000.810040 | 68266.301730 | 2.003396e+07 | 46.969712 | 8.028973 | 1.785503 | 0.377380 | 0.221564 | 114.615686 | 12.611959 | 2583.189544 | 0.000007 | 0.0 | 2.016033e+07 | 5759.707328 | 44.386358 | -0.040915 | 0.077332 | 0.095597 | 0.024832 | 0.248081 | 0.044970 | 0.124692 | 0.057905 | 0.062213 | 0.004712 | 0.022959 | -0.017478 | 0.006048 | -0.000010 |
std | 43312.158963 | 61114.029665 | 5.361493e+04 | 49.347667 | 7.845709 | 1.754134 | 0.548392 | 0.415300 | 64.189762 | 3.909222 | 1885.675848 | 0.002590 | 0.0 | 1.070463e+02 | 6998.871286 | 2.445414 | 3.639592 | 2.934615 | 2.016049 | 1.191609 | 0.045855 | 0.051710 | 0.201773 | 0.029033 | 0.035631 | 3.771117 | 3.284766 | 2.502568 | 1.288631 | 1.034682 |
min | 0.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.500000 | 0.000000 | 0.000000 | 0.0 | 2.015062e+07 | 11.000000 | 30.451976 | -4.295589 | -4.470671 | -7.275037 | -4.364565 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -8.798810 | -5.403044 | -9.639552 | -4.153899 | -6.546556 |
25% | 37485.000000 | 11093.000000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 75.000000 | 12.500000 | 1018.000000 | 0.000000 | 0.0 | 2.016031e+07 | 1300.000000 | 43.127305 | -3.191726 | -0.973892 | -1.438553 | -0.913525 | 0.243519 | 0.000035 | 0.062308 | 0.035211 | 0.034192 | -3.720560 | -1.938099 | -1.880377 | -1.054625 | -0.435075 |
50% | 74985.000000 | 51489.000000 | 2.003091e+07 | 30.000000 | 6.000000 | 1.000000 | 0.000000 | 0.000000 | 109.000000 | 15.000000 | 2196.000000 | 0.000000 | 0.0 | 2.016032e+07 | 3200.000000 | 44.595651 | -3.051661 | -0.388056 | 0.114803 | -0.067013 | 0.257691 | 0.000807 | 0.095763 | 0.056789 | 0.058789 | 1.637443 | -0.347436 | -0.148725 | -0.028004 | 0.140485 |
75% | 112532.000000 | 118779.000000 | 2.007110e+07 | 66.000000 | 13.000000 | 3.000000 | 1.000000 | 0.000000 | 150.000000 | 15.000000 | 3843.000000 | 0.000000 | 0.0 | 2.016033e+07 | 7500.000000 | 45.979786 | 4.000020 | 0.234213 | 1.573818 | 0.873620 | 0.265204 | 0.101998 | 0.125148 | 0.079051 | 0.087643 | 2.849450 | 1.262799 | 1.747968 | 0.947472 | 0.678217 |
max | 149999.000000 | 196812.000000 | 2.015121e+07 | 247.000000 | 39.000000 | 7.000000 | 6.000000 | 1.000000 | 375.000000 | 15.000000 | 8120.000000 | 1.000000 | 0.0 | 2.016041e+07 | 99999.000000 | 52.304178 | 7.320308 | 19.035496 | 9.854702 | 6.829352 | 0.291838 | 0.151420 | 1.404936 | 0.160791 | 0.222787 | 12.357011 | 18.819042 | 13.847792 | 11.147669 | 8.658418 |
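# Combine the training and test sets so that features can be constructed on both at once
train['train']=1
test['train']=0
data = pd.concat([train, test], ignore_index=True, sort=False)
# DataFrame.append was removed in pandas 2.0, so use pd.concat to show the first and last rows
pd.concat([data.head(), data.tail()])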
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | 0.0 | 1046 | 0 | 0 | 20160404 | 1850.0 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 | 1 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | 4366 | 0 | 0 | 20160309 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | 0.0 | 2806 | 0 | 0 | 20160402 | 6222.0 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 | 1 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | 0.0 | 434 | 0 | 0 | 20160312 | 2400.0 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 | 1 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | 0.0 | 6977 | 0 | 0 | 20160313 | 5200.0 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 | 1 |
199032 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | 0.0 | 3219 | 0 | 0 | 20160320 | NaN | 45.621391 | 5.958453 | -0.918571 | 0.774826 | -2.021739 | 0.284664 | 0.130044 | 0.049833 | 0.028807 | 0.004616 | -5.978511 | 1.303174 | -1.207191 | -1.981240 | -0.357695 | 0 |
199033 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | 0.0 | 1857 | 0 | 0 | 20160329 | NaN | 43.935162 | 4.476841 | -0.841710 | 1.328253 | -1.292675 | 0.268101 | 0.108095 | 0.066039 | 0.025468 | 0.025971 | -3.913825 | 1.759524 | -2.075658 | -1.154847 | 0.169073 | 0 |
199034 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | 0.0 | 3452 | 0 | 0 | 20160305 | NaN | 46.537137 | 4.170806 | 0.388595 | -0.704689 | -1.480710 | 0.269432 | 0.105724 | 0.117652 | 0.057479 | 0.015669 | -4.639065 | 0.654713 | 1.137756 | -1.390531 | 0.254420 | 0 |
199035 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | 0.0 | 1998 | 0 | 0 | 20160404 | NaN | 46.771359 | -3.296814 | 0.243566 | -1.277411 | -0.404881 | 0.261152 | 0.000490 | 0.137366 | 0.086216 | 0.051383 | 1.833504 | -2.828687 | 2.465630 | -0.911682 | -2.057353 | 0 |
199036 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | 0.0 | 3276 | 0 | 0 | 20160322 | NaN | 43.731010 | -3.121867 | 0.027348 | -0.808914 | 2.116551 | 0.228730 | 0.000300 | 0.103534 | 0.080625 | 0.124264 | 2.914571 | -1.135270 | 0.547628 | 2.094057 | -1.552150 | 0 |
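The flag column makes it possible to separate the two sets again once the features are built. The following is a minimal sketch on toy frames; the `*_demo` and `*_back` names and the values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up)
train_demo = pd.DataFrame({"SaleID": [0, 1, 2], "power": [60, 0, 163],
                           "price": [1850, 3600, 6222]})
test_demo = pd.DataFrame({"SaleID": [3, 4], "power": [193, 68]})

# Mark the origin of each row before stacking the frames
train_demo["train"] = 1
test_demo["train"] = 0
combined = pd.concat([train_demo, test_demo], ignore_index=True, sort=False)

# ... feature construction on `combined` would go here ...

# Split back by the flag; test rows carry NaN in price because test has no such column
train_back = combined[combined["train"] == 1].reset_index(drop=True)
test_back = (combined[combined["train"] == 0]
             .drop(columns=["price"])
             .reset_index(drop=True))

print(train_back.shape, test_back.shape)
```

This is also why `price` shows `NaN` for the last rows in the table above: those rows come from the test set.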
# Usage time: data['creatDate'] - data['regDate'] gives how long the car has been in use; in general, the longer the usage time, the lower the price
# Note that some dates in the data are malformed, so we need errors='coerce' to turn them into NaT instead of raising
data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -
                     pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
#data['test1']=pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce')
#data['test2']=pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')
#data['test']=(pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce'))
# DataFrame.append was removed in pandas 2.0, so use pd.concat to show the first and last rows
pd.concat([data.head(), data.tail()])
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | used_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | 0.0 | 1046 | 0 | 0 | 20160404 | 1850.0 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 | 1 | 4385.0 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | 4366 | 0 | 0 | 20160309 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 | 4757.0 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | 0.0 | 2806 | 0 | 0 | 20160402 | 6222.0 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 | 1 | 4382.0 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | 0.0 | 434 | 0 | 0 | 20160312 | 2400.0 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 | 1 | 7125.0 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | 0.0 | 6977 | 0 | 0 | 20160313 | 5200.0 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 | 1 | 1531.0 |
199032 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | 0.0 | 3219 | 0 | 0 | 20160320 | NaN | 45.621391 | 5.958453 | -0.918571 | 0.774826 | -2.021739 | 0.284664 | 0.130044 | 0.049833 | 0.028807 | 0.004616 | -5.978511 | 1.303174 | -1.207191 | -1.981240 | -0.357695 | 0 | 7261.0 |
199033 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | 0.0 | 1857 | 0 | 0 | 20160329 | NaN | 43.935162 | 4.476841 | -0.841710 | 1.328253 | -1.292675 | 0.268101 | 0.108095 | 0.066039 | 0.025468 | 0.025971 | -3.913825 | 1.759524 | -2.075658 | -1.154847 | 0.169073 | 0 | 6014.0 |
199034 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | 0.0 | 3452 | 0 | 0 | 20160305 | NaN | 46.537137 | 4.170806 | 0.388595 | -0.704689 | -1.480710 | 0.269432 | 0.105724 | 0.117652 | 0.057479 | 0.015669 | -4.639065 | 0.654713 | 1.137756 | -1.390531 | 0.254420 | 0 | 4345.0 |
199035 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | 0.0 | 1998 | 0 | 0 | 20160404 | NaN | 46.771359 | -3.296814 | 0.243566 | -1.277411 | -0.404881 | 0.261152 | 0.000490 | 0.137366 | 0.086216 | 0.051383 | 1.833504 | -2.828687 | 2.465630 | -0.911682 | -2.057353 | 0 | NaN |
199036 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | 0.0 | 3276 | 0 | 0 | 20160322 | NaN | 43.731010 | -3.121867 | 0.027348 | -0.808914 | 2.116551 | 0.228730 | 0.000300 | 0.103534 | 0.080625 | 0.124264 | 2.914571 | -1.135270 | 0.547628 | 2.094057 | -1.552150 | 0 | 4151.0 |
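The effect of `errors='coerce'` can be seen on two rows taken from the table above: `regDate` 20040402 parses normally, while 20020008 has month `00` and cannot be parsed. This is a small sketch; converting to strings first is a defensive choice made here, not something the original code does.

```python
import pandas as pd

dates = pd.DataFrame({
    "regDate":   [20040402, 20020008],  # the second value has month "00", which is invalid
    "creatDate": [20160404, 20160404],
})

# Cast to str so the format is matched against the eight digits as text
reg = pd.to_datetime(dates["regDate"].astype(str), format="%Y%m%d", errors="coerce")
created = pd.to_datetime(dates["creatDate"].astype(str), format="%Y%m%d", errors="coerce")

# Invalid dates become NaT, so the difference in days becomes NaN for those rows
used_days = (created - reg).dt.days

print(reg.isna().tolist())
print(used_days.tolist())
```

Without `errors='coerce'`, the invalid date would raise an exception instead of producing a missing value.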
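#data['test'] = data['test'].fillna(data['test'].mean())  # this seems to be the right call
# Fill the missing usage times with the column mean
data['used_time'] = data['used_time'].fillna(data['used_time'].mean())
# DataFrame.append was removed in pandas 2.0, so use pd.concat to show the first and last rows
pd.concat([data.head(), data.tail()])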
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | used_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | 0.0 | 1046 | 0 | 0 | 20160404 | 1850.0 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 | 1 | 4385.000000 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | 4366 | 0 | 0 | 20160309 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 | 4757.000000 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | 0.0 | 2806 | 0 | 0 | 20160402 | 6222.0 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 | 1 | 4382.000000 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | 0.0 | 434 | 0 | 0 | 20160312 | 2400.0 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 | 1 | 7125.000000 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | 0.0 | 6977 | 0 | 0 | 20160313 | 5200.0 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 | 1 | 1531.000000 |
199032 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | 0.0 | 3219 | 0 | 0 | 20160320 | NaN | 45.621391 | 5.958453 | -0.918571 | 0.774826 | -2.021739 | 0.284664 | 0.130044 | 0.049833 | 0.028807 | 0.004616 | -5.978511 | 1.303174 | -1.207191 | -1.981240 | -0.357695 | 0 | 7261.000000 |
199033 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | 0.0 | 1857 | 0 | 0 | 20160329 | NaN | 43.935162 | 4.476841 | -0.841710 | 1.328253 | -1.292675 | 0.268101 | 0.108095 | 0.066039 | 0.025468 | 0.025971 | -3.913825 | 1.759524 | -2.075658 | -1.154847 | 0.169073 | 0 | 6014.000000 |
199034 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | 0.0 | 3452 | 0 | 0 | 20160305 | NaN | 46.537137 | 4.170806 | 0.388595 | -0.704689 | -1.480710 | 0.269432 | 0.105724 | 0.117652 | 0.057479 | 0.015669 | -4.639065 | 0.654713 | 1.137756 | -1.390531 | 0.254420 | 0 | 4345.000000 |
199035 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | 0.0 | 1998 | 0 | 0 | 20160404 | NaN | 46.771359 | -3.296814 | 0.243566 | -1.277411 | -0.404881 | 0.261152 | 0.000490 | 0.137366 | 0.086216 | 0.051383 | 1.833504 | -2.828687 | 2.465630 | -0.911682 | -2.057353 | 0 | 4441.030582 |
199036 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | 0.0 | 3276 | 0 | 0 | 20160322 | NaN | 43.731010 | -3.121867 | 0.027348 | -0.808914 | 2.116551 | 0.228730 | 0.000300 | 0.103534 | 0.080625 | 0.124264 | 2.914571 | -1.135270 | 0.547628 | 2.094057 | -1.552150 | 0 | 4151.000000 |
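# Check for missing values. About 15k samples have malformed dates, so used_time was NaN for them before the mean fill above.
# Dropping those rows is not recommended: they make up too large a share of the samples, about 7.5%.
# An alternative to filling is to leave the NaN values in place, because tree-based models such as XGBoost can handle missing values on their own.
# Since the column has already been filled with the mean, the count below is 0.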
data['used_time'].isnull().sum()
0
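To see how many values were actually missing, the count has to be taken before the fill. A minimal sketch on a toy Series (the values are borrowed from the first rows of the table, the NaN is inserted for illustration):

```python
import numpy as np
import pandas as pd

# Toy usage times with one missing entry
used_time = pd.Series([4385.0, 4757.0, np.nan, 1531.0])

n_missing_before = int(used_time.isnull().sum())

# Series.mean() skips NaN by default, so the fill value is the mean of the valid entries
filled = used_time.fillna(used_time.mean())
n_missing_after = int(filled.isnull().sum())

print(n_missing_before, n_missing_after)
print(filled.tolist())
```

If the NaN values are meant to be left for a tree model to handle, the `fillna` step would simply be skipped.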
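# Extract city information from the region code. Since the data comes from Germany, this follows the German postal code scheme, which amounts to adding prior knowledge.
data['city'] = data['regionCode'].apply(lambda x: int(str(x)[:-3]) if len(str(x)) > 3 else 0)
data.info()
# DataFrame.append was removed in pandas 2.0, so use pd.concat to show the first and last rows
pd.concat([data.head(), data.tail()])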
RangeIndex: 199037 entries, 0 to 199036
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 199037 non-null int64
1 name 199037 non-null int64
2 regDate 199037 non-null int64
3 model 199036 non-null float64
4 brand 199037 non-null int64
5 bodyType 193130 non-null float64
6 fuelType 187512 non-null float64
7 gearbox 191173 non-null float64
8 power 199037 non-null int64
9 kilometer 199037 non-null float64
10 notRepairedDamage 199037 non-null object
11 regionCode 199037 non-null int64
12 seller 199037 non-null int64
13 offerType 199037 non-null int64
14 creatDate 199037 non-null int64
15 price 149037 non-null float64
16 v_0 199037 non-null float64
17 v_1 199037 non-null float64
18 v_2 199037 non-null float64
19 v_3 199037 non-null float64
20 v_4 199037 non-null float64
21 v_5 199037 non-null float64
22 v_6 199037 non-null float64
23 v_7 199037 non-null float64
24 v_8 199037 non-null float64
25 v_9 199037 non-null float64
26 v_10 199037 non-null float64
27 v_11 199037 non-null float64
28 v_12 199037 non-null float64
29 v_13 199037 non-null float64
30 v_14 199037 non-null float64
31 train 199037 non-null int64
32 used_time 199037 non-null float64
33 city 199037 non-null int64
dtypes: float64(22), int64(11), object(1)
memory usage: 51.6+ MB
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | used_time | city |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | 0.0 | 1046 | 0 | 0 | 20160404 | 1850.0 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 | 1 | 4385.000000 | 1 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | 4366 | 0 | 0 | 20160309 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 | 4757.000000 | 4 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | 0.0 | 2806 | 0 | 0 | 20160402 | 6222.0 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 | 1 | 4382.000000 | 2 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | 0.0 | 434 | 0 | 0 | 20160312 | 2400.0 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 | 1 | 7125.000000 | 0 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | 0.0 | 6977 | 0 | 0 | 20160313 | 5200.0 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 | 1 | 1531.000000 | 6 |
199032 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | 0.0 | 3219 | 0 | 0 | 20160320 | NaN | 45.621391 | 5.958453 | -0.918571 | 0.774826 | -2.021739 | 0.284664 | 0.130044 | 0.049833 | 0.028807 | 0.004616 | -5.978511 | 1.303174 | -1.207191 | -1.981240 | -0.357695 | 0 | 7261.000000 | 3 |
199033 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | 0.0 | 1857 | 0 | 0 | 20160329 | NaN | 43.935162 | 4.476841 | -0.841710 | 1.328253 | -1.292675 | 0.268101 | 0.108095 | 0.066039 | 0.025468 | 0.025971 | -3.913825 | 1.759524 | -2.075658 | -1.154847 | 0.169073 | 0 | 6014.000000 | 1 |
199034 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | 0.0 | 3452 | 0 | 0 | 20160305 | NaN | 46.537137 | 4.170806 | 0.388595 | -0.704689 | -1.480710 | 0.269432 | 0.105724 | 0.117652 | 0.057479 | 0.015669 | -4.639065 | 0.654713 | 1.137756 | -1.390531 | 0.254420 | 0 | 4345.000000 | 3 |
199035 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | 0.0 | 1998 | 0 | 0 | 20160404 | NaN | 46.771359 | -3.296814 | 0.243566 | -1.277411 | -0.404881 | 0.261152 | 0.000490 | 0.137366 | 0.086216 | 0.051383 | 1.833504 | -2.828687 | 2.465630 | -0.911682 | -2.057353 | 0 | 4441.030582 | 1 |
199036 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | 0.0 | 3276 | 0 | 0 | 20160322 | NaN | 43.731010 | -3.121867 | 0.027348 | -0.808914 | 2.116551 | 0.228730 | 0.000300 | 0.103534 | 0.080625 | 0.124264 | 2.914571 | -1.135270 | 0.547628 | 2.094057 | -1.552150 | 0 | 4151.000000 | 3 |
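The city rule keeps whatever precedes the last three digits of `regionCode`, and falls back to 0 when the code has three digits or fewer. The sketch below applies it to the five region codes from the first rows of the table. The integer-division variant is an equivalent shortcut suggested here, valid only for non-negative integer codes.

```python
import pandas as pd

# Region codes of the first five rows shown above
region = pd.Series([1046, 4366, 2806, 434, 6977])

# Same rule as the lambda in the notebook: strip the last three digits, else 0
city_lambda = region.apply(lambda x: int(str(x)[:-3]) if len(str(x)) > 3 else 0)

# Equivalent for non-negative integers: integer division by 1000
city_floor = region // 1000

print(city_lambda.tolist())
print(city_floor.tolist())
```

Both variants give the same `city` values as the last column of the table for these rows.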
#train.groupby("brand")["price"].describe()
train.groupby("brand").describe()
SaleID | name | regDate | model | bodyType | fuelType | gearbox | ... | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
brand | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
0 | 31429.0 | 75135.898979 | 43216.003175 | 14.0 | 37611.00 | 75259.0 | 112411.00 | 149988.0 | 31429.0 | 68280.543002 | 61481.259318 | 0.0 | 10298.00 | 51720.0 | 119576.00 | 196811.0 | 31429.0 | 2.002938e+07 | 57745.773061 | 19910001.0 | 19981111.00 | 20030204.0 | 20071101.00 | 20151208.0 | 31429.0 | 26.238124 | 34.731275 | 0.0 | 0.0 | 8.0 | 44.0 | 230.0 | 30295.0 | 1.612642 | 1.576731 | 0.0 | 0.0 | 1.0 | 3.0 | 7.0 | 29572.0 | 0.425064 | 0.534715 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 30017.0 | 0.143985 | ... | 0.070044 | 0.181968 | 31429.0 | 0.202655 | 3.819637 | -8.251269 | -3.755349 | 1.812951 | 2.969662 | 12.285299 | 31429.0 | 0.130102 | 3.414541 | -5.114536 | -1.856988 | -0.342067 | 1.237629 | 18.379089 | 31429.0 | 0.010158 | 2.442354 | -8.679290 | -1.839407 | -0.197639 | 1.741408 | 12.157611 | 31429.0 | -0.226931 | 0.860948 | -2.125022 | -1.041485 | -0.223171 | 0.366677 | 2.289283 | 31429.0 | 0.258134 | 0.921088 | -3.960301 | -0.080196 | 0.288050 | 0.872530 | 2.300021 | 31429.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
1 | 13656.0 | 74957.974297 | 43258.259449 | 1.0 | 37470.00 | 74582.5 | 112659.50 | 149994.0 | 13656.0 | 62886.934827 | 61452.088129 | 45.0 | 6693.00 | 40935.0 | 113516.50 | 196781.0 | 13656.0 | 2.004154e+07 | 57048.978974 | 19910001.0 | 20000102.00 | 20050111.0 | 20081104.00 | 20151210.0 | 13656.0 | 53.608890 | 29.804153 | 1.0 | 40.0 | 49.0 | 65.0 | 247.0 | 13359.0 | 1.591586 | 1.634055 | 0.0 | 0.0 | 2.0 | 2.0 | 7.0 | 13043.0 | 0.524649 | 0.531108 | 0.0 | 0.0 | 1.0 | 1.0 | 6.0 | 13210.0 | 0.340424 | ... | 0.043838 | 0.193907 | 13656.0 | -0.931170 | 3.883604 | -8.051120 | -4.793600 | 0.988510 | 2.287106 | 12.103094 | 13656.0 | -0.416706 | 2.959515 | -5.161408 | -2.353655 | -0.733364 | 1.107900 | 18.408500 | 13656.0 | 1.170914 | 2.357446 | -5.969378 | -0.683091 | 1.265707 | 2.820971 | 12.973057 | 13656.0 | -1.061234 | 0.851262 | -2.521308 | -1.719965 | -1.109461 | -0.533658 | 4.907355 | 13656.0 | -0.049340 | 0.875329 | -3.683050 | -0.452126 | 0.086691 | 0.566010 | 2.021592 | 13656.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
2 | 318.0 | 78087.817610 | 43396.063117 | 1460.0 | 39411.25 | 80894.5 | 116589.50 | 148937.0 | 318.0 | 78339.481132 | 55498.550938 | 855.0 | 29776.00 | 70464.5 | 117160.00 | 196276.0 | 318.0 | 2.004285e+07 | 62012.317580 | 19910304.0 | 20000028.00 | 20050158.5 | 20090405.25 | 20151112.0 | 318.0 | 84.937107 | 96.134973 | 1.0 | 2.0 | 19.0 | 197.0 | 207.0 | 314.0 | 5.990446 | 0.169300 | 3.0 | 6.0 | 6.0 | 6.0 | 6.0 | 306.0 | 0.725490 | 0.619204 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 312.0 | 0.753205 | ... | 0.063097 | 0.132520 | 318.0 | -1.465617 | 3.744038 | -7.448232 | -5.287060 | 0.740907 | 1.678765 | 11.312121 | 318.0 | -1.997598 | 2.749805 | -5.040389 | -3.657794 | -2.923828 | -0.379690 | 17.834444 | 318.0 | 1.534046 | 1.989458 | -6.188436 | 0.151433 | 1.262676 | 2.837065 | 8.071152 | 318.0 | -0.528789 | 0.664283 | -1.792006 | -1.012964 | -0.696783 | -0.122290 | 1.082124 | 318.0 | -1.543233 | 1.369426 | -4.953890 | -2.186592 | -1.535352 | -0.716360 | 0.996607 | 318.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
3 | 2461.0 | 76863.838683 | 43479.432350 | 25.0 | 38348.00 | 78782.0 | 115192.00 | 149975.0 | 2461.0 | 70155.264120 | 59515.861583 | 4.0 | 15034.00 | 56292.0 | 116235.00 | 196716.0 | 2461.0 | 2.006858e+07 | 44660.632387 | 19920107.0 | 20040206.00 | 20070705.0 | 20100910.00 | 20151208.0 | 2461.0 | 58.330760 | 50.853115 | 1.0 | 3.0 | 87.0 | 87.0 | 193.0 | 2422.0 | 1.678778 | 1.275721 | 0.0 | 1.0 | 2.0 | 2.0 | 7.0 | 2374.0 | 0.396799 | 0.512061 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 2413.0 | 0.120182 | ... | 0.102473 | 0.190810 | 2461.0 | -0.587122 | 3.517722 | -7.081028 | -4.048563 | 1.250506 | 2.386922 | 11.367486 | 2461.0 | -0.329133 | 2.608295 | -4.698429 | -2.096946 | -0.678162 | 1.083719 | 17.760794 | 2461.0 | 1.075896 | 2.009365 | -5.889687 | -0.304789 | 1.022185 | 2.460299 | 10.826623 | 2461.0 | 0.920820 | 0.938277 | -1.944913 | 0.419845 | 0.990531 | 1.442011 | 3.774539 | 2461.0 | -1.307435 | 1.164774 | -3.975906 | -2.162526 | -1.171653 | -0.430467 | 1.954672 | 2461.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
4 | 16575.0 | 74424.163801 | 43141.145888 | 6.0 | 37016.50 | 74228.0 | 111890.00 | 149986.0 | 16575.0 | 64769.137376 | 62658.750071 | 18.0 | 5555.00 | 45038.0 | 117141.50 | 196812.0 | 16575.0 | 2.003306e+07 | 53330.969493 | 19910001.0 | 19990706.00 | 20040012.0 | 20071009.00 | 20151212.0 | 16575.0 | 18.339970 | 30.471381 | 1.0 | 4.0 | 4.0 | 13.0 | 245.0 | 16210.0 | 1.648982 | 1.986454 | 0.0 | 0.0 | 0.0 | 2.0 | 7.0 | 15762.0 | 0.440426 | 0.541446 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 16130.0 | 0.356293 | ... | 0.036444 | 0.138871 | 16575.0 | -0.644018 | 3.854532 | -8.192090 | -4.906936 | 1.398704 | 2.331440 | 12.010943 | 16575.0 | -0.641841 | 2.964420 | -5.088259 | -2.582502 | -1.397307 | 1.004037 | 18.341317 | 16575.0 | 0.968379 | 2.217971 | -7.882603 | -0.743230 | 0.931882 | 2.488241 | 13.083661 | 16575.0 | -1.268478 | 0.589530 | -2.908192 | -1.820144 | -1.293616 | -0.770327 | 1.549990 | 16575.0 | -0.000511 | 0.672059 | -4.476282 | -0.350878 | 0.005552 | 0.455194 | 2.166396 | 16575.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
5 | 4662.0 | 74953.461819 | 43625.801338 | 4.0 | 37394.75 | 76202.5 | 112649.75 | 149969.0 | 4662.0 | 66926.721793 | 60586.323126 | 6.0 | 10626.50 | 48965.0 | 117049.50 | 196770.0 | 4662.0 | 2.003742e+07 | 44883.289281 | 19910008.0 | 20010004.00 | 20040206.5 | 20070404.50 | 20150907.0 | 4662.0 | 25.874517 | 40.397238 | 1.0 | 5.0 | 5.0 | 19.0 | 117.0 | 4542.0 | 1.999339 | 1.520976 | 0.0 | 1.0 | 1.0 | 4.0 | 7.0 | 4386.0 | 0.262882 | 0.482746 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 4507.0 | 0.060794 | ... | 0.093820 | 0.173234 | 4662.0 | 0.296861 | 3.435354 | -6.538353 | -3.091683 | 1.859932 | 3.009183 | 12.148416 | 4662.0 | 0.341170 | 3.056326 | -4.226201 | -1.461243 | 0.024974 | 1.517538 | 18.630773 | 4662.0 | -0.890986 | 2.047665 | -7.233860 | -2.242361 | -1.035290 | 0.325174 | 11.369898 | 4662.0 | 0.835170 | 0.803478 | -1.567229 | 0.421948 | 0.914513 | 1.355521 | 3.062267 | 4662.0 | 0.413911 | 0.571758 | -2.512437 | 0.074772 | 0.404666 | 0.816989 | 2.228885 | 4662.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
6 | 10193.0 | 74865.558030 | 43552.630106 | 0.0 | 37048.00 | 74727.0 | 112624.00 | 149990.0 | 10193.0 | 67759.671049 | 61047.953072 | 7.0 | 10249.00 | 50267.0 | 117899.00 | 196763.0 | 10193.0 | 2.003343e+07 | 49126.243235 | 19910004.0 | 20000005.00 | 20030504.0 | 20070307.00 | 20151210.0 | 10193.0 | 50.601001 | 32.602429 | 1.0 | 30.0 | 46.0 | 69.0 | 236.0 | 9823.0 | 1.739998 | 1.477240 | 0.0 | 1.0 | 1.0 | 2.0 | 7.0 | 9451.0 | 0.331605 | 0.507588 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 | 9741.0 | 0.079355 | ... | 0.088849 | 0.174175 | 10193.0 | 0.453570 | 3.751521 | -7.417392 | -3.318933 | 2.097469 | 3.290343 | 12.357011 | 10193.0 | 0.647936 | 3.466200 | -4.644868 | -1.297772 | 0.104708 | 1.730303 | 18.765443 | 10193.0 | -1.049689 | 2.415648 | -8.293423 | -2.783272 | -1.408243 | 0.606464 | 11.741638 | 10193.0 | 0.503864 | 0.827696 | -1.663949 | -0.018258 | 0.465792 | 1.047989 | 3.503605 | 10193.0 | 0.562719 | 0.793285 | -4.100143 | 0.163964 | 0.617502 | 1.076853 | 2.467846 | 10193.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
7 | 2360.0 | 75250.205932 | 43678.252285 | 9.0 | 37552.25 | 74878.0 | 113342.25 | 149850.0 | 2360.0 | 69299.372034 | 58127.088681 | 9.0 | 17252.75 | 51459.5 | 116960.75 | 196742.0 | 2360.0 | 2.002641e+07 | 50934.788003 | 19910002.0 | 19990307.75 | 20030005.0 | 20060903.00 | 20151201.0 | 2360.0 | 67.324576 | 58.199505 | 1.0 | 7.0 | 78.0 | 90.0 | 195.0 | 2293.0 | 2.193197 | 1.888402 | 0.0 | 0.0 | 2.0 | 4.0 | 7.0 | 2243.0 | 0.236737 | 0.472855 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2278.0 | 0.056190 | ... | 0.097156 | 0.154929 | 2360.0 | 0.043372 | 3.698984 | -7.335358 | -3.682216 | 1.797291 | 2.898247 | 11.363643 | 2360.0 | -0.079306 | 3.112975 | -4.762878 | -1.830181 | -0.447594 | 1.017011 | 18.478554 | 2360.0 | -0.608090 | 2.274225 | -6.211157 | -2.247694 | -0.774330 | 0.897483 | 12.433202 | 2360.0 | 0.569155 | 0.857917 | -1.487128 | 0.167047 | 0.624866 | 1.245085 | 2.403954 | 2360.0 | -0.725277 | 0.844597 | -3.576179 | -1.277176 | -0.659159 | -0.143988 | 1.800145 | 2360.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
8 | 2070.0 | 74586.202415 | 44205.548259 | 120.0 | 35745.75 | 76107.5 | 111959.50 | 149992.0 | 2070.0 | 72592.696618 | 58963.388886 | 13.0 | 18045.00 | 59220.0 | 120232.75 | 196568.0 | 2070.0 | 2.003449e+07 | 56140.318328 | 19910002.0 | 19990807.50 | 20030605.5 | 20080482.50 | 20151111.0 | 2070.0 | 80.477778 | 62.427835 | 1.0 | 32.0 | 32.0 | 129.0 | 204.0 | 2007.0 | 2.112606 | 2.169138 | 0.0 | 1.0 | 1.0 | 3.0 | 7.0 | 1936.0 | 0.264979 | 0.478513 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 | 1999.0 | 0.118559 | ... | 0.126626 | 0.182241 | 2070.0 | 0.497674 | 3.536123 | -7.035689 | -2.837964 | 1.518881 | 3.211395 | 11.426592 | 2070.0 | 0.052955 | 3.427019 | -4.865708 | -1.648288 | -0.278306 | 1.369966 | 18.504017 | 2070.0 | -0.549322 | 2.654042 | -7.585131 | -2.729013 | -1.041358 | 1.662843 | 9.340888 | 2070.0 | 1.136608 | 1.208165 | -1.755694 | 0.667204 | 1.377709 | 1.960064 | 3.639345 | 2070.0 | -0.559776 | 1.068399 | -4.548068 | -1.071647 | -0.407986 | 0.125283 | 1.869232 | 2070.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
9 | 7299.0 | 75216.050418 | 43212.153514 | 10.0 | 38410.50 | 74754.0 | 112200.50 | 149987.0 | 7299.0 | 70244.821619 | 60841.985920 | 14.0 | 13946.00 | 56217.0 | 119169.50 | 196713.0 | 7299.0 | 2.002628e+07 | 45428.588454 | 19910003.0 | 19990809.00 | 20020603.0 | 20051004.50 | 20151208.0 | 7299.0 | 42.144266 | 36.923623 | 1.0 | 10.0 | 22.0 | 66.0 | 123.0 | 7019.0 | 1.718906 | 1.372095 | 0.0 | 1.0 | 1.0 | 3.0 | 7.0 | 6722.0 | 0.235347 | 0.522330 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 6973.0 | 0.073139 | ... | 0.111572 | 0.184351 | 7299.0 | 0.933918 | 3.472885 | -6.263700 | -2.706162 | 2.477623 | 3.492634 | 11.894036 | 7299.0 | 0.854147 | 3.520184 | -4.321031 | -1.089517 | 0.122418 | 1.932959 | 18.819042 | 7299.0 | -1.572967 | 2.195678 | -7.407455 | -3.127826 | -1.945874 | -0.323995 | 9.711988 | 7299.0 | 1.297567 | 0.794103 | -1.347980 | 0.834339 | 1.297944 | 1.770357 | 3.632836 | 7299.0 | 0.234535 | 0.751815 | -2.813847 | -0.220190 | 0.332027 | 0.719093 | 2.278235 | 7299.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
10 | 13994.0 | 75205.874803 | 43439.758881 | 3.0 | 37610.50 | 75121.0 | 112569.50 | 149998.0 | 13994.0 | 67331.798342 | 60753.607940 | 16.0 | 11242.00 | 48541.0 | 117395.00 | 196800.0 | 13994.0 | 2.003084e+07 | 52059.493889 | 19910002.0 | 19991205.00 | 20030405.5 | 20070309.00 | 20151207.0 | 13994.0 | 38.545591 | 37.370739 | 1.0 | 17.0 | 31.0 | 33.0 | 226.0 | 13713.0 | 1.962809 | 2.024122 | 0.0 | 0.0 | 2.0 | 3.0 | 7.0 | 13398.0 | 0.505896 | 0.560648 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 13564.0 | 0.575420 | ... | 0.044004 | 0.125604 | 13994.0 | -0.787944 | 3.789618 | -8.436945 | -4.676338 | 1.234128 | 2.326256 | 12.118343 | 13994.0 | -0.595754 | 2.864221 | -5.366580 | -2.480332 | -0.995418 | 0.903215 | 18.455196 | 13994.0 | 0.860639 | 2.138816 | -7.195385 | -0.722816 | 0.663825 | 2.270567 | 13.562011 | 13994.0 | -1.111276 | 0.690632 | -2.828945 | -1.641285 | -1.178183 | -0.596863 | 1.446954 | 13994.0 | -0.215205 | 0.902067 | -6.113291 | -0.552021 | -0.095289 | 0.356898 | 2.113267 | 13994.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
11 | 2944.0 | 75153.740829 | 42823.474531 | 83.0 | 38343.75 | 75868.5 | 112185.00 | 149997.0 | 2944.0 | 69966.151834 | 61340.869319 | 19.0 | 12643.00 | 52956.0 | 119103.75 | 196785.0 | 2944.0 | 2.004489e+07 | 52721.427797 | 19910003.0 | 20000510.50 | 20040611.5 | 20090404.50 | 20151205.0 | 2944.0 | 83.091712 | 49.708246 | 1.0 | 60.0 | 60.0 | 116.0 | 184.0 | 2852.0 | 1.104839 | 1.168938 | 0.0 | 0.0 | 1.0 | 1.0 | 7.0 | 2785.0 | 0.303052 | 0.482534 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 2843.0 | 0.046782 | ... | 0.113617 | 0.189313 | 2944.0 | 0.246968 | 3.590947 | -6.711672 | -3.468983 | 1.772204 | 3.089890 | 11.675463 | 2944.0 | 0.439107 | 3.252243 | -3.911240 | -1.528835 | 0.028555 | 1.630662 | 18.484508 | 2944.0 | -0.205660 | 2.523470 | -6.919479 | -2.195230 | -0.213088 | 1.690328 | 11.443174 | 2944.0 | 1.330700 | 0.964266 | -1.619776 | 0.704863 | 1.250365 | 2.050045 | 3.896473 | 2944.0 | -0.648424 | 0.922214 | -4.576412 | -0.930180 | -0.530655 | -0.100611 | 1.702045 | 2944.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
12 | 1108.0 | 74192.030686 | 42167.462987 | 345.0 | 38334.50 | 73457.5 | 109422.75 | 149629.0 | 1108.0 | 73052.243682 | 59664.008784 | 21.0 | 17032.00 | 61923.5 | 119556.75 | 196793.0 | 1108.0 | 2.001690e+07 | 55599.544665 | 19910003.0 | 19970411.00 | 20010810.5 | 20060904.00 | 20151210.0 | 1108.0 | 55.358303 | 61.001523 | 1.0 | 15.0 | 15.0 | 131.0 | 176.0 | 1067.0 | 2.029053 | 2.268162 | 0.0 | 0.0 | 1.0 | 5.0 | 7.0 | 1044.0 | 0.182950 | 0.625487 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 1082.0 | 0.118299 | ... | 0.093852 | 0.139933 | 1108.0 | 0.301023 | 3.609198 | -6.522588 | -3.271745 | 2.008425 | 2.908080 | 11.364633 | 1108.0 | 0.069176 | 3.483745 | -4.505058 | -1.983762 | -0.510518 | 1.050628 | 18.213683 | 1108.0 | -0.546464 | 2.234541 | -5.240744 | -2.385848 | -0.719024 | 1.012884 | 9.901890 | 1108.0 | 0.538105 | 0.925150 | -1.562079 | 0.036841 | 0.437971 | 1.314252 | 2.635682 | 1108.0 | -0.436398 | 0.925910 | -3.297603 | -0.823304 | -0.329716 | 0.170607 | 1.689872 | 1108.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
13 | 3813.0 | 75807.485969 | 43369.185052 | 45.0 | 37548.00 | 75584.0 | 113804.00 | 149917.0 | 3813.0 | 69418.422502 | 60943.542910 | 22.0 | 10870.00 | 54468.0 | 120432.00 | 196756.0 | 3813.0 | 2.003626e+07 | 51050.488935 | 19910003.0 | 20000301.00 | 20030512.0 | 20080402.00 | 20151204.0 | 3813.0 | 75.765539 | 75.052396 | 1.0 | 16.0 | 19.0 | 164.0 | 228.0 | 3647.0 | 1.538799 | 1.379087 | 0.0 | 1.0 | 1.0 | 2.0 | 7.0 | 3536.0 | 0.228224 | 0.526208 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 3616.0 | 0.024336 | ... | 0.124263 | 0.176603 | 3813.0 | 0.974184 | 3.544076 | -6.729835 | -2.633895 | 2.398146 | 3.467563 | 11.974500 | 3813.0 | 0.905273 | 3.601515 | -3.991513 | -1.144972 | 0.195962 | 1.960395 | 18.423083 | 3813.0 | -1.115010 | 2.419075 | -6.660521 | -2.940837 | -1.543928 | 0.554868 | 10.251221 | 3813.0 | 1.420437 | 1.222494 | -1.314333 | 0.772731 | 1.583979 | 2.442350 | 3.990237 | 3813.0 | 0.146962 | 0.693859 | -2.795332 | -0.176193 | 0.238874 | 0.552159 | 2.039379 | 3813.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
14 | 16073.0 | 74965.405400 | 43273.533270 | 7.0 | 37602.00 | 74681.0 | 112612.00 | 149970.0 | 16073.0 | 68175.404343 | 61311.451932 | 27.0 | 11440.00 | 50531.0 | 118903.00 | 196795.0 | 16073.0 | 2.002211e+07 | 50572.857718 | 19910001.0 | 19981201.00 | 20011107.0 | 20060301.00 | 20151101.0 | 16073.0 | 53.728862 | 38.714159 | 1.0 | 26.0 | 48.0 | 73.0 | 217.0 | 15518.0 | 1.621601 | 1.457703 | 0.0 | 1.0 | 1.0 | 2.0 | 7.0 | 14883.0 | 0.251092 | 0.516146 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 15382.0 | 0.101547 | ... | 0.095271 | 0.177625 | 16073.0 | 0.628787 | 3.556379 | -6.866449 | -3.019136 | 2.207823 | 3.344779 | 12.034736 | 16073.0 | 0.635476 | 3.371117 | -4.474906 | -1.263805 | 0.108876 | 1.680342 | 18.746229 | 16073.0 | -1.346439 | 2.356970 | -9.639552 | -3.098923 | -1.703890 | 0.123554 | 10.681064 | 16073.0 | 0.897454 | 0.792761 | -1.401177 | 0.255511 | 0.882386 | 1.475575 | 3.462262 | 16073.0 | 0.416238 | 0.985198 | -3.412186 | -0.002721 | 0.585602 | 1.044119 | 2.743993 | 16073.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
15 | 1458.0 | 75836.777778 | 42919.175888 | 2.0 | 39051.25 | 74694.0 | 113072.75 | 149878.0 | 1458.0 | 54366.361454 | 59170.040930 | 496.0 | 5185.00 | 25150.0 | 99853.00 | 196730.0 | 1458.0 | 2.007449e+07 | 38355.204312 | 19911010.0 | 20050202.25 | 20080107.5 | 20100909.00 | 20151101.0 | 1458.0 | 91.209191 | 52.526146 | 1.0 | 20.0 | 115.0 | 115.0 | 208.0 | 1430.0 | 1.951049 | 1.551219 | 0.0 | 1.0 | 1.0 | 4.0 | 7.0 | 1421.0 | 0.104856 | 0.308765 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1434.0 | 0.068340 | ... | 0.081460 | 0.158983 | 1458.0 | -1.638239 | 3.886271 | -6.725212 | -5.291464 | -3.185584 | 2.086631 | 10.866295 | 1458.0 | -0.200208 | 2.910238 | -4.449206 | -1.998758 | -0.043786 | 1.805868 | 18.026409 | 1458.0 | 2.254994 | 1.428281 | -5.134526 | 1.352491 | 2.064085 | 3.018247 | 10.737935 | 1458.0 | -0.021208 | 0.878520 | -2.068686 | -0.602893 | -0.168820 | 0.539850 | 2.368825 | 1458.0 | -0.584835 | 1.093781 | -4.302939 | -1.218293 | -0.244310 | 0.129999 | 1.787471 | 1458.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
16 | 2219.0 | 73605.734114 | 43241.191320 | 24.0 | 36326.50 | 73669.0 | 110490.50 | 149967.0 | 2219.0 | 69268.566021 | 62853.985952 | 32.0 | 7904.00 | 53724.0 | 122929.00 | 196810.0 | 2219.0 | 2.005215e+07 | 42439.839640 | 19950003.0 | 20020007.50 | 20050309.0 | 20081210.00 | 20151203.0 | 2219.0 | 35.250113 | 43.977668 | 1.0 | 21.0 | 21.0 | 21.0 | 169.0 | 2179.0 | 2.031666 | 1.535143 | 0.0 | 1.0 | 1.0 | 4.0 | 7.0 | 2128.0 | 0.184680 | 0.446698 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 1828.0 | 0.833698 | ... | 0.079165 | 0.175786 | 2219.0 | 0.232958 | 3.355363 | -6.007778 | -3.303033 | 1.975813 | 2.850004 | 12.062635 | 2219.0 | 0.023905 | 2.749959 | -3.747783 | -1.521359 | -0.554427 | 1.770565 | 17.938531 | 2219.0 | -0.006882 | 1.636191 | -4.818190 | -1.197193 | -0.141742 | 1.050875 | 9.422496 | 2219.0 | 0.356979 | 0.919330 | -1.609028 | -0.338461 | 0.323656 | 0.799167 | 3.687518 | 2219.0 | 0.283270 | 0.975365 | -3.641332 | 0.062668 | 0.456052 | 0.922392 | 1.673653 | 2219.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
17 | 913.0 | 76391.495071 | 43001.469264 | 127.0 | 42395.00 | 76124.0 | 114306.00 | 149902.0 | 913.0 | 74632.959474 | 59677.582227 | 63.0 | 19057.00 | 60976.0 | 123652.00 | 196611.0 | 913.0 | 2.003280e+07 | 40929.411229 | 19910009.0 | 20001201.00 | 20030312.0 | 20060701.00 | 20151012.0 | 913.0 | 53.189485 | 44.684545 | 1.0 | 19.0 | 35.0 | 55.0 | 234.0 | 876.0 | 1.410959 | 1.752078 | 0.0 | 0.0 | 1.0 | 2.0 | 7.0 | 845.0 | 0.340828 | 0.503355 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 879.0 | 0.067122 | ... | 0.112651 | 0.162548 | 913.0 | 0.416094 | 3.722685 | -6.531455 | -3.334165 | 2.140332 | 3.061162 | 12.180503 | 913.0 | 0.266060 | 3.476105 | -4.446526 | -1.645925 | -0.416363 | 1.441443 | 18.198690 | 913.0 | -0.719356 | 2.331593 | -6.479402 | -2.384324 | -1.261515 | 0.950344 | 9.375095 | 913.0 | 0.910572 | 1.047373 | -1.642076 | 0.148259 | 1.109650 | 1.731515 | 2.921252 | 913.0 | -0.056988 | 0.883488 | -1.889225 | -0.860490 | -0.030175 | 0.765552 | 1.649960 | 913.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
18 | 315.0 | 77633.352381 | 42433.451781 | 189.0 | 41542.00 | 77918.0 | 116743.50 | 149618.0 | 315.0 | 82412.244444 | 58198.407943 | 67.0 | 30605.00 | 70827.0 | 137309.50 | 193771.0 | 315.0 | 2.001261e+07 | 51348.748716 | 19910411.0 | 19971208.00 | 20000609.0 | 20050109.50 | 20150808.0 | 315.0 | 100.771429 | 70.724403 | 1.0 | 37.0 | 72.0 | 149.0 | 211.0 | 302.0 | 1.645695 | 1.573334 | 0.0 | 0.0 | 2.0 | 2.0 | 7.0 | 291.0 | 0.154639 | 0.491448 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 306.0 | 0.160131 | ... | 0.110953 | 0.157512 | 315.0 | 0.887018 | 3.563810 | -6.671605 | -2.337864 | 2.286829 | 3.274892 | 11.002059 | 315.0 | 0.166556 | 3.711119 | -4.157351 | -1.944996 | -0.785210 | 1.349590 | 18.702045 | 315.0 | -0.595573 | 2.602414 | -5.592622 | -2.584811 | -1.188427 | 1.283593 | 10.152891 | 315.0 | 0.851147 | 0.982307 | -1.331271 | 0.144258 | 0.752683 | 1.595511 | 3.357300 | 315.0 | -0.977489 | 1.076947 | -3.405637 | -1.797545 | -0.772322 | -0.296763 | 1.913713 | 315.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
19 | 1386.0 | 75560.020924 | 43706.641419 | 108.0 | 36555.50 | 76038.5 | 113820.75 | 149928.0 | 1386.0 | 78210.521645 | 59147.648804 | 141.0 | 25283.50 | 67916.5 | 126183.00 | 196712.0 | 1386.0 | 2.002140e+07 | 52415.137034 | 19910002.0 | 19980705.00 | 20010802.0 | 20060705.00 | 20150905.0 | 1386.0 | 111.703463 | 74.065212 | 1.0 | 38.0 | 59.0 | 178.0 | 233.0 | 1361.0 | 2.072006 | 1.587391 | 0.0 | 2.0 | 2.0 | 2.0 | 6.0 | 1322.0 | 0.482602 | 0.627469 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 | 1345.0 | 0.318216 | ... | 0.100461 | 0.168084 | 1386.0 | 0.458372 | 3.322205 | -7.524217 | -2.880910 | 1.835827 | 3.014082 | 12.169001 | 1386.0 | -0.467656 | 2.853088 | -4.801876 | -2.113042 | -0.981589 | 1.185698 | 17.911345 | 1386.0 | -0.300360 | 2.549155 | -6.156476 | -2.391214 | -0.695172 | 1.675446 | 10.368790 | 1386.0 | 0.314171 | 1.097748 | -1.889330 | -0.408416 | 0.170611 | 0.989450 | 3.056540 | 1386.0 | -0.829898 | 1.467238 | -4.522631 | -2.368066 | -0.222489 | 0.294714 | 2.166740 | 1386.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
20 | 1235.0 | 72852.993522 | 42770.279495 | 270.0 | 36730.00 | 71808.0 | 109085.00 | 149923.0 | 1235.0 | 75610.621053 | 57519.555046 | 69.0 | 25344.00 | 62591.0 | 124960.50 | 196521.0 | 1235.0 | 2.002248e+07 | 49976.251607 | 19910005.0 | 19990006.00 | 20010911.0 | 20060301.00 | 20150912.0 | 1235.0 | 96.965182 | 72.403099 | 1.0 | 19.0 | 71.0 | 148.0 | 225.0 | 1175.0 | 1.939574 | 2.228954 | 0.0 | 0.0 | 1.0 | 3.0 | 7.0 | 1142.0 | 0.237303 | 0.518451 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 1190.0 | 0.164706 | ... | 0.111390 | 0.177199 | 1235.0 | 0.785034 | 3.588094 | -7.092171 | -2.418497 | 2.126236 | 3.257914 | 11.325653 | 1235.0 | 0.307363 | 3.950357 | -4.688188 | -1.583725 | -0.531796 | 1.171405 | 18.635290 | 1235.0 | -1.070952 | 2.549985 | -9.223993 | -3.036032 | -1.425503 | 0.578787 | 8.568987 | 1235.0 | 0.920612 | 1.223262 | -1.451778 | -0.168708 | 1.165518 | 1.948650 | 3.089669 | 1235.0 | -0.227102 | 1.388666 | -4.633527 | -0.666089 | 0.063963 | 0.836313 | 1.857025 | 1235.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
21 | 1546.0 | 74411.115136 | 43042.507428 | 17.0 | 37334.75 | 75272.5 | 110606.00 | 149710.0 | 1546.0 | 68804.486417 | 57792.723668 | 83.0 | 19015.00 | 54175.0 | 113818.50 | 196158.0 | 1546.0 | 2.007134e+07 | 46009.516333 | 19921211.0 | 20040403.00 | 20071209.0 | 20110307.75 | 20151206.0 | 1546.0 | 65.705045 | 45.426932 | 1.0 | 19.0 | 82.0 | 82.0 | 191.0 | 1521.0 | 2.564103 | 2.358356 | 0.0 | 1.0 | 1.0 | 6.0 | 7.0 | 1486.0 | 0.337147 | 0.557827 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 | 1503.0 | 0.126414 | ... | 0.070734 | 0.177306 | 1546.0 | -0.475685 | 3.523654 | -6.666113 | -3.958469 | 1.187783 | 2.397824 | 11.487130 | 1546.0 | -0.432908 | 2.722832 | -4.713837 | -2.167777 | -0.668103 | 0.938675 | 17.991023 | 1546.0 | 0.525370 | 2.047954 | -6.774670 | -1.059614 | 0.709748 | 2.010039 | 11.683460 | 1546.0 | 0.311608 | 1.191928 | -1.841718 | -0.594075 | 0.300731 | 0.677414 | 3.113212 | 1546.0 | 0.095564 | 1.064292 | -3.496441 | -0.488067 | 0.231581 | 0.948296 | 2.017902 | 1546.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
22 | 1085.0 | 74106.372350 | 43456.470274 | 286.0 | 36621.00 | 73611.0 | 111236.00 | 149966.0 | 1085.0 | 75645.853456 | 57361.844785 | 354.0 | 24505.00 | 64182.0 | 123326.00 | 196675.0 | 1085.0 | 2.006849e+07 | 42384.456746 | 19940404.0 | 20040502.00 | 20060910.0 | 20100704.00 | 20151110.0 | 1085.0 | 88.134562 | 53.247892 | 1.0 | 58.0 | 95.0 | 118.0 | 187.0 | 1069.0 | 2.827877 | 2.350458 | 0.0 | 1.0 | 2.0 | 6.0 | 7.0 | 1041.0 | 0.480307 | 0.569958 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1064.0 | 0.219925 | ... | 0.119286 | 0.180232 | 1085.0 | -0.068847 | 3.459512 | -7.377229 | -3.491510 | 1.554127 | 2.618712 | 11.358895 | 1085.0 | -0.900918 | 2.624135 | -4.631540 | -2.373848 | -1.220473 | 0.308024 | 17.850123 | 1085.0 | 0.652282 | 2.106158 | -5.786973 | -0.849187 | 0.660569 | 2.171080 | 9.804298 | 1085.0 | 0.990317 | 1.206859 | -1.828631 | 0.235624 | 1.122912 | 1.895986 | 3.488786 | 1085.0 | -1.325744 | 1.051142 | -3.483484 | -2.099871 | -1.535887 | -0.535032 | 1.632716 | 1085.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
23 | 183.0 | 71463.065574 | 42681.617690 | 981.0 | 37411.50 | 72602.0 | 106686.50 | 149073.0 | 183.0 | 91299.289617 | 56983.542288 | 200.0 | 42267.50 | 84050.0 | 142807.00 | 194901.0 | 183.0 | 2.001607e+07 | 47984.952170 | 19910601.0 | 19980557.00 | 20010012.0 | 20041205.00 | 20150712.0 | 183.0 | 141.879781 | 76.034237 | 1.0 | 147.0 | 147.0 | 198.0 | 246.0 | 177.0 | 1.361582 | 1.008089 | 0.0 | 1.0 | 1.0 | 2.0 | 5.0 | 170.0 | 0.241176 | 0.505073 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 174.0 | 0.132184 | ... | 0.151933 | 0.195777 | 183.0 | 1.859704 | 2.923402 | -4.659997 | -1.039074 | 2.913294 | 3.682503 | 10.600353 | 183.0 | 0.413718 | 3.339626 | -3.996334 | -1.139900 | -0.230774 | 1.316077 | 18.134975 | 183.0 | -1.458836 | 2.478543 | -6.134694 | -3.138181 | -2.029954 | -0.044675 | 5.744553 | 183.0 | 1.685663 | 1.572607 | -1.319020 | 0.641498 | 1.953308 | 2.837793 | 4.207282 | 183.0 | -0.584089 | 0.954694 | -3.119766 | -0.832251 | -0.573239 | -0.049845 | 1.813045 | 183.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
24 | 630.0 | 77544.577778 | 43472.950598 | 104.0 | 41483.00 | 77434.0 | 116516.75 | 149770.0 | 630.0 | 68077.846032 | 63207.834220 | 754.0 | 6010.00 | 48344.0 | 123193.25 | 196698.0 | 630.0 | 2.004602e+07 | 55756.401134 | 19910101.0 | 20010705.00 | 20050706.0 | 20081009.75 | 20151004.0 | 630.0 | 141.434921 | 57.203244 | 1.0 | 135.0 | 167.0 | 167.0 | 196.0 | 609.0 | 4.722496 | 0.940763 | 0.0 | 4.0 | 5.0 | 5.0 | 7.0 | 596.0 | 0.115772 | 0.364408 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 605.0 | 0.454545 | ... | 0.010444 | 0.096948 | 630.0 | -2.181525 | 4.474586 | -8.563886 | -7.114687 | 0.431129 | 0.936029 | 11.517281 | 630.0 | -2.068203 | 3.619911 | -5.403044 | -4.436895 | -3.733340 | -0.299474 | 17.321064 | 630.0 | 4.232475 | 1.772247 | -2.501647 | 3.090258 | 3.970751 | 5.314654 | 13.847792 | 630.0 | -2.606416 | 1.006631 | -4.153899 | -3.483702 | -2.789147 | -1.705373 | -0.369503 | 630.0 | -1.547405 | 1.520931 | -4.990900 | -2.049624 | -1.381883 | -0.481947 | 1.464521 | 630.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
25 | 2059.0 | 74127.034483 | 43501.338220 | 37.0 | 37139.00 | 74159.0 | 112051.50 | 149751.0 | 2059.0 | 78456.486644 | 59066.256877 | 270.0 | 22920.00 | 68579.0 | 126283.00 | 196732.0 | 2059.0 | 2.004729e+07 | 45101.627596 | 19910301.0 | 20011109.00 | 20050512.0 | 20080610.00 | 20151205.0 | 2059.0 | 77.587664 | 59.440144 | 1.0 | 19.0 | 74.0 | 107.0 | 213.0 | 2003.0 | 1.918622 | 1.544501 | 0.0 | 1.0 | 1.0 | 3.0 | 7.0 | 1964.0 | 0.402749 | 0.550286 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 1990.0 | 0.137186 | ... | 0.129609 | 0.179783 | 2059.0 | 0.631146 | 3.364486 | -5.958284 | -3.020590 | 2.142861 | 2.997208 | 11.318870 | 2059.0 | -0.011960 | 3.091690 | -4.229087 | -1.762297 | -0.722704 | 1.159274 | 18.672101 | 2059.0 | -0.291924 | 2.100629 | -7.142263 | -1.774015 | -0.336165 | 1.088873 | 9.376154 | 2059.0 | 0.970131 | 1.428631 | -1.543721 | -0.544231 | 1.109999 | 2.224228 | 3.841097 | 2059.0 | -0.709130 | 0.910082 | -2.964917 | -1.563683 | -0.554168 | -0.095176 | 1.545084 | 2059.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
26 | 878.0 | 77493.378132 | 43487.023734 | 347.0 | 40607.75 | 76995.5 | 115285.25 | 149956.0 | 878.0 | 88815.642369 | 58856.666895 | 319.0 | 37001.00 | 84736.5 | 138247.25 | 196809.0 | 878.0 | 2.003434e+07 | 57256.432799 | 19910012.0 | 20000002.00 | 20030909.0 | 20077805.25 | 20151211.0 | 878.0 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 736.0 | 3.501359 | 2.472158 | 0.0 | 1.0 | 4.0 | 6.0 | 7.0 | 712.0 | 0.667135 | 1.085715 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 714.0 | 0.523810 | ... | 0.031951 | 0.115086 | 878.0 | 1.585720 | 4.774101 | -7.936043 | -2.071304 | 1.999438 | 3.112465 | 12.319303 | 878.0 | 1.122851 | 6.295803 | -4.734807 | -2.794448 | -1.101424 | 0.872000 | 18.563847 | 878.0 | 1.514269 | 2.664720 | -6.849828 | -0.248631 | 1.420006 | 2.910000 | 12.964502 | 878.0 | -1.450474 | 0.622820 | -2.683358 | -2.035405 | -1.523719 | -0.929298 | -0.012991 | 878.0 | 0.794297 | 0.602347 | -1.421036 | 0.360552 | 0.820368 | 1.257269 | 2.228209 | 878.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
27 | 2049.0 | 73845.705710 | 43240.478883 | 15.0 | 36399.00 | 73615.0 | 110825.00 | 149922.0 | 2049.0 | 69949.441191 | 59332.550713 | 456.0 | 15265.00 | 52631.0 | 117151.00 | 196757.0 | 2049.0 | 2.004516e+07 | 52069.929164 | 19910002.0 | 20010306.00 | 20050901.0 | 20080906.00 | 20150611.0 | 2049.0 | 119.454856 | 58.426871 | 1.0 | 111.0 | 136.0 | 160.0 | 219.0 | 2013.0 | 1.959762 | 1.921902 | 0.0 | 1.0 | 1.0 | 3.0 | 7.0 | 1969.0 | 0.376333 | 0.797652 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 1996.0 | 0.132766 | ... | 0.117030 | 0.171262 | 2049.0 | -0.349411 | 3.401541 | -8.385159 | -3.775698 | 1.388747 | 2.568366 | 11.438472 | 2049.0 | -0.220547 | 2.787389 | -4.900100 | -1.958291 | -0.485905 | 1.291661 | 17.955811 | 2049.0 | 0.433205 | 1.927001 | -5.487213 | -1.092251 | 0.502035 | 1.570864 | 9.238937 | 2049.0 | 0.909657 | 1.208100 | -1.819820 | 0.267599 | 1.155180 | 1.789408 | 3.036236 | 2049.0 | -1.148241 | 1.261924 | -4.288604 | -1.986162 | -1.035812 | -0.208064 | 1.791176 | 2049.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
28 | 633.0 | 74738.657188 | 43356.425965 | 267.0 | 37797.00 | 76432.0 | 112074.00 | 149999.0 | 633.0 | 75250.409163 | 57674.072361 | 328.0 | 21680.00 | 63202.0 | 118631.00 | 196106.0 | 633.0 | 2.006528e+07 | 52284.870060 | 19910003.0 | 20050609.00 | 20070810.0 | 20100907.00 | 20140312.0 | 633.0 | 83.764613 | 79.789794 | 1.0 | 19.0 | 19.0 | 177.0 | 238.0 | 625.0 | 2.417600 | 2.278666 | 0.0 | 1.0 | 1.0 | 5.0 | 7.0 | 603.0 | 0.406302 | 0.726179 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 | 619.0 | 0.268174 | ... | 0.145658 | 0.182138 | 633.0 | 0.083255 | 3.220259 | -6.585369 | -2.903576 | 1.680640 | 2.600357 | 11.081233 | 633.0 | -0.673638 | 2.632410 | -4.493970 | -2.368460 | -1.022317 | 0.951035 | 17.505573 | 633.0 | 0.662696 | 1.697063 | -4.261720 | -0.609220 | 0.507734 | 1.879061 | 8.397233 | 633.0 | 0.913849 | 1.738917 | -1.765759 | -0.652658 | 0.216346 | 2.616105 | 3.267012 | 633.0 | -0.725659 | 1.529047 | -4.800336 | -0.651846 | -0.316054 | 0.056592 | 1.681307 | 633.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
29 | 406.0 | 74621.884236 | 43236.633955 | 228.0 | 37153.50 | 73950.5 | 111987.25 | 149794.0 | 406.0 | 70945.206897 | 56446.897666 | 364.0 | 19719.00 | 55950.5 | 115637.75 | 191417.0 | 406.0 | 2.010308e+07 | 22641.540390 | 19990307.0 | 20090311.00 | 20101007.0 | 20120607.75 | 20150801.0 | 406.0 | 145.527094 | 53.246485 | 1.0 | 97.0 | 153.0 | 203.0 | 220.0 | 402.0 | 2.766169 | 2.209716 | 0.0 | 1.0 | 2.0 | 6.0 | 7.0 | 387.0 | 0.410853 | 0.643167 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 396.0 | 0.000000 | ... | 0.154961 | 0.180703 | 406.0 | -0.790372 | 3.346689 | -6.638600 | -4.004475 | 1.189307 | 2.337868 | 9.911740 | 406.0 | -0.815966 | 2.505750 | -4.278372 | -2.240155 | -1.316043 | 1.003902 | 18.215280 | 406.0 | 1.264502 | 1.273312 | -1.366109 | 0.238120 | 1.313790 | 2.125154 | 7.968338 | 406.0 | 2.348716 | 0.891323 | -1.267262 | 1.340144 | 2.740750 | 2.959609 | 3.670654 | 406.0 | -1.569358 | 1.180541 | -4.288898 | -2.280369 | -2.059071 | -0.173354 | 1.671054 | 406.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
30 | 940.0 | 75172.069149 | 44258.585908 | 223.0 | 36734.00 | 74946.0 | 114503.50 | 149916.0 | 940.0 | 70067.446809 | 57915.463701 | 368.0 | 20215.00 | 52366.0 | 116240.00 | 196292.0 | 940.0 | 2.003896e+07 | 55861.120829 | 19910005.0 | 19990907.50 | 20050009.5 | 20080806.00 | 20151207.0 | 940.0 | 71.619149 | 69.901983 | 1.0 | 19.0 | 19.0 | 137.0 | 194.0 | 914.0 | 2.840263 | 2.467461 | 0.0 | 1.0 | 1.0 | 6.0 | 7.0 | 885.0 | 0.148023 | 0.388769 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 905.0 | 0.081768 | ... | 0.112232 | 0.177734 | 940.0 | -0.126412 | 3.654306 | -6.275393 | -3.720092 | 1.457661 | 2.748777 | 11.517406 | 940.0 | 0.005554 | 3.280222 | -4.358551 | -1.716405 | -0.350104 | 1.357139 | 18.455192 | 940.0 | -0.173341 | 2.100211 | -6.730851 | -1.817245 | -0.051923 | 1.165228 | 9.049487 | 940.0 | 0.576410 | 1.292044 | -1.367847 | -0.634741 | 0.557611 | 1.816135 | 3.403661 | 940.0 | -0.826164 | 1.106440 | -4.161693 | -1.277038 | -0.576715 | -0.192327 | 1.843991 | 940.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
31 | 318.0 | 79211.823899 | 44752.440924 | 986.0 | 38857.25 | 81918.5 | 119649.75 | 149436.0 | 318.0 | 72795.921384 | 59748.139844 | 19.0 | 19852.00 | 55504.5 | 124235.25 | 196533.0 | 318.0 | 2.002378e+07 | 41200.588927 | 19920009.0 | 19991135.25 | 20030156.5 | 20060481.00 | 20120903.0 | 318.0 | 122.251572 | 66.370955 | 1.0 | 100.0 | 100.0 | 150.0 | 241.0 | 299.0 | 1.515050 | 1.537760 | 0.0 | 1.0 | 1.0 | 1.0 | 7.0 | 285.0 | 0.021053 | 0.204472 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 300.0 | 0.110000 | ... | 0.154065 | 0.197007 | 318.0 | 1.544940 | 3.692167 | -5.214919 | -2.356698 | 2.989602 | 3.932320 | 11.722887 | 318.0 | 1.596768 | 4.075336 | -3.331715 | -0.364359 | 0.453222 | 2.303626 | 18.434987 | 318.0 | -2.145505 | 2.219577 | -7.476219 | -3.721772 | -2.628344 | -0.437923 | 5.561065 | 318.0 | 2.790518 | 1.253698 | -0.601478 | 2.628097 | 3.147084 | 3.627427 | 4.574046 | 318.0 | -0.098571 | 1.047429 | -3.530033 | -0.350457 | 0.116856 | 0.605951 | 1.781382 | 318.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
32 | 588.0 | 77254.421769 | 43933.469591 | 785.0 | 38649.50 | 79253.0 | 116300.50 | 149906.0 | 588.0 | 80697.831633 | 60478.938739 | 532.0 | 23408.00 | 73592.0 | 133244.25 | 196673.0 | 588.0 | 2.002166e+07 | 37778.359656 | 19910012.0 | 19991207.50 | 20020457.5 | 20050602.25 | 20150007.0 | 588.0 | 101.000000 | 74.183804 | 1.0 | 19.0 | 120.0 | 173.0 | 185.0 | 568.0 | 2.457746 | 1.557247 | 0.0 | 2.0 | 3.0 | 3.0 | 7.0 | 551.0 | 0.471869 | 0.631469 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 571.0 | 0.553415 | ... | 0.109876 | 0.148974 | 588.0 | 0.433381 | 3.612880 | -6.647469 | -3.321114 | 2.084440 | 2.931624 | 11.528951 | 588.0 | -0.221359 | 3.340916 | -4.232155 | -2.162154 | -1.201379 | 1.061553 | 17.634468 | 588.0 | -0.626338 | 2.320674 | -7.328804 | -2.191988 | -1.058943 | 0.740408 | 9.804773 | 588.0 | 0.564083 | 1.066244 | -1.316508 | -0.479304 | 0.759701 | 1.385450 | 2.695913 | 588.0 | -1.007297 | 1.020659 | -4.544762 | -1.584126 | -0.847772 | -0.353823 | 1.404809 | 588.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
33 | 201.0 | 74372.567164 | 46552.749727 | 213.0 | 34590.00 | 68461.0 | 118484.00 | 149600.0 | 201.0 | 81124.049751 | 57825.076347 | 726.0 | 30423.00 | 70408.0 | 127691.00 | 196467.0 | 201.0 | 2.002732e+07 | 49375.949773 | 19910111.0 | 20000403.00 | 20021211.0 | 20061008.00 | 20150808.0 | 201.0 | 111.517413 | 80.162903 | 1.0 | 19.0 | 179.0 | 181.0 | 181.0 | 198.0 | 0.555556 | 1.364690 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 195.0 | 0.420513 | 0.589941 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 198.0 | 0.792929 | ... | 0.101085 | 0.122254 | 201.0 | -0.382310 | 3.667147 | -6.699000 | -4.361581 | 1.701981 | 2.388892 | 9.788658 | 201.0 | -0.963164 | 2.475391 | -4.042492 | -2.448779 | -1.813465 | 0.695686 | 14.265191 | 201.0 | 1.162601 | 1.828595 | -1.835974 | -0.105305 | 0.825324 | 2.461365 | 8.332826 | 201.0 | -0.212318 | 1.028219 | -2.181042 | -1.122769 | 0.195374 | 0.548407 | 1.846387 | 201.0 | -1.430185 | 1.426232 | -3.699757 | -2.725109 | -1.730609 | -0.153571 | 1.100867 | 201.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
34 | 227.0 | 67690.295154 | 45058.232029 | 821.0 | 26593.50 | 63861.0 | 108637.00 | 149962.0 | 227.0 | 86537.942731 | 58063.238676 | 733.0 | 39316.00 | 75063.0 | 135185.00 | 196322.0 | 227.0 | 2.001670e+07 | 25854.786441 | 19940305.0 | 20000306.50 | 20021009.0 | 20040357.00 | 20090806.0 | 227.0 | 103.903084 | 63.722708 | 1.0 | 92.0 | 92.0 | 141.0 | 216.0 | 214.0 | 1.116822 | 1.130207 | 0.0 | 1.0 | 1.0 | 1.0 | 7.0 | 206.0 | 0.106796 | 0.450747 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 218.0 | 0.073394 | ... | 0.180723 | 0.213617 | 227.0 | 2.336748 | 3.069714 | -3.994545 | -0.675070 | 3.451118 | 3.842282 | 11.482525 | 227.0 | 1.297720 | 3.818776 | -1.979440 | -0.628523 | 0.084570 | 2.241391 | 18.004163 | 227.0 | -2.717730 | 2.046480 | -7.591009 | -3.716544 | -2.843971 | -2.143089 | 5.869691 | 227.0 | 3.593484 | 1.683694 | 0.175089 | 3.653462 | 4.219027 | 4.847437 | 5.249750 | 227.0 | -0.478732 | 0.482595 | -1.466562 | -0.833232 | -0.483200 | -0.137173 | 0.978073 | 227.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
35 | 180.0 | 75761.288889 | 42658.662014 | 1239.0 | 42427.25 | 76158.0 | 109473.00 | 149835.0 | 180.0 | 92504.377778 | 56658.832239 | 1985.0 | 43453.50 | 89256.0 | 137603.50 | 193029.0 | 180.0 | 1.999566e+07 | 26177.571767 | 19920202.0 | 19980753.00 | 19991104.0 | 20010412.00 | 20081012.0 | 180.0 | 20.794444 | 31.562600 | 1.0 | 19.0 | 19.0 | 19.0 | 240.0 | 174.0 | 1.137931 | 1.774400 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 | 163.0 | 0.171779 | 0.438785 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 171.0 | 0.128655 | ... | 0.072706 | 0.111065 | 180.0 | 2.698322 | 2.546303 | -5.300156 | 2.679139 | 3.218441 | 3.602274 | 11.216075 | 180.0 | -0.132526 | 3.185104 | -4.492133 | -1.452299 | -0.933803 | 0.004659 | 15.908381 | 180.0 | -2.118162 | 1.833393 | -7.024219 | -3.163407 | -2.343118 | -1.363827 | 4.675358 | 180.0 | -0.310187 | 0.406609 | -0.710799 | -0.534089 | -0.451840 | -0.195753 | 2.305511 | 180.0 | -0.269934 | 0.703368 | -4.304987 | -0.470896 | -0.285957 | 0.087244 | 1.152364 | 180.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
36 | 228.0 | 73029.513158 | 41254.861198 | 699.0 | 36801.25 | 74890.0 | 106929.50 | 147599.0 | 228.0 | 95786.000000 | 58318.848570 | 2310.0 | 45353.25 | 93729.5 | 145547.00 | 196115.0 | 228.0 | 2.000187e+07 | 43634.811252 | 19910001.0 | 19971006.50 | 20000703.5 | 20030533.75 | 20120201.0 | 228.0 | 68.736842 | 84.506711 | 1.0 | 19.0 | 19.0 | 205.0 | 232.0 | 226.0 | 2.075221 | 2.010772 | 0.0 | 0.0 | 2.0 | 4.0 | 7.0 | 210.0 | 0.280952 | 0.604702 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 219.0 | 0.210046 | ... | 0.088218 | 0.178423 | 228.0 | 1.428753 | 2.682799 | -5.395128 | 1.532496 | 2.541814 | 2.996332 | 10.525269 | 228.0 | -1.140752 | 2.139584 | -3.857012 | -2.153057 | -1.579656 | -0.685238 | 17.097484 | 228.0 | -0.905136 | 1.914516 | -5.375736 | -2.362553 | -1.033816 | 0.415038 | 4.007988 | 228.0 | -0.026382 | 0.938113 | -1.156501 | -0.848372 | -0.379545 | 1.012891 | 2.450867 | 228.0 | -0.661150 | 0.820463 | -2.472309 | -1.188085 | -0.478278 | -0.047181 | 0.935694 | 228.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
37 | 331.0 | 73778.833837 | 44279.708500 | 57.0 | 34365.50 | 73581.0 | 111356.00 | 149816.0 | 331.0 | 82216.232628 | 58785.168955 | 1747.0 | 30364.50 | 69461.0 | 136735.50 | 192926.0 | 331.0 | 2.005355e+07 | 53794.575230 | 19910106.0 | 20010907.50 | 20060112.0 | 20091008.00 | 20151210.0 | 330.0 | 190.275758 | 28.105528 | 1.0 | 189.0 | 200.0 | 202.0 | 206.0 | 324.0 | 5.978395 | 0.355622 | 0.0 | 6.0 | 6.0 | 6.0 | 7.0 | 326.0 | 0.861963 | 0.371228 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 324.0 | 0.478395 | ... | 0.106693 | 0.222787 | 331.0 | -1.205835 | 3.866438 | -8.798810 | -5.546108 | 0.586353 | 1.297289 | 10.207696 | 331.0 | -2.449968 | 3.002255 | -5.391154 | -4.375784 | -3.313433 | -0.703420 | 14.724600 | 331.0 | 2.681481 | 2.293266 | -2.854245 | 0.874165 | 2.821037 | 4.536863 | 9.983565 | 331.0 | 0.091308 | 1.051330 | -2.255182 | -0.374834 | 0.060246 | 0.455534 | 11.147669 | 331.0 | -4.089576 | 1.169474 | -6.546556 | -4.561176 | -4.118236 | -3.682716 | 8.658418 | 331.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
38 | 65.0 | 79568.169231 | 46074.928132 | 570.0 | 35584.00 | 98020.0 | 115964.00 | 147553.0 | 65.0 | 69180.092308 | 53590.168691 | 2632.0 | 21519.00 | 56816.0 | 111613.00 | 189410.0 | 65.0 | 2.006727e+07 | 42559.876836 | 19950008.0 | 20050701.00 | 20080205.0 | 20091003.00 | 20150205.0 | 65.0 | 171.600000 | 83.524847 | 1.0 | 214.0 | 214.0 | 214.0 | 242.0 | 60.0 | 4.666667 | 2.282332 | 0.0 | 2.0 | 6.0 | 6.0 | 6.0 | 60.0 | 0.350000 | 0.755208 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 60.0 | 0.000000 | ... | 0.156860 | 0.210402 | 65.0 | 0.925148 | 4.186859 | -5.543358 | -3.030515 | 1.955995 | 2.588624 | 11.613387 | 65.0 | -0.364350 | 4.634212 | -3.655055 | -2.905654 | -1.627678 | 0.216899 | 16.388386 | 65.0 | -0.025510 | 2.116160 | -3.477392 | -1.090534 | -0.339330 | 0.525587 | 8.297318 | 65.0 | 2.623626 | 1.387277 | -0.887793 | 2.542658 | 3.255139 | 3.416094 | 4.599514 | 65.0 | -1.855674 | 1.140203 | -3.268925 | -2.505578 | -2.328672 | -1.755200 | 0.878431 | 65.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
39 | 9.0 | 70365.222222 | 39637.457808 | 5144.0 | 49803.00 | 76258.0 | 99049.00 | 127071.0 | 9.0 | 74224.666667 | 59824.384652 | 22825.0 | 38387.00 | 50778.0 | 65608.00 | 181810.0 | 9.0 | 2.000169e+07 | 75637.080552 | 19910707.0 | 19950009.00 | 19981203.0 | 20040710.00 | 20150402.0 | 9.0 | 86.000000 | 118.727629 | 1.0 | 1.0 | 19.0 | 244.0 | 244.0 | 7.0 | 2.571429 | 2.439750 | 0.0 | 0.5 | 2.0 | 4.5 | 6.0 | 6.0 | 0.166667 | 0.408248 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 | 0.000000 | ... | 0.072366 | 0.184668 | 9.0 | 1.515873 | 2.985496 | -6.214428 | 1.717704 | 2.118526 | 3.142073 | 3.399385 | 9.0 | 2.910655 | 8.745510 | -3.436782 | -1.548087 | -1.059497 | 0.697001 | 18.197699 | 9.0 | 0.760231 | 2.972780 | -2.862146 | -1.400021 | -0.249955 | 1.914129 | 5.452985 | 9.0 | 0.496650 | 1.477420 | -0.897785 | -0.674703 | 0.075403 | 1.532134 | 3.577009 | 9.0 | -0.899515 | 1.813946 | -3.620942 | -2.306780 | 0.000658 | 0.258403 | 1.005671 | 9.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
40 rows × 240 columns
# Compute per-brand sales statistics; the same approach works for other features
# The statistics must be computed from the train data only
train_gb = train.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    # print('kind\n', kind)
    # print('kind_data\n', kind_data)
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    # The +1 in the denominator makes this a slightly smoothed average, not the exact mean
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
# brand_fe.head()
data = data.merge(brand_fe, how='left', on='brand')
brand_fe.head()
brand | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 31429.0 | 68500.0 | 3199.0 | 13.0 | 173719698.0 | 6261.371627 | 5527.19 |
1 | 1 | 13656.0 | 84000.0 | 6399.0 | 15.0 | 124044603.0 | 8988.865406 | 9082.86 |
2 | 2 | 318.0 | 55800.0 | 7500.0 | 35.0 | 3766241.0 | 10576.224444 | 11806.40 |
3 | 3 | 2461.0 | 37500.0 | 4990.0 | 65.0 | 15954226.0 | 5396.327503 | 6480.19 |
4 | 4 | 16575.0 | 99999.0 | 5999.0 | 12.0 | 138279069.0 | 8089.863295 | 8342.13 |
data.head()
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | regionCode | seller | offerType | creatDate | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | used_time | city | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | 0.0 | 1046 | 0 | 0 | 20160404 | 1850.0 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 | 1 | 4385.0 | 1 | 10193.0 | 35990.0 | 1800.0 | 13.0 | 36457518.0 | 4562.233331 | 3576.37 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | 4366 | 0 | 0 | 20160309 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 | 4757.0 | 4 | 13656.0 | 84000.0 | 6399.0 | 15.0 | 124044603.0 | 8988.865406 | 9082.86 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | 0.0 | 2806 | 0 | 0 | 20160402 | 6222.0 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 | 1 | 4382.0 | 2 | 1458.0 | 45000.0 | 8500.0 | 100.0 | 14373814.0 | 5425.058140 | 9851.83 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | 0.0 | 434 | 0 | 0 | 20160312 | 2400.0 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 | 1 | 7125.0 | 0 | 13994.0 | 92900.0 | 5200.0 | 15.0 | 113034210.0 | 8244.695287 | 8076.76 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | 0.0 | 6977 | 0 | 0 | 20160313 | 5200.0 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 | 1 | 1531.0 | 6 | 4662.0 | 31500.0 | 2300.0 | 20.0 | 15414322.0 | 3344.689763 | 3305.67 |
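The per-brand loop above can also be written with a single `groupby(...).agg(...)` call. The following is only a sketch on a made-up toy frame (`train_demo` and `brand_fe_demo` are hypothetical names, and the prices are invented); it reproduces the same column names and the same `sum / (count + 1)` average, and is not the tutorial's own code.

```python
import pandas as pd

# Toy stand-in for the train set (values are made up for illustration)
train_demo = pd.DataFrame({
    "brand": [0, 0, 0, 1, 1, 2],
    "price": [1000, 3000, 0, 5000, 7000, 2000],
})

# Drop non-positive prices first, as the loop above does
valid = train_demo[train_demo["price"] > 0]

# Named aggregation: one output column per statistic
brand_fe_demo = (
    valid.groupby("brand")["price"]
    .agg(
        brand_amount="count",  # equals len() here because price has no NaN
        brand_price_max="max",
        brand_price_median="median",
        brand_price_min="min",
        brand_price_sum="sum",
        brand_price_std="std",
    )
    .reset_index()
)

# Same smoothed average as the loop: sum / (count + 1)
brand_fe_demo["brand_price_average"] = (
    brand_fe_demo["brand_price_sum"] / (brand_fe_demo["brand_amount"] + 1)
).round(2)

print(brand_fe_demo)
```

Because the statistics are built from prices, they should still be computed on the train rows only and then merged onto the combined data by `brand`, as in the cell above.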
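# Data binning, using power as an example
# Note: missing values, and values outside the bin edges such as power = 0, end up as NaN in power_bin
# Why bin the data? There are several reasons:
# 1. After discretization, sparse-vector inner products are faster to compute, and the results are easy to store and extend (a benefit of one-hot encoding);
# 2. Discretized features are more robust to outliers: with a feature such as "age > 30 -> 1, else 0", an age of 200 does not disturb the model much;
# 3. LR is a generalized linear model with limited expressive power; after discretization each bin gets its own weight, which introduces non-linearity and improves the fit (a benefit of one-hot encoding);
# 4. Discretized features can be crossed, turning M + N variables into M * N variables, which adds further non-linearity and expressive power (a benefit of one-hot encoding);
# 5. The model is more stable after discretization: with age ranges, for example, a user's feature does not change just because they got one year older.
# There are other reasons as well; LightGBM, in improving on XGBoost, added data binning, which can help the model generalize
bins = [i*10 for i in range(31)]
data['power_bin'] = pd.cut(data['power'], bins, labels=False)
data[['power_bin', 'power']].head()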
power_bin | power | |
---|---|---|
0 | 5.0 | 60 |
1 | NaN | 0 |
2 | 16.0 | 163 |
3 | 19.0 | 193 |
4 | 6.0 | 68 |
data['power_bin']  # With labels=False, power_bin holds the index of the bin each power value falls into
0 5.0
1 NaN
2 16.0
3 19.0
4 6.0
...
199032 11.0
199033 7.0
199034 22.0
199035 NaN
199036 6.0
Name: power_bin, Length: 199037, dtype: float64
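The NaN entries in `power_bin` come from how `pd.cut` treats the bin edges: by default each interval is open on the left and closed on the right, so `power = 0` and anything above 300 fall outside every bin. The sketch below uses a handful of made-up power values to show this, and then one possible way to keep those rows in a bin; capping at 300 and using `include_lowest=True` are assumptions for illustration, not choices made in the tutorial.

```python
import pandas as pd

# Same bin edges as above: 0, 10, ..., 300
bins = [i * 10 for i in range(31)]
power = pd.Series([0, 60, 163, 193, 68, 500])

# Default: intervals are (left, right], so 0 and values above 300 get NaN
default_bins = pd.cut(power, bins, labels=False)

# Assumption for illustration: cap large values at 300 and put 0 into the first bin
capped = power.clip(upper=300)
inclusive_bins = pd.cut(capped, bins, labels=False, include_lowest=True)

print(pd.DataFrame({"power": power,
                    "default": default_bins,
                    "inclusive": inclusive_bins}))
```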
# Now that these columns have been used to build features, the raw columns can be dropped
data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)
print(data.shape)
data.columns
(199037, 39)
Index(['SaleID', 'name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType',
'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8',
'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time',
'city', 'brand_amount', 'brand_price_max', 'brand_price_median',
'brand_price_min', 'brand_price_sum', 'brand_price_std',
'brand_price_average', 'power_bin'],
dtype='object')
# The data is already usable by tree models at this point, so we export it
data.to_csv('data_for_tree.csv', index=False)
data.head()
SaleID | name | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | seller | offerType | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | used_time | city | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | power_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | 0.0 | 0 | 0 | 1850.0 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 | 1 | 4385.0 | 1 | 10193.0 | 35990.0 | 1800.0 | 13.0 | 36457518.0 | 4562.233331 | 3576.37 | 5.0 |
1 | 1 | 2262 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | 0 | 0 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 | 4757.0 | 4 | 13656.0 | 84000.0 | 6399.0 | 15.0 | 124044603.0 | 8988.865406 | 9082.86 | NaN |
2 | 2 | 14874 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | 0.0 | 0 | 0 | 6222.0 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 | 1 | 4382.0 | 2 | 1458.0 | 45000.0 | 8500.0 | 100.0 | 14373814.0 | 5425.058140 | 9851.83 | 16.0 |
3 | 3 | 71865 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | 0.0 | 0 | 0 | 2400.0 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 | 1 | 7125.0 | 0 | 13994.0 | 92900.0 | 5200.0 | 15.0 | 113034210.0 | 8244.695287 | 8076.76 | 19.0 |
4 | 4 | 111080 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | 0.0 | 0 | 0 | 5200.0 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 | 1 | 1531.0 | 6 | 4662.0 | 31500.0 | 2300.0 | 20.0 | 15414322.0 | 3344.689763 | 3305.67 | 6.0 |
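# We can build a second feature set for models such as LR and NN
# The two sets are built separately because different models have different requirements on the data
# Let's look at the distribution:
data['power'].plot.hist()
# We already removed outliers from train, but the distribution still looks odd because of power outliers in test,
# so it would have been better not to delete the power outliers in train and to truncate the long tail instead
train['power'].plot.hist()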
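Truncating the long tail, as suggested above, means capping extreme values rather than deleting the rows that contain them, so train and test can be treated the same way. A minimal sketch on made-up values follows; the cap `POWER_CAP = 600` is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy power values; the very large ones stand in for the long tail
train_power = pd.Series([60, 75, 90, 110, 150, 163, 193, 240, 19312])
test_power = pd.Series([0, 68, 120, 600, 20000])

# Hypothetical cap, chosen for illustration only
POWER_CAP = 600

# Clip instead of dropping rows, and apply the same cap to train and test
train_clipped = train_power.clip(upper=POWER_CAP)
test_clipped = test_power.clip(upper=POWER_CAP)

print(train_clipped.tolist())
print(test_clipped.tolist())
```

No rows are removed, so the row counts of train and test stay unchanged.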
# Take the log first, then apply min-max normalization
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
data['power'] = np.log(data['power'] + 1)
data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
data['power'].plot.hist()
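The cell above creates a `MinMaxScaler` but then normalizes by hand. As a rough check that the two routes agree, the sketch below applies `log(x + 1)` and both forms of min-max scaling to a few made-up power values (the frame `demo` is hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy power values, including 0 (log1p(0) is 0, so no -inf appears)
demo = pd.DataFrame({"power": [0.0, 60.0, 163.0, 193.0, 600.0]})

# log(x + 1), written as log1p
logged = np.log1p(demo[["power"]])

# Manual min-max scaling, as in the cell above
manual = (logged - logged.min()) / (logged.max() - logged.min())

# The same scaling through scikit-learn
scaled = MinMaxScaler().fit_transform(logged)

print(np.allclose(manual.to_numpy(), scaled))
```

One design consideration: a fitted scaler object can be reused on new data with `transform`, whereas the manual formula recomputes the minimum and maximum from whatever data it is given.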
# kilometer looks fairly normal; it has probably been binned already
data['kilometer'].plot.hist()
data['kilometer'].value_counts()  # Only a handful of distinct values appear, which is why kilometer looks pre-binned
15.0 128682
12.5 20958
10.0 8506
9.0 6992
8.0 6043
7.0 5442
6.0 4886
5.0 4197
4.0 3576
3.0 3309
2.0 3034
0.5 2431
1.0 981
Name: kilometer, dtype: int64
data['power_bin'].plot.hist()  # Plot power after binning to see the effect
# kilometer can therefore be normalized directly
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) /
(np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()
# In addition, there are the statistical features we just built:
# 'brand_amount', 'brand_price_average', 'brand_price_max',
# 'brand_price_median', 'brand_price_min', 'brand_price_std',
# 'brand_price_sum'
# We won't analyze them one by one here; we apply the transform directly
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

for col in ['brand_amount', 'brand_price_average', 'brand_price_max',
            'brand_price_median', 'brand_price_min', 'brand_price_std',
            'brand_price_sum']:
    data[col] = max_min(data[col])
data.head()
SaleID | name | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | seller | offerType | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | used_time | city | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | power_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 0.415091 | 0.827586 | 0.0 | 0 | 0 | 1850.0 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 | 1 | 4385.0 | 1 | 0.324125 | 0.340786 | 0.032075 | 0.002064 | 0.209684 | 0.207660 | 0.081655 | 5.0 |
1 | 1 | 2262 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0.000000 | 1.000000 | - | 0 | 0 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 | 4757.0 | 4 | 0.434341 | 0.835230 | 0.205623 | 0.004128 | 0.713985 | 0.437002 | 0.257305 | NaN |
2 | 2 | 14874 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 0.514954 | 0.827586 | 0.0 | 0 | 0 | 6222.0 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 | 1 | 4382.0 | 2 | 0.046117 | 0.433578 | 0.284906 | 0.091847 | 0.082533 | 0.252362 | 0.281834 | 16.0 |
3 | 3 | 71865 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 0.531917 | 1.000000 | 0.0 | 0 | 0 | 2400.0 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 | 1 | 7125.0 | 0 | 0.445099 | 0.926889 | 0.160377 | 0.004128 | 0.650591 | 0.398447 | 0.225212 | 19.0 |
4 | 4 | 111080 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 0.427535 | 0.310345 | 0.0 | 0 | 0 | 5200.0 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 | 1 | 1531.0 | 6 | 0.148090 | 0.294545 | 0.050943 | 0.009288 | 0.088524 | 0.144579 | 0.073020 | 6.0 |
# One-hot encode the categorical features
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'notRepairedDamage', 'power_bin'])
data.head()
SaleID | name | power | kilometer | seller | offerType | price | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | used_time | city | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | model_0.0 | model_1.0 | model_2.0 | model_3.0 | model_4.0 | model_5.0 | model_6.0 | model_7.0 | model_8.0 | model_9.0 | model_10.0 | model_11.0 | model_12.0 | model_13.0 | model_14.0 | model_15.0 | model_16.0 | model_17.0 | ... | bodyType_0.0 | bodyType_1.0 | bodyType_2.0 | bodyType_3.0 | bodyType_4.0 | bodyType_5.0 | bodyType_6.0 | bodyType_7.0 | fuelType_0.0 | fuelType_1.0 | fuelType_2.0 | fuelType_3.0 | fuelType_4.0 | fuelType_5.0 | fuelType_6.0 | gearbox_0.0 | gearbox_1.0 | notRepairedDamage_- | notRepairedDamage_0.0 | notRepairedDamage_1.0 | power_bin_0.0 | power_bin_1.0 | power_bin_2.0 | power_bin_3.0 | power_bin_4.0 | power_bin_5.0 | power_bin_6.0 | power_bin_7.0 | power_bin_8.0 | power_bin_9.0 | power_bin_10.0 | power_bin_11.0 | power_bin_12.0 | power_bin_13.0 | power_bin_14.0 | power_bin_15.0 | power_bin_16.0 | power_bin_17.0 | power_bin_18.0 | power_bin_19.0 | power_bin_20.0 | power_bin_21.0 | power_bin_22.0 | power_bin_23.0 | power_bin_24.0 | power_bin_25.0 | power_bin_26.0 | power_bin_27.0 | power_bin_28.0 | power_bin_29.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 0.415091 | 0.827586 | 0 | 0 | 1850.0 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 | 1 | 4385.0 | 1 | 0.324125 | 0.340786 | 0.032075 | 0.002064 | 0.209684 | 0.207660 | 0.081655 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 2262 | 0.000000 | 1.000000 | 0 | 0 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 | 4757.0 | 4 | 0.434341 | 0.835230 | 0.205623 | 0.004128 | 0.713985 | 0.437002 | 0.257305 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2 | 14874 | 0.514954 | 0.827586 | 0 | 0 | 6222.0 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 | 1 | 4382.0 | 2 | 0.046117 | 0.433578 | 0.284906 | 0.091847 | 0.082533 | 0.252362 | 0.281834 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 3 | 71865 | 0.531917 | 1.000000 | 0 | 0 | 2400.0 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 | 1 | 7125.0 | 0 | 0.445099 | 0.926889 | 0.160377 | 0.004128 | 0.650591 | 0.398447 | 0.225212 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 4 | 111080 | 0.427535 | 0.310345 | 0 | 0 | 5200.0 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 | 1 | 1531.0 | 6 | 0.148090 | 0.294545 | 0.050943 | 0.009288 | 0.088524 | 0.144579 | 0.073020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 370 columns
print(data.shape)
data.columns
(199037, 370)
Index(['SaleID', 'name', 'power', 'kilometer', 'seller', 'offerType', 'price',
'v_0', 'v_1', 'v_2',
...
'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
'power_bin_28.0', 'power_bin_29.0'],
dtype='object', length=370)
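In the encoded output above there is no separate column for missing `power_bin` values, so a row whose `power_bin` is NaN gets zeros in every `power_bin_*` column. If an explicit indicator is wanted, `pd.get_dummies` accepts `dummy_na=True`. The sketch below shows the difference on a tiny made-up frame; whether the extra column helps is a modelling choice the tutorial does not discuss.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"power_bin": [5.0, np.nan, 16.0]})

# Default: a NaN category produces a row of zeros
default = pd.get_dummies(demo, columns=["power_bin"])

# dummy_na=True adds an explicit indicator column for missing values
with_na = pd.get_dummies(demo, columns=["power_bin"], dummy_na=True)

print(default.astype(int))
print(with_na.astype(int))
```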
# This dataset can be used by LR
data.to_csv('data_for_lr.csv', index=False)
# Correlation analysis
print(data['power'].corr(data['price'], method='spearman'))
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amount'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))
0.5728285196051496
-0.4082569701616764
0.058156610025581514
0.3834909576057687
0.259066833880992
0.38691042393409447
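Spearman correlation, used in the calls above, is the Pearson correlation computed on ranks, which is why it captures monotonic relationships even when they are not linear. The sketch below checks this on made-up numbers and also shows `DataFrame.corrwith` as a way to get all feature-versus-price coefficients in one call; the frame `demo` is hypothetical.

```python
import pandas as pd

# Toy data (made-up numbers): price falls monotonically but not linearly with kilometer
demo = pd.DataFrame({
    "kilometer": [0.5, 2.0, 5.0, 10.0, 12.5, 15.0],
    "power": [200, 150, 160, 110, 90, 60],
    "price": [30000, 18000, 9000, 4000, 3500, 1500],
})

features = ["kilometer", "power"]

# Spearman correlation of every feature with price in one call
spearman = demo[features].corrwith(demo["price"], method="spearman")

# Spearman is the Pearson correlation of the ranks
ranks = demo.rank()
pearson_on_ranks = ranks[features].corrwith(ranks["price"])

print(spearman)
print(pearson_on_ranks)
```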
# Of course, we can also just look at a plot
# price is included so that the heatmap actually shows each feature's correlation with it
data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average',
                     'brand_price_max', 'brand_price_median', 'price']]
correlation = data_numeric.corr()
f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)
x.columns
Index(['SaleID', 'name', 'power', 'kilometer', 'seller', 'offerType', 'v_0',
'v_1', 'v_2', 'v_3',
...
'power_bin_20.0', 'power_bin_21.0', 'power_bin_22.0', 'power_bin_23.0',
'power_bin_24.0', 'power_bin_25.0', 'power_bin_26.0', 'power_bin_27.0',
'power_bin_28.0', 'power_bin_29.0'],
dtype='object', length=369)
# from sklearn.model_selection import cross_val_score, ShuffleSplit
# from sklearn.ensemble import RandomForestRegressor
# import numpy as np
# X = x
# Y = y
# rf = RandomForestRegressor(n_estimators=20, max_depth=4)
# scores = []
# # Fit a model on each feature alone and cross-validate it
# for i in range(X.shape[1]):
#     # Note the difference between X.iloc[:, i] (1-D) and X.iloc[:, i:i+1] (2-D)
#     score = cross_val_score(rf, X.iloc[:, i:i+1], Y, scoring="r2",
#                             cv=ShuffleSplit(n_splits=3, test_size=.3))
#     scores.append(format(np.mean(score), '.3f'))  # , X.columns[i]
# print(sorted(scores, reverse=True))
# !pip install mlxtend
def fill(x):
    if not x:
        x = 0
    return int(x)
data['city'] = data['city'].map(fill)
data.groupby("city").describe()
#data['city'].isnull().sum()
data['city'].value_counts()
0 48645
1 42188
2 35133
3 27325
4 19945
5 13462
6 8313
7 3887
8 139
Name: city, dtype: int64
data['price'][0:train.shape[0]].isnull().sum()
0
print (train.shape[0])
print (test.shape[0])
149037
50000
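# A large k_features takes a very long time to run; without a server, I interrupted the run early
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
          k_features=20,
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)  # cv=0 disables cross-validation, so each subset is scored on the data it was fit on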
x = data.drop(['price'], axis=1)
x = x.fillna(0)[0:train.shape[0]]
y = data['price'][0:train.shape[0]]
sfs.fit(x, y)
sfs.k_feature_names_
('kilometer',
'v_3',
'v_4',
'v_6',
'v_13',
'v_14',
'used_time',
'brand_price_average',
'model_44.0',
'model_105.0',
'model_113.0',
'model_167.0',
'brand_16',
'bodyType_6.0',
'gearbox_1.0',
'power_bin_6.0',
'power_bin_18.0',
'power_bin_24.0',
'power_bin_25.0',
'power_bin_26.0')
# A large k_features takes a very long time to run; without a server, I interrupted the run early
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),
          k_features=10,
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)
x = data.drop(['price', 'city'], axis=1)
# x.head()
x = x.fillna(0)[0:train.shape[0]]
y = data['price'][0:train.shape[0]]
sfs.fit(x, y)
sfs.k_feature_names_
('kilometer',
'v_3',
'v_4',
'v_13',
'v_14',
'used_time',
'brand_price_average',
'model_167.0',
'gearbox_1.0',
'power_bin_24.0')
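To make the selection procedure concrete, here is a hand-written sketch of greedy forward selection: start from an empty set and, at each step, add the single feature that raises the score the most. It mirrors the settings used above (`forward=True`, `floating=False`, R² scored on the training data as with `cv=0`), but it is a simplified illustration on synthetic data and not mlxtend's implementation; `r2_of_subset` and `forward_select` are hypothetical helper names.

```python
import numpy as np

def r2_of_subset(X, y, cols):
    """R^2 of an ordinary least-squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, k):
    """Greedily add the column that raises training R^2 the most, k times."""
    selected, history = [], []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = max(remaining, key=lambda j: r2_of_subset(X, y, selected + [j]))
        selected.append(best)
        history.append(r2_of_subset(X, y, selected))
    return selected, history

# Synthetic data: only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=500)

selected, history = forward_select(X, y, k=3)
print(selected)
print(np.round(history, 3))
```

Each step refits one model per remaining candidate, so the cost grows with both the number of columns and `k`, which is consistent with the long run times noted above for several hundred one-hot columns.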
# k_feature = sfs.get_metric_dict()
# for fea in k_feature:
#     fea = k_feature[fea]
#     print(f"Feature Name:{fea['feature_names']},")
#     print(f"Avg_Score:{fea['avg_score']}")
x.head()
#print(train.shape[0])
SaleID | name | power | kilometer | seller | offerType | v_0 | v_1 | v_2 | v_3 | v_4 | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | used_time | city | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | model_0.0 | model_1.0 | model_2.0 | model_3.0 | model_4.0 | model_5.0 | model_6.0 | model_7.0 | model_8.0 | model_9.0 | model_10.0 | model_11.0 | model_12.0 | model_13.0 | model_14.0 | model_15.0 | model_16.0 | model_17.0 | model_18.0 | ... | bodyType_0.0 | bodyType_1.0 | bodyType_2.0 | bodyType_3.0 | bodyType_4.0 | bodyType_5.0 | bodyType_6.0 | bodyType_7.0 | fuelType_0.0 | fuelType_1.0 | fuelType_2.0 | fuelType_3.0 | fuelType_4.0 | fuelType_5.0 | fuelType_6.0 | gearbox_0.0 | gearbox_1.0 | notRepairedDamage_- | notRepairedDamage_0.0 | notRepairedDamage_1.0 | power_bin_0.0 | power_bin_1.0 | power_bin_2.0 | power_bin_3.0 | power_bin_4.0 | power_bin_5.0 | power_bin_6.0 | power_bin_7.0 | power_bin_8.0 | power_bin_9.0 | power_bin_10.0 | power_bin_11.0 | power_bin_12.0 | power_bin_13.0 | power_bin_14.0 | power_bin_15.0 | power_bin_16.0 | power_bin_17.0 | power_bin_18.0 | power_bin_19.0 | power_bin_20.0 | power_bin_21.0 | power_bin_22.0 | power_bin_23.0 | power_bin_24.0 | power_bin_25.0 | power_bin_26.0 | power_bin_27.0 | power_bin_28.0 | power_bin_29.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 0.415091 | 0.827586 | 0 | 0 | 43.357796 | 3.966344 | 0.050257 | 2.159744 | 1.143786 | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 | 1 | 4385.0 | 1 | 0.324125 | 0.340786 | 0.032075 | 0.002064 | 0.209684 | 0.207660 | 0.081655 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 2262 | 0.000000 | 1.000000 | 0 | 0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | -1.422165 | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 | 4757.0 | 4 | 0.434341 | 0.835230 | 0.205623 | 0.004128 | 0.713985 | 0.437002 | 0.257305 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2 | 14874 | 0.514954 | 0.827586 | 0 | 0 | 45.978359 | 4.823792 | 1.319524 | -0.998467 | -0.996911 | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 | 1 | 4382.0 | 2 | 0.046117 | 0.433578 | 0.284906 | 0.091847 | 0.082533 | 0.252362 | 0.281834 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 3 | 71865 | 0.531917 | 1.000000 | 0 | 0 | 45.687478 | 4.492574 | -0.050616 | 0.883600 | -2.228079 | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 | 1 | 7125.0 | 0 | 0.445099 | 0.926889 | 0.160377 | 0.004128 | 0.650591 | 0.398447 | 0.225212 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 4 | 111080 | 0.427535 | 0.310345 | 0 | 0 | 44.383511 | 2.031433 | 0.572169 | -1.571239 | 2.246088 | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 | 1 | 1531.0 | 6 | 0.148090 | 0.294545 | 0.050943 | 0.009288 | 0.088524 | 0.144579 | 0.073020 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 369 columns
# Plot the scores to see the marginal gain from each added feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:217: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims)
F:\dev\anaconda\envs\python35\lib\site-packages\numpy\core\_methods.py:209: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
pd.DataFrame.from_dict(sfs.get_metric_dict()).T
 | feature_idx | cv_scores | avg_score | feature_names | ci_bound | std_dev | std_err |
---|---|---|---|---|---|---|---|
1 | (9,) | [0.5580593794194673] | 0.558059 | (v_3,) | NaN | 0 | NaN |
2 | (9, 30) | [0.6253563249806938] | 0.625356 | (v_3, brand_price_average) | NaN | 0 | NaN |
3 | (3, 9, 30) | [0.6614119003709955] | 0.661412 | (kilometer, v_3, brand_price_average) | NaN | 0 | NaN |
4 | (3, 9, 30, 335) | [0.6712706106724942] | 0.671271 | (kilometer, v_3, brand_price_average, gearbox_... | NaN | 0 | NaN |
5 | (3, 9, 30, 198, 335) | [0.6801326459700268] | 0.680133 | (kilometer, v_3, brand_price_average, model_16... | NaN | 0 | NaN |
6 | (3, 9, 22, 30, 198, 335) | [0.686927264547389] | 0.686927 | (kilometer, v_3, used_time, brand_price_averag... | NaN | 0 | NaN |
7 | (3, 9, 19, 22, 30, 198, 335) | [0.6941981569972937] | 0.694198 | (kilometer, v_3, v_13, used_time, brand_price_... | NaN | 0 | NaN |
8 | (3, 9, 10, 19, 22, 30, 198, 335) | [0.6990798224753535] | 0.69908 | (kilometer, v_3, v_4, v_13, used_time, brand_p... | NaN | 0 | NaN |
9 | (3, 9, 10, 19, 22, 30, 198, 335, 363) | [0.7036045618841336] | 0.703605 | (kilometer, v_3, v_4, v_13, used_time, brand_p... | NaN | 0 | NaN |
10 | (3, 9, 10, 19, 20, 22, 30, 198, 335, 363) | [0.7073329162983002] | 0.707333 | (kilometer, v_3, v_4, v_13, v_14, used_time, b... | NaN | 0 | NaN |
11 | (3, 9, 10, 12, 19, 20, 22, 30, 198, 335, 363) | [0.7116225332630737] | 0.711623 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
12 | (3, 9, 10, 12, 19, 20, 22, 30, 198, 295, 335, ... | [0.7152839215589477] | 0.715284 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
13 | (3, 9, 10, 12, 19, 20, 22, 30, 198, 295, 335, ... | [0.7183533830790547] | 0.718353 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
14 | (3, 9, 10, 12, 19, 20, 22, 30, 144, 198, 295, ... | [0.7210323407042653] | 0.721032 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
15 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 144, 198, 2... | [0.7235732490774848] | 0.723573 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
16 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 144, 198, 2... | [0.726091372443646] | 0.726091 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
17 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... | [0.7286164680329102] | 0.728616 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
18 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... | [0.7309480347784469] | 0.730948 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
19 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... | [0.7332378240942985] | 0.733238 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
20 | (3, 9, 10, 12, 19, 20, 22, 30, 75, 136, 144, 1... | [0.7352137419490058] | 0.735214 | (kilometer, v_3, v_4, v_6, v_13, v_14, used_ti... | NaN | 0 | NaN |
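# The next chapter shows how Lasso regression and decision trees can perform embedded feature selection
# In most cases, embedded methods are used for feature selection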
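As a preview of the embedded approach mentioned above, here is a minimal sketch on synthetic data. The data, the `alpha` value, and the variable names are illustrative assumptions, not part of the tutorial's pipeline. It shows how the L1 penalty in Lasso shrinks the coefficients of uninformative features to exactly zero, so selection happens as a by-product of fitting.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic data (assumption): only the first two of five features drive y.
X = rng.normal(size=(n_samples, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Lasso is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Features whose coefficient is exactly zero are effectively dropped.
selected = np.flatnonzero(lasso.coef_ != 0)
print("coefficients:", lasso.coef_)
print("selected feature indices:", selected)
```

A larger `alpha` drops more features; in practice it would be chosen by cross-validation.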
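Feature engineering is the most important part of a competition. In traditional competitions especially, everyone's models tend to be similar and hyperparameter tuning brings only limited gains, whereas the quality of the feature engineering often determines the final ranking and score.
The main goal of feature engineering is to transform the data into features that better represent the underlying problem, thereby improving machine learning performance. For example, outlier handling removes noise, and filling in missing values can inject prior knowledge.
Feature construction is also part of feature engineering; its purpose is to make the data more expressive.
Some competitions use anonymous features, so we do not know how the features relate to one another. In that case we can only work on the features themselves: operations such as binning, groupby, and agg to compute feature statistics; further transformations such as log and exp; or arithmetic on multiple features (such as the usage time we computed above) and polynomial combinations, followed by feature selection. Anonymity does limit what can be done with the features, although using a NN to extract features sometimes works unexpectedly well.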
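The binning and groupby/agg operations described above can be sketched as follows. The toy rows and the 10-unit bucket width are assumptions made for illustration; `brand_price_average` follows the name that appears in the selection table, and `brand_price_max` is a hypothetical extra column.

```python
import pandas as pd

# Toy rows (assumption): a tiny stand-in for the competition table.
df = pd.DataFrame({
    "brand": [0, 0, 1, 1, 1],
    "power": [60, 150, 75, 300, 110],
    "price": [1850, 6222, 2400, 15000, 5200],
})

# Binning: 10-unit-wide, right-closed buckets (0, 10], (10, 20], ...
bins = list(range(0, 311, 10))
df["power_bin"] = pd.cut(df["power"], bins=bins, labels=False)

# groupby + agg: per-brand price statistics, merged back as new columns.
brand_stats = (
    df.groupby("brand")["price"]
      .agg(brand_price_average="mean", brand_price_max="max")
      .reset_index()
)
df = df.merge(brand_stats, on="brand", how="left")
print(df)
```

Statistics derived from the target, such as the per-brand price mean, should be computed on the training rows only and then merged onto the test rows, otherwise the target leaks into the features.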
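For feature engineering on features whose meaning is known (non-anonymous), especially in industrial competitions, more meaningful features are built from signal processing, frequency-domain extraction, kurtosis, skewness, and so on. This is feature construction grounded in background knowledge. Recommender systems work the same way: click-through-rate statistics of various kinds, statistics per time period, statistics combined with user attributes, and so on. Building features like these usually requires a deep analysis of the underlying business logic or physical principles in order to find the "magic" features.
Of course, feature engineering is tied to the model. This is why bucketing and feature normalization are done for LR and NN, and why the effect of feature processing and feature importance usually has to be verified through the model.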
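To illustrate why scale-sensitive models such as LR and NN need normalization, here is a minimal min-max scaling sketch. The helper name `min_max_fit_transform` and the sample values are hypothetical. The key point is that the minimum and maximum come from the training column and are reused for any other data.

```python
import pandas as pd

def min_max_fit_transform(train_col, other_col):
    """Scale both columns to the range defined by the training column."""
    lo, hi = train_col.min(), train_col.max()
    scale = hi - lo  # assumes the column is not constant (scale != 0)
    return (train_col - lo) / scale, (other_col - lo) / scale

# Sample kilometer values (assumption).
train_km = pd.Series([12.5, 15.0, 5.0])
test_km = pd.Series([10.0])

train_scaled, test_scaled = min_max_fit_transform(train_km, test_km)
print(train_scaled.tolist(), test_scaled.tolist())
```

Tree-based models are insensitive to monotonic rescaling, so this step mainly matters for linear models and neural networks.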
Overall, feature engineering is easy to get started with but very hard to master.
By: 阿泽
PS: Graduate student in computer science at Fudan University
Zhihu: 阿泽 https://www.zhihu.com/people/is-aze (knowledge summaries aimed mainly at beginners)
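About Datawhale:
Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and a team of members with an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages people to present themselves authentically, to be open and inclusive, to trust and help one another, and to dare to experiment and take responsibility. Datawhale also applies open-source principles to open content, open learning, and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future.
For this data mining learning path, the topic materials will be shared on Tianchi; follow Datawhale for details.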
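Feature engineering turns data into features that better represent the underlying problem; it determines the upper bound of your predictions.
Data understanding: knowing whether the data is qualitative or quantitative makes it easier to process.
Data cleaning (improving data quality): we handled missing values and outliers. For features with a long-tailed distribution, truncation can be used instead of deleting outliers outright. If a linear model is to be used, the data also needs to be standardized or normalized.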
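A minimal sketch of truncation for a long-tailed feature: values outside a box-plot style range are clipped to the boundary, so no rows are removed. The helper name `clip_by_iqr`, the factor of 3, and the sample values are assumptions for illustration.

```python
import pandas as pd

def clip_by_iqr(s, scale=3.0):
    """Clip values to [Q1 - scale*IQR, Q3 + scale*IQR] instead of dropping rows."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - scale * iqr, upper=q3 + scale * iqr)

# Sample power values with one extreme entry (assumption).
power = pd.Series([60, 75, 90, 110, 150, 163, 193, 19312], dtype=float)
clipped = clip_by_iqr(power)
print(clipped.tolist())
```

Because clipping keeps every row, it can also be applied to the test set, where rows cannot be deleted.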
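Feature construction (to make the data more expressive and add prior knowledge): here we constructed a time-difference feature, but found that it contains some NaT values; I filled these missing values with the mean of the newly constructed time-difference column. We also binned power, adding the new feature power_bin. As for why kilometer is presumed to have been binned already: looking at the raw data, it is clear that kilometer takes only a limited number of values. For the geographic information, since we have prior knowledge, we strip the last three digits and keep the city information; because this can leave empty values, we replace them with 0 to form a new category. We applied one_hot encoding to introduce some nonlinearity; its benefits are described above.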
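The time-difference and city steps summarized above can be sketched like this. The three sample rows are assumptions (the third `regDate` has month `00`, so it cannot be parsed), and the column names `used_time` and `city` are used for illustration.

```python
import pandas as pd

# Sample rows (assumption); 20070009 is an invalid date (month "00").
df = pd.DataFrame({
    "regDate": [20040402, 20030301, 20070009],
    "creatDate": [20160404, 20160309, 20160402],
    "regionCode": [1046, 4366, 434],
})

# Unparseable dates become NaT because of errors="coerce".
reg = pd.to_datetime(df["regDate"].astype(str), format="%Y%m%d", errors="coerce")
creat = pd.to_datetime(df["creatDate"].astype(str), format="%Y%m%d", errors="coerce")

# Time difference in days; rows with NaT yield NaN, filled with the column mean.
df["used_time"] = (creat - reg).dt.days
df["used_time"] = df["used_time"].fillna(df["used_time"].mean())

# Drop the last three digits of regionCode; an empty result becomes city 0.
df["city"] = df["regionCode"].astype(str).str[:-3].replace("", "0").astype(int)
print(df[["used_time", "city"]])
```

Filling with the mean is only one choice; tree models such as XGBoost can also handle the missing values directly if they are left as NaN.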
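Feature selection:
Filter (select features by their correlation with price)
Wrapper (use a greedy search to find a reasonably good set of features)
Embedded (the learner selects features automatically)
Personally, I feel this problem does not have that many features, so feature selection is not really necessary.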
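For the filter approach listed above, a minimal sketch ranks features by the absolute Spearman correlation with price and keeps those above a cutoff. The toy data, the column name `v_x`, and the 0.7 threshold are all hypothetical.

```python
import pandas as pd

# Toy data (assumption): kilometer falls monotonically as price rises.
df = pd.DataFrame({
    "kilometer": [15.0, 14.0, 12.0, 10.0, 8.0, 5.0],
    "v_x": [3, 1, 2, 6, 4, 5],
    "price": [1000, 2000, 3000, 4000, 5000, 6000],
})

# Spearman correlation of every feature with the target.
corr = df.corr(method="spearman")["price"].drop("price")

# Rank by absolute strength and keep features above the cutoff.
ranked = corr.abs().sort_values(ascending=False)
threshold = 0.7  # hypothetical cutoff
selected = ranked[ranked >= threshold].index.tolist()
print(ranked)
print("selected:", selected)
```

A filter looks at each feature in isolation, so it can miss features that are only useful in combination; that is the gap wrapper and embedded methods try to close.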
Also remember to apply a log transform to data that is not normally distributed, to bring it as close to a normal distribution as possible.
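A small sketch of the log transform on a right-skewed column, using made-up price values. It compares skewness before and after `np.log1p` and shows that `np.expm1` inverts the transform, which is needed to turn predictions made in log space back into prices.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices (assumption).
price = pd.Series([500, 1200, 1850, 2400, 3600, 6222, 15000, 99999], dtype=float)

# log1p computes log(1 + x), which is also safe when x is 0.
log_price = np.log1p(price)
print("skew before:", price.skew())
print("skew after: ", log_price.skew())

# expm1 is the exact inverse of log1p.
restored = np.expm1(log_price)
```

The transform only helps with right-skewed, non-negative data; a column that is already roughly symmetric gains nothing from it.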