Datawhale - Data Mining for Beginners: Used Car Transaction Price Prediction - Feature Engineering


  • 1. Feature engineering from the baseline
    • 1.1 Outlier handling
    • 1.2 Data cleaning
    • 1.3 The label (price) is not normally distributed, so take the log of price
    • 1.4 Feature construction: car usage time, city information extracted from the postal code, etc. (everything from the baseline was used; the code is not repeated here)
  • 2. Some ideas of my own (for reference only)
    • 2.1 Following the baseline, which builds statistical features from brand and price, I also built statistical features for regionCode vs. price and model vs. price
    • 2.2 Following the baseline's brand-price statistical features, I built statistical features of price grouped by (regionCode, brand), (regionCode, model), and (model, brand)
    • 2.3 For the anonymous features, correlation analysis with price shows that v_3 and v_8 are redundant, so keeping only one of them is enough. Since we cannot know what the anonymous features mean in relation to price, I processed a few of the ones that ranked high in the earlier feature-importance analysis. The processing consists of arithmetic operations (+, -, *, /) between anonymous features.
  • (If you've come this far, I hope the experts will take a look at the last paragraph, haha)
    • I hope the features above give you some inspiration. Here I would like to ask everyone for a framework for computing MAE (on the log-transformed price). If anyone can provide one, I would be deeply grateful.

1. Feature engineering from the baseline

1.1 Outlier handling

1.2 Data cleaning

1.3 The label (price) is not normally distributed, so take the log of price
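As a minimal sketch of this step (assuming the training set has been loaded into `train_data`), the skewed target can be transformed with `np.log1p` and predictions mapped back with `np.expm1`:

```python
import numpy as np

# price is heavily right-skewed; model log(1 + price) instead of the raw price.
train_data['price_log'] = np.log1p(train_data['price'])

# After predicting in log space, map predictions back to the original scale:
# pred_price = np.expm1(pred_log)
```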

1.4 Feature construction: car usage time, city information extracted from the postal code, etc. (everything from the baseline was used; the code is not repeated here)
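The baseline code is intentionally not repeated here; purely as an illustrative sketch (assuming the raw columns `regDate`, `creatDate`, and `regionCode`), these two constructions usually look like this:

```python
import pandas as pd

def add_time_and_city(data):
    # Usage time in days: listing creation date minus registration date.
    # errors='coerce' turns malformed dates (e.g. month "00") into NaT.
    reg_date = pd.to_datetime(data['regDate'].astype(str), format='%Y%m%d', errors='coerce')
    creat_date = pd.to_datetime(data['creatDate'].astype(str), format='%Y%m%d', errors='coerce')
    data['used_time'] = (creat_date - reg_date).dt.days

    # City: keep only the leading digits of the postal code (drop the last three).
    data['city'] = data['regionCode'].apply(lambda x: str(x)[:-3])
    return data
```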

2. Some ideas of my own (for reference only)

2.1 Following the baseline, which builds statistical features from brand and price, I also built statistical features for regionCode vs. price and model vs. price

```python
import pandas as pd

# Group by brand; repeat the same pattern with "regionCode" and "model".
train_gb = train_data.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    kind_data = kind_data[kind_data['price'] > 0]   # keep rows with a valid price
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
df = df.merge(brand_fe, how='left', on='brand')
```
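Since exactly the same aggregation is repeated for regionCode and model, a small helper can replace the three copies of the loop. This is only a sketch of my own (the name `group_price_stats` is not from the original code):

```python
import pandas as pd

def group_price_stats(train_data, df, key):
    """Attach price statistics grouped by `key` (e.g. 'brand') to df."""
    valid = train_data[train_data['price'] > 0]          # keep rows with a valid price
    stats = valid.groupby(key)['price'].agg(
        ['count', 'max', 'median', 'min', 'sum', 'std', 'mean']).reset_index()
    stats.columns = [key] + [f'{key}_price_{c}' for c in
                             ['count', 'max', 'median', 'min', 'sum', 'std', 'mean']]
    return df.merge(stats, how='left', on=key)

# Usage (sketch):
# df = group_price_stats(train_data, df, 'brand')
# df = group_price_stats(train_data, df, 'regionCode')
# df = group_price_stats(train_data, df, 'model')
```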

2.2 Following the baseline's brand-price statistical features, I built statistical features of price grouped by (regionCode, brand), (regionCode, model), and (model, brand)

```python
# Price statistics per (regionCode, brand) pair
rb_avg_price = df.groupby(['regionCode','brand'])['price'].mean().rename('rb_avg_price').reset_index()
rb_max_price = df.groupby(['regionCode','brand'])['price'].max().rename('rb_max_price').reset_index()
rb_min_price = df.groupby(['regionCode','brand'])['price'].min().rename('rb_min_price').reset_index()
rb_std_price = df.groupby(['regionCode','brand'])['price'].std().rename('rb_std_price').reset_index()
rb_med_price = df.groupby(['regionCode','brand'])['price'].median().rename('rb_med_price').reset_index()

# Price statistics per (regionCode, model) pair
rm_avg_price = df.groupby(['regionCode','model'])['price'].mean().rename('rm_avg_price').reset_index()
rm_max_price = df.groupby(['regionCode','model'])['price'].max().rename('rm_max_price').reset_index()
rm_min_price = df.groupby(['regionCode','model'])['price'].min().rename('rm_min_price').reset_index()
rm_std_price = df.groupby(['regionCode','model'])['price'].std().rename('rm_std_price').reset_index()
rm_med_price = df.groupby(['regionCode','model'])['price'].median().rename('rm_med_price').reset_index()

# Price statistics per (model, brand) pair
mb_avg_price = df.groupby(['model','brand'])['price'].mean().rename('mb_avg_price').reset_index()
mb_max_price = df.groupby(['model','brand'])['price'].max().rename('mb_max_price').reset_index()
mb_min_price = df.groupby(['model','brand'])['price'].min().rename('mb_min_price').reset_index()
mb_std_price = df.groupby(['model','brand'])['price'].std().rename('mb_std_price').reset_index()
mb_med_price = df.groupby(['model','brand'])['price'].median().rename('mb_med_price').reset_index()

df = df.merge(rb_avg_price, on=['regionCode','brand'], how='left')
df = df.merge(rb_max_price, on=['regionCode','brand'], how='left')
df = df.merge(rb_min_price, on=['regionCode','brand'], how='left')
df = df.merge(rb_std_price, on=['regionCode','brand'], how='left')
df = df.merge(rb_med_price, on=['regionCode','brand'], how='left')

df = df.merge(rm_avg_price, on=['regionCode','model'], how='left')
df = df.merge(rm_max_price, on=['regionCode','model'], how='left')
df = df.merge(rm_min_price, on=['regionCode','model'], how='left')
df = df.merge(rm_std_price, on=['regionCode','model'], how='left')
df = df.merge(rm_med_price, on=['regionCode','model'], how='left')

df = df.merge(mb_avg_price, on=['model','brand'], how='left')
df = df.merge(mb_max_price, on=['model','brand'], how='left')
df = df.merge(mb_min_price, on=['model','brand'], how='left')
df = df.merge(mb_std_price, on=['model','brand'], how='left')
df = df.merge(mb_med_price, on=['model','brand'], how='left')
```
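The fifteen groupby calls above can also be collapsed into one aggregation per key pair. This is an equivalent sketch rather than the original code; the helper name `add_pair_price_stats` and the `prefix` argument are mine:

```python
def add_pair_price_stats(df, keys, prefix):
    # One groupby per key pair, computing several statistics at once.
    stats = df.groupby(keys)['price'].agg(['mean', 'max', 'min', 'std', 'median']).reset_index()
    stats.columns = keys + [f'{prefix}_{s}_price' for s in ['avg', 'max', 'min', 'std', 'med']]
    return df.merge(stats, on=keys, how='left')

# df = add_pair_price_stats(df, ['regionCode', 'brand'], 'rb')
# df = add_pair_price_stats(df, ['regionCode', 'model'], 'rm')
# df = add_pair_price_stats(df, ['model', 'brand'], 'mb')
```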

2.3 For the anonymous features, correlation analysis with price shows that v_3 and v_8 are redundant, so keeping only one of them is enough. Since we cannot know what the anonymous features mean in relation to price, I processed a few of the ones that ranked high in the earlier feature-importance analysis. The processing consists of arithmetic operations (+, -, *, /) between anonymous features.
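As a quick sketch of the correlation check mentioned here (assuming the anonymous columns v_0 … v_14 are present in `train_data`), you can inspect how they correlate with price and with each other; the `get_feature` function below then builds the arithmetic combinations:

```python
# Correlation of the anonymous features with price and with each other.
anon_cols = ['v_' + str(i) for i in range(15)] + ['price']
corr = train_data[anon_cols].corr()
print(corr['price'].sort_values(ascending=False))
# Two features with nearly identical correlation patterns (here v_3 and v_8)
# carry largely redundant information, so keeping one of them is enough.
```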



```python
def get_feature(data):
    # Pairwise arithmetic combinations of the strongly important anonymous features
    # (column names match the operation applied).
    data['v_12*v_8'] = data['v_12'] * data['v_8']
    data['v_12+v_8'] = data['v_12'] + data['v_8']
    data['v_12/v_8'] = data['v_12'] / data['v_8']
    data['v_12-v_8'] = data['v_12'] - data['v_8']

    data['v_12*v_0'] = data['v_12'] * data['v_0']
    data['v_12+v_0'] = data['v_12'] + data['v_0']
    data['v_12/v_0'] = data['v_12'] / data['v_0']
    data['v_12-v_0'] = data['v_12'] - data['v_0']

    data['v_0*v_8'] = data['v_0'] * data['v_8']
    data['v_0+v_8'] = data['v_0'] + data['v_8']
    data['v_0/v_8'] = data['v_0'] / data['v_8']
    data['v_0-v_8'] = data['v_0'] - data['v_8']

    data['v_12*v_8*v_0'] = data['v_12'] * data['v_8'] * data['v_0']
    data['v_12+v_8+v_0'] = data['v_12'] + data['v_8'] + data['v_0']
    data['v_12/v_8/v_0'] = data['v_12'] / data['v_8'] / data['v_0']
    data['v_12-v_8-v_0'] = data['v_12'] - data['v_8'] - data['v_0']
    return data
```

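A short usage note on the block above: because `get_feature` divides by `v_8` and `v_0`, zero values produce `inf`, so it can help to clean those up before training (illustrative, not from the original post):

```python
import numpy as np

df = get_feature(df)

# Division by zero yields +/-inf; turn these into NaN so later imputation
# or tree-based models can deal with them.
df = df.replace([np.inf, -np.inf], np.nan)
```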

(If you've come this far, I hope the experts will take a look at the last paragraph, haha)

I hope the features above give you some inspiration. Here I would like to ask everyone for a framework for computing MAE (on the log-transformed price). If anyone can provide one, I would be deeply grateful.
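Not an authoritative answer to the request above, but one minimal way to score MAE when training on log1p(price) is to invert the transform before calling sklearn's metric; the helper name and the CV wiring below are just a sketch:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, make_scorer

def mae_on_price(y_true_log, y_pred_log):
    """MAE on the original price scale for models trained on log1p(price)."""
    return mean_absolute_error(np.expm1(y_true_log), np.expm1(y_pred_log))

# Example 5-fold CV wiring (model = any regressor fitted on log1p(price)):
# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(model, X, np.log1p(y), cv=5,
#                          scoring=make_scorer(mae_on_price, greater_is_better=False))
# print('CV MAE:', -scores.mean())
```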
