Datawhale – Introductory Data Mining: Used-Car Price Prediction – Feature Engineering
- 1. Feature engineering from the baseline
- 1.1 Outlier handling
- 1.2 Data cleaning
- 1.3 The label is not normally distributed, so take the log of price
- 1.4 Feature construction: car usage time, extracting the city from the postal code, etc. (everything from the baseline was used; the code is not repeated here)
- 2. Some ideas of my own (for reference only)
- 2.1 Building on the baseline's brand-vs-price statistical features, I also built statistical features for regionCode vs. price and model vs. price
- 2.2 Building on the baseline's brand-vs-price statistical features, I built statistics for (regionCode, brand) vs. price, (regionCode, model) vs. price, and (model, brand) vs. price
- 2.3 For the anonymous features, correlation analysis with price suggests v_3 and v_8 are redundant, so keeping one of them suffices. Since we cannot know how the anonymous features relate to price, I processed a few of the strongly related ones based on earlier feature importance. The processing consists of arithmetic operations (+, -, *, /) between anonymous features.
- (If you made it in here, please take a look at the last paragraph!)
- I hope these features give you some inspiration. I would also like to ask everyone for an MAE evaluation framework (with log-transformed price). If anyone can share one, I would be very grateful.
1. Feature engineering from the baseline
1.1 Outlier handling
1.2 Data cleaning
1.3 The label is not normally distributed, so take the log of price
1.4 Feature construction: car usage time, extracting the city from the postal code, etc. (everything from the baseline was used; the code is not repeated here)
2. Some ideas of my own (for reference only)
2.1 Building on the baseline's brand-vs-price statistical features, I also built statistical features for regionCode vs. price and model vs. price
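Since no code is given for 1.3 and 1.4, here is a minimal sketch of those two steps on a hypothetical mini-sample; the column names `regDate`, `creatDate`, and `regionCode` follow the competition data, but the values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample mimicking the competition's columns.
df = pd.DataFrame({
    'price': [1000, 3500, 200],
    'regDate': ['20040402', '20100301', '19990601'],
    'creatDate': ['20160404', '20160309', '20160315'],
    'regionCode': [1046, 4366, 2806],
})

# price is right-skewed, so train on log1p(price); predictions are
# mapped back to the original scale later with expm1.
df['price_log'] = np.log1p(df['price'])

# Usage time in days = listing date minus registration date; some
# regDate values are malformed, so errors='coerce' yields NaT for them.
df['used_time'] = (pd.to_datetime(df['creatDate'], format='%Y%m%d', errors='coerce')
                   - pd.to_datetime(df['regDate'], format='%Y%m%d', errors='coerce')).dt.days

# The leading digit of the region/postal code can serve as a coarse city feature.
df['city'] = df['regionCode'].astype(str).str[:1]
```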
```python
# Group the training data by brand and aggregate price statistics per brand
# (rows with price <= 0 are dropped first).
train_gb = train_data.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    # +1 in the denominator guards against empty groups
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
df = df.merge(brand_fe, how='left', on='brand')
```
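Section 2.1 applies the same statistics to regionCode and model but shows only the brand code. A generic helper can reproduce the loop above for any grouping column; `make_stat_features` is my own name for this sketch, not a function from the baseline.

```python
import pandas as pd

def make_stat_features(df, key, price_col='price', prefix=None):
    """Aggregate price statistics per value of `key` (e.g. 'regionCode'
    or 'model') and merge them back as new columns."""
    prefix = prefix or key
    grp = df[df[price_col] > 0].groupby(key)[price_col]
    stats = grp.agg(['count', 'max', 'median', 'min', 'sum', 'std'])
    stats.columns = [f'{prefix}_amount', f'{prefix}_price_max',
                     f'{prefix}_price_median', f'{prefix}_price_min',
                     f'{prefix}_price_sum', f'{prefix}_price_std']
    return df.merge(stats.reset_index(), on=key, how='left')

# Usage, mirroring 2.1:
# df = make_stat_features(df, 'regionCode')
# df = make_stat_features(df, 'model')
```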
2.2 Building on the baseline's brand-vs-price statistical features, I built statistics for (regionCode, brand) vs. price, (regionCode, model) vs. price, and (model, brand) vs. price
```python
# Pairwise grouped price statistics over (regionCode, brand),
# (regionCode, model) and (model, brand).
rb_avg_price = df.groupby(['regionCode','brand'])['price'].mean().rename('rb_avg_price').reset_index()
rb_max_price = df.groupby(['regionCode','brand'])['price'].max().rename('rb_max_price').reset_index()
rb_min_price = df.groupby(['regionCode','brand'])['price'].min().rename('rb_min_price').reset_index()
rb_std_price = df.groupby(['regionCode','brand'])['price'].std().rename('rb_std_price').reset_index()
rb_med_price = df.groupby(['regionCode','brand'])['price'].median().rename('rb_med_price').reset_index()
rm_avg_price = df.groupby(['regionCode','model'])['price'].mean().rename('rm_avg_price').reset_index()
rm_max_price = df.groupby(['regionCode','model'])['price'].max().rename('rm_max_price').reset_index()
rm_min_price = df.groupby(['regionCode','model'])['price'].min().rename('rm_min_price').reset_index()
rm_std_price = df.groupby(['regionCode','model'])['price'].std().rename('rm_std_price').reset_index()
rm_med_price = df.groupby(['regionCode','model'])['price'].median().rename('rm_med_price').reset_index()
mb_avg_price = df.groupby(['model','brand'])['price'].mean().rename('mb_avg_price').reset_index()
mb_max_price = df.groupby(['model','brand'])['price'].max().rename('mb_max_price').reset_index()
mb_min_price = df.groupby(['model','brand'])['price'].min().rename('mb_min_price').reset_index()
mb_std_price = df.groupby(['model','brand'])['price'].std().rename('mb_std_price').reset_index()
mb_med_price = df.groupby(['model','brand'])['price'].median().rename('mb_med_price').reset_index()

# Merge each statistic back on its own grouping keys.
df = df.merge(rb_avg_price, on=['regionCode','brand'], how='left')
df = df.merge(rb_max_price, on=['regionCode','brand'], how='left')
df = df.merge(rb_min_price, on=['regionCode','brand'], how='left')
df = df.merge(rb_std_price, on=['regionCode','brand'], how='left')
df = df.merge(rb_med_price, on=['regionCode','brand'], how='left')
df = df.merge(rm_avg_price, on=['regionCode','model'], how='left')
df = df.merge(rm_max_price, on=['regionCode','model'], how='left')
df = df.merge(rm_min_price, on=['regionCode','model'], how='left')
df = df.merge(rm_std_price, on=['regionCode','model'], how='left')
df = df.merge(rm_med_price, on=['regionCode','model'], how='left')
df = df.merge(mb_avg_price, on=['model','brand'], how='left')
df = df.merge(mb_max_price, on=['model','brand'], how='left')
df = df.merge(mb_min_price, on=['model','brand'], how='left')
df = df.merge(mb_std_price, on=['model','brand'], how='left')
df = df.merge(mb_med_price, on=['model','brand'], how='left')
```
2.3 For the anonymous features, correlation analysis with price suggests v_3 and v_8 are redundant, so keeping one of them suffices. Since we cannot know how the anonymous features relate to price, I processed a few of the strongly related ones based on earlier feature importance. The processing consists of arithmetic operations (+, -, *, /) between anonymous features.
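The redundancy claim can be checked directly. Below is a small sketch on synthetic data; the real v_3/v_8 values come from the competition files, and here v_8 is deliberately constructed as a near-linear copy of v_3 just to illustrate the check.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: v_8 is almost a linear copy of v_3, so the two
# features carry the same information about price.
rng = np.random.default_rng(0)
v3 = rng.normal(size=200)
df = pd.DataFrame({
    'v_3': v3,
    'v_8': 2 * v3 + rng.normal(scale=0.01, size=200),
    'price': 3 * v3 + rng.normal(size=200),
})

# Pearson correlation of every feature with price; two features with
# near-identical correlation to price and a mutual correlation close
# to 1 are redundant -- keep one of them.
corr = df.corr()['price']
print(corr)
```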
```python
def get_feature(data):
    # Arithmetic combinations of the strongest anonymous features
    # (v_0, v_8, v_12); each column name matches its operation.
    # Divisions assume the denominator columns contain no zeros.
    data['v_12*v_8'] = data['v_12'] * data['v_8']
    data['v_12+v_8'] = data['v_12'] + data['v_8']
    data['v_12/v_8'] = data['v_12'] / data['v_8']
    data['v_12-v_8'] = data['v_12'] - data['v_8']
    data['v_12*v_0'] = data['v_12'] * data['v_0']
    data['v_12+v_0'] = data['v_12'] + data['v_0']
    data['v_12/v_0'] = data['v_12'] / data['v_0']
    data['v_12-v_0'] = data['v_12'] - data['v_0']
    data['v_0*v_8'] = data['v_0'] * data['v_8']
    data['v_0+v_8'] = data['v_0'] + data['v_8']
    data['v_0/v_8'] = data['v_0'] / data['v_8']
    data['v_0-v_8'] = data['v_0'] - data['v_8']
    data['v_12*v_8*v_0'] = data['v_12'] * data['v_8'] * data['v_0']
    data['v_12+v_8+v_0'] = data['v_12'] + data['v_8'] + data['v_0']
    data['v_12/v_8/v_0'] = data['v_12'] / data['v_8'] / data['v_0']
    data['v_12-v_8-v_0'] = data['v_12'] - data['v_8'] - data['v_0']
    return data
```
(If you made it in here, please take a look at this last paragraph!)
I hope the features above give you some inspiration. I would also like to ask everyone for an MAE evaluation framework (with log-transformed price). If anyone can share one, I would be very grateful.
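On the MAE question: a minimal cross-validated sketch of what such a framework might look like, assuming scikit-learn and a synthetic stand-in for the real feature matrix. The model is trained on log1p(price), and MAE is computed back on the original price scale after expm1, which is what the leaderboard metric needs.

```python
import numpy as np
from sklearn.datasets import make_regression      # stand-in for the real data
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the feature matrix and a positive, skewed target.
X, y_raw = make_regression(n_samples=500, n_features=10, random_state=0)
y_price = np.exp(y_raw / y_raw.std())             # strictly positive prices

# Train on log1p(price), evaluate MAE on the original price scale.
maes = []
for tr_idx, va_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[tr_idx], np.log1p(y_price[tr_idx]))
    pred = np.expm1(model.predict(X[va_idx]))     # back to price scale
    maes.append(mean_absolute_error(y_price[va_idx], pred))

print('CV MAE:', np.mean(maes))
```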