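Data Competition (3): Feature Construction and Feature Selection

"Data and features determine the upper bound of machine learning; models and algorithms merely approach that bound." Feature construction and feature selection therefore matter a great deal. This post walks through several methods for both.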

I. Feature Construction

    # Combine transit stations (subway + bus)
    data['bus_sub_num'] = data['subwayStationNum'] + data['busStationNum']
    # Combine schools
    data['school_num'] = data['interSchoolNum'] + data['schoolNum'] + data['privateSchoolNum']
    # Combine medical facilities
    data['help_sum'] = data['hospitalNum'] + data['drugStoreNum']
    # Combine everyday amenities
    data['play_sum'] = data['gymNum'] + data['parkNum'] + data['bankNum']
    # Combine shopping facilities
    data['shop_num'] = data['shopNum'] + data['mallNum'] + data['superMarketNum']
    # Other combinations
    data['totalNewTradeMoney_Workers'] = data['totalNewTradeMoney'] + data['totalWorkers']
    data['bankNum_Workers'] = data['bankNum'] + data['totalWorkers']
    data['gym_bankNum'] = data['bankNum'] + data['gymNum']
    # District second-hand housing price
    data['area_mean_price'] = (data['area'] * data['tradeMeanPrice']) / 1000
    # District new-housing price
    data['New_area_mean_price'] = (data['area'] * data['tradeNewMeanPrice']) / 1000
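The sums above all follow one pattern, so they can also be driven by a mapping from each new column name to its source columns. The following is a minimal sketch under stated assumptions: the toy DataFrame holds made-up values, and the helper name `add_sum_features` is hypothetical and not part of the original code.

```python
import pandas as pd


def add_sum_features(df, groups):
    """Return a copy of df with one summed column per entry in groups.

    add_sum_features is a hypothetical helper, not part of the original code.
    """
    out = df.copy()
    for new_col, source_cols in groups.items():
        # Row-wise sum of the listed source columns
        out[new_col] = out[source_cols].sum(axis=1)
    return out


# Toy data with made-up values; the column names follow the code above
toy = pd.DataFrame({
    "subwayStationNum": [1, 0, 3],
    "busStationNum": [10, 5, 20],
    "hospitalNum": [2, 1, 0],
    "drugStoreNum": [4, 3, 6],
})

groups = {
    "bus_sub_num": ["subwayStationNum", "busStationNum"],
    "help_sum": ["hospitalNum", "drugStoreNum"],
}

toy_with_sums = add_sum_features(toy, groups)
print(toy_with_sums[["bus_sub_num", "help_sum"]])
```

One difference to keep in mind: `DataFrame.sum` skips missing values by default, whereas adding columns with `+` propagates NaN, so the two approaches can disagree on rows with missing data.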

II. Feature Selection

1. Filter

(1) Information gain
(2) Correlation coefficient
(3) Chi-squared test

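    from sklearn.feature_selection import SelectKBest, SelectPercentile
    from sklearn.feature_selection import chi2

    # X and y are the feature matrix and target built from train_data.
    # chi2 requires non-negative feature values and is designed for classification targets.
    # Note: X_new holds the column indices of the 43 best-scoring features, not the data itself.
    X_new = SelectKBest(chi2, k=43).fit(X, y).get_support(indices=True)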
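The filter list names information gain and the correlation coefficient, but only the chi-squared test comes with code. The sketch below scores features with both of the other criteria. It rests on several assumptions: the data is synthetic, the names `X_demo` and `y_demo` are made up for illustration, and scikit-learn's `mutual_info_regression` stands in for information gain because the target here is continuous.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (an assumption for illustration): y depends on f0 and f1 only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"])
y_demo = 3.0 * X_demo["f0"] - 2.0 * X_demo["f1"] + rng.normal(scale=0.1, size=500)

# (2) Correlation coefficient: absolute Pearson correlation with the target
corr_scores = X_demo.corrwith(y_demo).abs().sort_values(ascending=False)

# (1) Information gain: estimated mutual information between each feature and y
mi = mutual_info_regression(X_demo, y_demo, random_state=0)
mi_scores = pd.Series(mi, index=X_demo.columns).sort_values(ascending=False)

# Keep the k best features under each score
k = 2
top_by_corr = list(corr_scores.index[:k])
top_by_mi = list(mi_scores.index[:k])
print(top_by_corr, top_by_mi)
```

Pearson correlation only captures linear relationships, while mutual information can also pick up nonlinear ones, so the two rankings need not agree on real data.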

2. Wrapper

(1) Recursive feature elimination (RFE)

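    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    # Features and target: tradeMoney is the value to predict
    X = train_data.drop(["tradeMoney"], axis=1)
    y = train_data["tradeMoney"]
    X_columns = X.columns

    # Repeatedly fit the model and drop the weakest feature until 40 remain
    lr = LinearRegression()
    rfe = RFE(lr, n_features_to_select=40)
    rfe.fit(X, y)
    print(rfe.ranking_, rfe.n_features_, rfe.support_)
    # Keep the features that RFE marked as selected
    sel_features = [f for f, s in zip(X_columns, rfe.support_) if s]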
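To make the elimination loop behind RFE concrete, here is a simplified from-scratch sketch that uses only NumPy. It is not scikit-learn's implementation: `simple_rfe` is a hypothetical helper, the data is synthetic, and the rule of dropping the feature with the smallest absolute coefficient, one per round, is an assumption chosen to mirror the idea.

```python
import numpy as np


def simple_rfe(X, y, n_keep):
    """Hypothetical helper: drop one feature per round by smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Fit ordinary least squares (with an intercept column) on the remaining features
        A = np.column_stack([X[:, remaining], np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Remove the feature whose coefficient has the smallest magnitude
        weakest = int(np.argmin(np.abs(coef[:-1])))
        remaining.pop(weakest)
    return remaining


# Synthetic data: only columns 0 and 3 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

kept = simple_rfe(X_demo, y_demo, n_keep=2)
print(kept)
```

Because the ranking relies on coefficient magnitudes, features on very different scales should be standardized first; otherwise a feature measured in large units can look unimportant simply because its coefficient is small.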

3. Embedded

(1) Penalty-based feature selection

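    from sklearn.linear_model import Ridge

    ridge = Ridge(alpha=5)
    ridge.fit(X, y)

    X_columns = X.columns  # feature names, in the same order as ridge.coef_
    # Sort coefficients in ascending order and reorder the names the same way
    coefSort = ridge.coef_.argsort()
    featureCoefSore = ridge.coef_[coefSort]
    sortedColumns = X_columns[coefSort]
    # Keep features whose coefficient magnitude exceeds 2
    # (pair each coefficient with its own feature name, not the unsorted names)
    sel_features = [f for f, s in zip(sortedColumns, featureCoefSore) if abs(s) > 2]
    train = train_data[sel_features]
    test = test_data[sel_features]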
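Ridge uses an L2 penalty, which shrinks coefficients but rarely makes them exactly zero, so a hand-picked cutoff is needed, and that cutoff depends on each feature's scale. A common alternative, swapped in here and not used in the original code, is an L1 penalty (Lasso) combined with `SelectFromModel`. The sketch below assumes synthetic data, and `alpha=0.1` is an illustrative value, not a recommendation.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with columns on very different scales; only columns 0 and 1 matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6)) * np.array([1, 10, 100, 1, 10, 100])
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Standardize first so the penalty treats every feature on the same scale,
# then keep the features whose Lasso coefficient is not (numerically) zero
selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.1), threshold=1e-5),
)
selector.fit(X_demo, y_demo)

mask = selector[-1].get_support()
selected_idx = np.flatnonzero(mask)
print(selected_idx)
```

In practice the penalty strength would normally be tuned, for example by cross-validation, since a larger `alpha` removes more features.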

(2) Tree-based feature selection

(3) Random forest: mean decrease in impurity

    from sklearn.ensemble import RandomForestRegressor

    # Train a random forest and read each feature's importance score from feature_importances_
    rf = RandomForestRegressor()
    rf.fit(X, y)
    X_columns = X.columns  # feature names, in the same order as feature_importances_
    print("Features sorted by their score:")
    print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), X_columns),
                 reverse=True))
    # Keep features whose importance exceeds 0.001
    sel_features = [f for f, s in zip(X_columns, rf.feature_importances_) if abs(s) > 0.001]
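When choosing a cutoff such as 0.001, it helps to know that scikit-learn normalizes impurity-based importances so that they sum to 1. The sketch below checks this on synthetic data and, as one illustrative choice that is an assumption and not the original author's rule, keeps features whose importance exceeds the uniform share `1 / n_features`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only columns 0 and 2 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = 3.0 * X_demo[:, 0] + 2.5 * X_demo[:, 2] + rng.normal(scale=0.1, size=400)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

importances = rf_demo.feature_importances_
# Impurity-based importances are normalized, so they sum to 1
total = importances.sum()

# Rank features from most to least important
order = np.argsort(importances)[::-1]
# Keep features above the uniform share 1 / n_features (an illustrative threshold)
threshold = 1.0 / X_demo.shape[1]
selected_idx = sorted(int(i) for i in np.flatnonzero(importances > threshold))
print(total, order, selected_idx)
```

A relative threshold like this adapts to the number of features, whereas a fixed value such as 0.001 becomes more or less strict as the feature count changes.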

 
