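I. Regularization
1. L1/Lasso
L1 regularization yields sparse solutions, so it naturally performs feature selection. Note, however, that a feature L1 does not select is not necessarily unimportant: of two highly correlated features, only one may be kept. To decide which features really matter, cross-check the result with L2 regularization.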
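The point about correlated features can be illustrated with a small sketch. The data is synthetic and the alpha values are arbitrary assumptions, not taken from the original example: two nearly identical features both drive the target, and the Lasso and Ridge coefficients are printed side by side for comparison.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
base = rng.normal(0, 1, n)
# Two almost identical features, both equally relevant to the target.
x1 = base + rng.normal(0, 0.01, n)
x2 = base + rng.normal(0, 0.01, n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(0, 1, n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10).fit(X, y)

# Lasso tends to put most of the weight on one feature of the pair,
# while Ridge spreads it almost evenly across both.
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
```

If Lasso leaves one of the two coefficients near zero while Ridge gives them similar weights, that is a hint the dropped feature was redundant rather than irrelevant.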
Example: the code below runs Lasso on the Boston housing data, with the parameter alpha tuned by grid search.
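import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston

# Helper assumed here (not shown in the original text): formats coefficients as a linear formula
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)

boston = load_boston()
scaler = StandardScaler()
X = scaler.fit_transform(boston["data"])
Y = boston["target"]
names = boston["feature_names"]
lasso = Lasso(alpha=.3)
lasso.fit(X, Y)
print("Lasso model: ", pretty_print_linear(lasso.coef_, names, sort=True))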
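As the output shows, many of the feature coefficients are 0. Increasing alpha further makes the model sparser and sparser, that is, more and more coefficients become 0. However, like an unregularized linear model, L1 regularization is unstable: if the feature set contains correlated features, a small change in the data can produce a very different model.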
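The effect of alpha on sparsity can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data from make_regression instead of the Boston data, and it expresses alpha as a fraction of alpha_max, the smallest alpha at which all Lasso coefficients are zero for this data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 5 of which are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Smallest alpha at which every Lasso coefficient is zero for this data.
alpha_max = np.max(np.abs(X.T @ (y - y.mean()))) / len(y)

nonzero_counts = {}
for frac in (0.001, 0.01, 0.1, 0.5, 1.01):
    alpha = frac * alpha_max
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    # Count the features that survive at this regularization strength.
    nonzero_counts[frac] = int(np.sum(model.coef_ != 0))
    print("alpha = %.3f * alpha_max -> %d nonzero coefficients"
          % (frac, nonzero_counts[frac]))
```

The number of nonzero coefficients should shrink as alpha grows and reach 0 just past alpha_max.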
2. L2/Ridge
Example:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score

size = 100
# We run the method 10 times with different random seeds
for i in range(10):
    print("Random seed %s" % i)
    np.random.seed(seed=i)
    # Three features that are noisy copies of the same underlying signal
    X_seed = np.random.normal(0, 1, size)
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X3 = X_seed + np.random.normal(0, .1, size)
    Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T
    # pretty_print_linear is the same formatting helper used in the Lasso example
    lr = LinearRegression()
    lr.fit(X, Y)
    print("Linear model:", pretty_print_linear(lr.coef_))
    ridge = Ridge(alpha=10)
    ridge.fit(X, Y)
    print("Ridge model:", pretty_print_linear(ridge.coef_))
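The comparison in the loop above can also be summarized as a single number per model. The sketch below is an assumption-laden variant rather than the original experiment: it draws the data with np.random.default_rng instead of np.random.seed, so the individual coefficients differ from the loop above, and it reports the standard deviation of each coefficient across the 10 runs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

size = 100
lr_coefs, ridge_coefs = [], []
for seed in range(10):
    rng = np.random.default_rng(seed)
    X_seed = rng.normal(0, 1, size)
    # Three noisy copies of the same underlying signal.
    X = np.column_stack([X_seed + rng.normal(0, .1, size) for _ in range(3)])
    Y = X.sum(axis=1) + rng.normal(0, 1, size)
    lr_coefs.append(LinearRegression().fit(X, Y).coef_)
    ridge_coefs.append(Ridge(alpha=10).fit(X, Y).coef_)

# Standard deviation of each coefficient across the 10 runs.
lr_std = np.std(lr_coefs, axis=0)
ridge_std = np.std(ridge_coefs, axis=0)
print("Linear model coefficient std:", np.round(lr_std, 3))
print("Ridge model coefficient std: ", np.round(ridge_std, 3))
```

A much smaller spread for Ridge than for the plain linear model is what makes L2 coefficients easier to interpret when features are correlated.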
II. Feature importance from tree-based models
1. RF
2. ExtraTree
3. Adaboost
4. GBDT
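The four scikit-learn ensembles listed above (RF, ExtraTree, Adaboost, GBDT) all expose a feature_importances_ attribute once fitted. A minimal sketch, assuming synthetic regression data and arbitrary small n_estimators values chosen only to keep the run short:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)

# Synthetic stand-in data: 8 features, 3 of them informative.
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

models = {
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "ExtraTree": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "Adaboost": AdaBoostRegressor(n_estimators=50, random_state=0),
    "GBDT": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

importances = {}
for name, model in models.items():
    model.fit(X, y)
    # Impurity-based importances, normalized to sum to 1.
    importances[name] = model.feature_importances_
    print(name, importances[name].round(3))
```

The four models need not agree on the ranking, which is one motivation for merging their top-k lists later on.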
5. XGBoost
get_score(fmap='', importance_type='weight')
fmap is a txt file containing the feature name mapping; importance_type specifies how importance is computed and takes one of five values:
‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.
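[1] importance_type=weight (the default): the number of times a feature is used as a split attribute across all trees (how often the feature appears in the nodes of the whole ensemble; the more often it appears, the more valuable it is taken to be).
[2] importance_type=gain: the average reduction in loss when the feature is used as a split attribute (the sum of the gain over all splits on the feature across the ensemble, divided by the number of times the feature is used).
[3] importance_type=cover: the average sample coverage when the feature is used as a split attribute (the sum of the second-order derivatives of the samples at the nodes split on the feature, divided by the number of times the feature is used).
[4] importance_type=total_gain: same as gain but with average_over_splits=False, that is, the gain without dividing by the number of occurrences.
[5] importance_type=total_cover: same as cover but with average_over_splits=False, that is, the cover without dividing by the number of occurrences.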
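The relationship among the five types can be checked by hand on a toy list of splits. The split records below are made up for illustration and do not come from a real model; the sketch only mirrors the definitions above in plain Python and does not call xgboost.

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain, cover). Values are made up.
splits = [
    ("f0", 4.0, 10.0),
    ("f0", 2.0, 6.0),
    ("f1", 9.0, 20.0),
    ("f0", 3.0, 8.0),
    ("f1", 1.0, 4.0),
]

weight = defaultdict(int)
total_gain = defaultdict(float)
total_cover = defaultdict(float)
for feature, split_gain, split_cover in splits:
    weight[feature] += 1                 # 'weight': number of splits
    total_gain[feature] += split_gain    # 'total_gain': summed gain
    total_cover[feature] += split_cover  # 'total_cover': summed cover

# 'gain' and 'cover' are the totals averaged over the number of splits.
gain = {f: total_gain[f] / weight[f] for f in weight}
cover = {f: total_cover[f] / weight[f] for f in weight}

for f in sorted(weight):
    print(f, "weight=%d gain=%.2f cover=%.2f total_gain=%.1f total_cover=%.1f"
          % (weight[f], gain[f], cover[f], total_gain[f], total_cover[f]))
```

In this toy case f0 ranks first by weight (3 splits against 2) while f1 ranks first by gain (5.0 against 3.0), which shows why the choice of importance_type can change the ordering.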
The constructor shows that the xgboost sklearn API computes feature importance with importance_type="gain" by default, whereas the native get_score method defaults to importance_type="weight".
def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
             verbosity=1, silent=None, objective="reg:linear", booster='gbtree',
             n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
             max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
             colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             base_score=0.5, random_state=0, seed=None, missing=None,
             # the default is declared here
             importance_type="gain", **kwargs):
6. LightGBM
7. Select the top-k features with each of RF, Xgboost, and ExtraTree, then merge them (the code below uses Adaboost rather than Xgboost)
import pandas as pd
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

def get_top_k_feature(features, model, top_n_features):
    # model is a fitted GridSearchCV; rank features by the best estimator's importances
    feature_imp_sorted = pd.DataFrame({
        'feature': features,
        'importance': model.best_estimator_.feature_importances_,
    }).sort_values('importance', ascending=False)  # most important first
    features_top_n = feature_imp_sorted.head(top_n_features)['feature']
    return features_top_n

def ensemble_model_feature(X, Y, top_n_features):
    # X is expected to be a pandas DataFrame, so list(X) yields its column names
    features = list(X)
    # Random forest
    rf = ensemble.RandomForestRegressor()
    rf_param_grid = {'n_estimators': [900], 'random_state': [2, 4, 6, 8]}
    rf_grid = GridSearchCV(rf, rf_param_grid, cv=10, verbose=1, n_jobs=25)
    rf_grid.fit(X, Y)
    top_n_features_rf = get_top_k_feature(features=features, model=rf_grid, top_n_features=top_n_features)
    print('RF selection done')
    # Adaboost (reuses the same parameter grid)
    abr = ensemble.AdaBoostRegressor()
    abr_grid = GridSearchCV(abr, rf_param_grid, cv=10, n_jobs=25)
    abr_grid.fit(X, Y)
    top_n_features_abr = get_top_k_feature(features=features, model=abr_grid, top_n_features=top_n_features)
    print('Adaboost selection done')
    # ExtraTree (reuses the same parameter grid)
    etr = ensemble.ExtraTreesRegressor()
    etr_grid = GridSearchCV(etr, rf_param_grid, cv=10, n_jobs=25)
    etr_grid.fit(X, Y)
    top_n_features_etr = get_top_k_feature(features=features, model=etr_grid, top_n_features=top_n_features)
    print('ExtraTree selection done')
    # Merge the results of the 3 models and drop duplicates
    features_top_n = pd.concat([top_n_features_rf, top_n_features_abr, top_n_features_etr], ignore_index=True).drop_duplicates()
    print(features_top_n)
    print(len(features_top_n))
    return features_top_n
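A lighter-weight sketch of the same merging idea, for quick experiments. It is not the routine above: it skips the grid search, uses small n_estimators values, and runs on synthetic data with hypothetical column names f0 to f9. The helper top_k_features is introduced here only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              RandomForestRegressor)

def top_k_features(model, X, y, k):
    """Fit model and return the names of its k most important features."""
    model.fit(X, y)
    ranked = pd.Series(model.feature_importances_, index=X.columns)
    return ranked.sort_values(ascending=False).head(k).index.to_series()

# Synthetic stand-in data with hypothetical column names f0..f9.
data, target = make_regression(n_samples=200, n_features=10, n_informative=3,
                               noise=5.0, random_state=0)
X = pd.DataFrame(data, columns=["f%d" % i for i in range(10)])

k = 3
selected = [
    top_k_features(RandomForestRegressor(n_estimators=50, random_state=0), X, target, k),
    top_k_features(AdaBoostRegressor(n_estimators=50, random_state=0), X, target, k),
    top_k_features(ExtraTreesRegressor(n_estimators=50, random_state=0), X, target, k),
]
# Union of the three top-k lists, duplicates removed.
merged = pd.concat(selected, ignore_index=True).drop_duplicates()
print(list(merged))
```

The merged list has between k and 3k entries: exactly k when the three models agree completely, more when they disagree.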