机器学习任务中,在正式训练模型之前,我们一般会从原始数据中尽可能多的提取特征,作为模型的输入。
但是特征也不是越多越好,就像《赌神》里的戒指梗:高进在最近500盘赌局中偷鸡时喜欢摸戒指,让对手误以为发现了他的习惯:一摸戒指就要偷鸡,结果在比赛中上当了。摸戒指就是一个不好的特征,把它纳入模型中,反而会使模型变差。
本文介绍的Boruta算法就是一种特征选择方法,使用特征的重要性来选取特征。boruta_py是Boruta算法的python实现,类似于sklearn的扩展,用起来很方便。
Boruta的主要步骤如下:
1. 创建阴影特征 (shadow feature) : 对每个真实特征R,随机打乱顺序,得到阴影特征矩阵S,拼接到真实特征后面,构成新的特征矩阵N = [R, S].
2. 用新的特征矩阵N作为输入,训练模型,能输出feature_importances_的模型,如RandomForest, lightgbm,xgboost都可以,得到真实特征和阴影特征的feature importances,
3. 取阴影特征feature importance的最大值S_max,真实特征中feature importance大于S_max的,记录一次命中。
4. 用(3)中记录的真实特征累计命中,标记特征重要或不重要。原论文中用Bonferroni校正作显著性检验,boruta_py认为Bonferroni校正太过保守,默认增加了FDR校正,用two_step参数可以切换两种检验方法。
5. 删除不重要的特征,重复1-4,直到所有特征都被标记。
训练结束后,boruta_py 还可以输出特征ranking_,表示特征的重要性等级,在特征选择中也是一个很有用的指标。
下面结合代码看一下两种检验的区别,boruta_py的特征检验在实现如下。
def _do_tests(self, dec_reg, hit_reg, _iter):
active_features = np.where(dec_reg >= 0)[0]
hits = hit_reg[active_features]
# get uncorrected p values based on hit_reg
to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()
if self.two_step:
# two step multicor process
# first we correct for testing several features in each round using FDR
to_accept = self._fdrcorrection(to_accept_ps, alpha=self.alpha)[0]
to_reject = self._fdrcorrection(to_reject_ps, alpha=self.alpha)[0]
# second we correct for testing the same feature over and over again
# using bonferroni
to_accept2 = to_accept_ps <= self.alpha / float(_iter)
to_reject2 = to_reject_ps <= self.alpha / float(_iter)
# combine the two multi corrections, and get indexes
to_accept *= to_accept2
to_reject *= to_reject2
else:
# as in th original Boruta, we simply do bonferroni correction
# with the total n_feat in each iteration
to_accept = to_accept_ps <= self.alpha / float(len(dec_reg))
to_reject = to_reject_ps <= self.alpha / float(len(dec_reg))
# find features which are 0 and have been rejected or accepted
to_accept = np.where((dec_reg[active_features] == 0) * to_accept)[0]
to_reject = np.where((dec_reg[active_features] == 0) * to_reject)[0]
# updating dec_reg
dec_reg[active_features[to_accept]] = 1
dec_reg[active_features[to_reject]] = -1
return dec_reg
def _fdrcorrection(self, pvals, alpha=0.05):
"""
Benjamini/Hochberg p-value correction for false discovery rate, from
statsmodels package. Included here for decoupling dependency on statsmodels.
Parameters
----------
pvals : array_like
set of p-values of the individual tests.
alpha : float
error rate
Returns
-------
rejected : array, bool
True if a hypothesis is rejected, False if not
pvalue-corrected : array
pvalues adjusted for multiple hypothesis testing to limit FDR
"""
pvals = np.asarray(pvals)
pvals_sortind = np.argsort(pvals)
pvals_sorted = np.take(pvals, pvals_sortind)
nobs = len(pvals_sorted)
ecdffactor = np.arange(1, nobs + 1) / float(nobs)
reject = pvals_sorted <= ecdffactor * alpha
if reject.any():
rejectmax = max(np.nonzero(reject)[0])
reject[:rejectmax] = True
pvals_corrected_raw = pvals_sorted / ecdffactor
pvals_corrected = np.minimum.accumulate(pvals_corrected_raw[::-1])[::-1]
pvals_corrected[pvals_corrected > 1] = 1
# reorder p-values and rejection mask to original order of pvals
pvals_corrected_ = np.empty_like(pvals_corrected)
pvals_corrected_[pvals_sortind] = pvals_corrected
reject_ = np.empty_like(reject)
reject_[pvals_sortind] = reject
return reject_, pvals_corrected_
其中 dec_reg 中用 1/0/-1 分别表示特征 接受/待定/拒绝 三种状态,hit_reg 存储了多次实验后每个特征的累计hit数,_iter为已经进行的实验次数,self.alpha参数是显著水平,默认值0.05.
Bonferroni校正:如果在同一数据集上同时检验n个独立的假设,那么用于每一假设的统计显著水平,应为仅检验一个假设时的显著水平的1/n,所以在比较时用self.alpha 除以 float(len(dec_reg))。
需要注意的是,boruta_py做特征选择时,会检查特征矩阵中是否包含NAN, INF这样的无效值,如果用lightgbm这样可以处理NAN值的模型来输出feature importances,可以把这一步去掉。
参考
Kursa M B, Rudnicki W R. Feature Selection with the Boruta Package[J]. Journal of Statistical Software, 2010, 36(11):1-13.