

While looking through the R help documentation for the boosting function, I found a paper on the implementation of the AdaBoost algorithm [1]. It states: “The package adabag 3.2, available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=adabag, is the current update of adabag that changes the measure of relative importance of the predictor variables using the gain of the Gini index given by a variable in a tree and, in the case of the boosting function, the weight of this tree. For this goal, the varImp function of the caret package (Kuhn 2008, 2012) is used to get the gain of the Gini index of the variables in each tree.” In short, the importance of each variable is computed from the gain in the Gini index that the variable contributes in each tree, combined with the weight of that tree. In R, the varImp function returns the Gini gain of each variable within a single tree.
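The weighting described above can be written as a single formula. This is a sketch inferred from the quoted description and from the boosting source code reproduced later in this post, not a formula taken from the paper; the symbols are my own notation. Here $G_t(j)$ is the Gini gain that varImp reports for variable $j$ in tree $t$, $\alpha_t$ is the weight of tree $t$ (a$weights), $T$ is the number of trees and $p$ the number of predictor variables:

```latex
I_j \;=\; 100 \cdot
\frac{\sum_{t=1}^{T} \alpha_t \, G_t(j)}
     {\sum_{k=1}^{p} \sum_{t=1}^{T} \alpha_t \, G_t(k)},
\qquad j = 1, \dots, p .
```

Under this reading the importances are percentages that sum to 100, and a variable that is never used for a split in any tree gets an importance of 0.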


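Finding the cause of the discrepancy requires a closer look at how the boosting function computes variable importance. I therefore downloaded the source code of boosting and found that it passes two arguments when calling varImp: varImp(a$trees[[1]], surrogates = FALSE, competes = FALSE). According to the documentation, setting these two arguments to FALSE means that surrogate splits and competing splits are ignored; I will not go into further detail here. The computed result differs depending on whether these two arguments are set.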


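library(adabag)  # boosting()
library(caret)   # varImp()

# Train the AdaBoost model (data_train is the training set prepared earlier).
# coeflearn and boos differ from the defaults ('Breiman' and TRUE).
a <- boosting(class ~ ., data_train, coeflearn = 'Zhu', boos = FALSE)
# Number of trees (matches the default mfinal of boosting)
mfinal <- 100
# Number of predictor variables in data_train
nvar <- 6
# One row per tree, one column per variable
imp <- array(0, c(mfinal, nvar))
# Weight of each tree
pond <- a$weights
for (i in 1:mfinal) {
  # Gini gain of each variable in tree i, ignoring surrogate and competing splits
  k <- varImp(a$trees[[i]], surrogates = FALSE, competes = FALSE)
  imp[i, ] <- k[sort(row.names(k)), ]
}
# Weighted sum of the gains over all trees, rescaled to sum to 100
imppond <- as.vector(as.vector(pond) %*% imp)
imppond <- imppond / sum(imppond) * 100
names(imppond) <- sort(row.names(k))
importance <- imppond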

[1] Alfaro, E., Gamez, M. and Garcia, N. (2013): “adabag: An R Package for Classification with Boosting and Bagging”. Journal of Statistical Software, Vol. 54, No. 2, pp. 1–35.
