A note before we begin: "Do machine learning like the great engineer you are, not like the great machine learning expert you aren't." --Google
This line has long been my guiding principle for machine learning. Libraries such as Scikit-learn make machine learning simple and convenient to use, so why should we still bother to understand the mathematics behind the algorithms and how to implement them? To me it is like the Sword branch and the Qi branch of the Huashan sect: the Sword branch stresses moves and is quick to master, while the Qi branch stresses inner technique and meets all changes with constancy. Calling a library hands you ready-made moves and routines for solving problems, while understanding the principles and the implementation teaches you how to analyze problems. Only by combining the two can you, in real engineering work, analyze each new problem and then apply the right routine to solve it, reaching the state where no fixed move beats any move.
This post covers a Python implementation of the random forest algorithm (RandomForest algorithm); decision trees, the math behind them, and implementations of the other ensemble methods will follow in later posts.
Ensemble learning falls mainly into two families: bagging and boosting. Let us first look at what characterizes each and how they differ.
Bagging (bootstrap aggregating)
The Bagging procedure is as follows (a minimal sketch in code follows the list):
Draw n training samples from the original sample set with the bootstrap method; repeat for k rounds to obtain k training sets. (The k training sets are mutually independent, and elements may repeat within a set.)
Train one model on each of the k training sets, giving k models. (The model type depends on the problem at hand: decision trees, kNN, and so on.)
For classification, the final result is decided by majority vote; for regression, the final prediction is the mean of the k models' predictions. (All models carry equal weight.)
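A minimal sketch of this procedure, assuming the caller supplies a train_model function that fits one base learner on a data set and returns a callable predictor (train_model and the returned predictors are placeholders for illustration):

import random

def baggingFit(dataSet, train_model, k):
    # train k models, each on a bootstrap sample the size of the data set
    n = len(dataSet)
    models = []
    for _ in range(k):
        # sample n points with replacement (bootstrap)
        sample = [dataSet[random.randint(0, n - 1)] for _ in range(n)]
        models.append(train_model(sample))
    return models

def baggingPredict(models, x):
    # classification: majority vote over the k models' predictions
    votes = {}
    for model in models:
        label = model(x)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)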
Boosting
The Boosting procedure is as follows (a sketch of the AdaBoost weight update follows the list):
Assign each training sample a weight wi that reflects how much attention it receives. When a sample has a high probability of being misclassified, its weight must be increased.
Each iteration produces one weak classifier, and some strategy is needed to combine them into the final model. (AdaBoost, for example, assigns each weak classifier a coefficient and takes their linear combination as the final classifier; the smaller a weak classifier's error, the larger its coefficient.)
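For concreteness, here is a sketch of the AdaBoost weight update just described, assuming labels in {-1, +1} and a weighted error strictly between 0 and 1 (the function name and argument layout are illustrative, not from the original algorithm statement):

import math

def adaboostRound(weights, predictions, labels):
    # weighted error of this round's weak classifier (assumes 0 < err < 1)
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    # the classifier's coefficient: the smaller the error, the larger alpha
    alpha = 0.5 * math.log((1 - err) / err)
    # increase the weights of misclassified samples, decrease the others
    newWeights = [w * math.exp(alpha if p != y else -alpha)
                  for w, p, y in zip(weights, predictions, labels)]
    # renormalize so the weights again sum to 1
    z = sum(newWeights)
    return alpha, [w / z for w in newWeights]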
Main differences between Bagging and Boosting
Sample selection: Bagging draws bootstrap samples randomly with replacement, whereas Boosting keeps the training set fixed from round to round and only changes the weight of each sample.
Sample weights: Bagging samples uniformly, so every sample has equal weight; Boosting adjusts sample weights according to the error rate, and the more often a sample is misclassified, the larger its weight.
Prediction functions: in Bagging all prediction functions carry equal weight; in Boosting, prediction functions with smaller error get larger weights.
Parallelism: in Bagging the prediction functions can be generated in parallel; in Boosting they must be generated sequentially, one iteration after another.
Combining decision trees with these frameworks yields the following new algorithms:
1) Bagging + decision trees = random forest
2) AdaBoost + decision trees = boosted trees
3) Gradient Boosting + decision trees = GBDT
Random Forests
A random forest is an important Bagging-based ensemble method that can be used for classification, regression, and other problems.
Similar to the Bagging procedure described above, a random forest is built roughly as follows (a scikit-learn equivalent is shown after the list):
1. Draw m samples from the original training set by bootstrap sampling with replacement; repeat n_tree times to generate n_tree training sets.
2. Train one decision tree model on each of the n_tree training sets.
3. Within a single tree, if each training sample has n features, every split selects the best feature according to information gain, information gain ratio, or the Gini index.
4. Each tree keeps splitting this way until all training samples at a node belong to the same class; no pruning is applied during splitting.
5. The generated trees together form the random forest. For classification, the final class is decided by a vote over the trees; for regression, the final prediction is the mean of the trees' predicted values.
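For comparison, here is a sketch of the same procedure through scikit-learn's RandomForestClassifier, assuming a feature matrix X and a label vector y have already been loaded (X and y are placeholders):

from sklearn.ensemble import RandomForestClassifier

# 10 trees, each grown on a bootstrap sample, with sqrt(n_features)
# candidate features considered at each split; prediction is by voting
clf = RandomForestClassifier(n_estimators=10, max_features='sqrt')
clf.fit(X, y)
print(clf.predict(X[:1]))

Note one design difference: scikit-learn re-draws the candidate feature subset at every split, whereas the implementation later in this post draws one feature subset per tree.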
Random forests have many advantages:
1. Very high accuracy.
2. The injected randomness makes random forests hard to overfit.
3. The injected randomness also gives random forests good robustness to noise.
4. They can handle very high-dimensional data without requiring feature selection.
5. They handle both discrete and continuous attributes, and the data set needs no normalization.
6. Training is fast, and a ranking of variable importance comes out as a by-product.
7. They are easy to parallelize.
Disadvantages of random forests:
When the forest contains many decision trees, training takes considerable time and space.
A random forest is also hard to interpret in many respects; it is something of a black-box model.
Below is the Python implementation of RandomForestClassification:
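The trees below split on the Gini index: for a node whose sample set D has class proportions p_k, calcGini computes

$$\mathrm{Gini}(D) = 1 - \sum_k p_k^2$$

and chooseBestFeature picks the feature a and threshold v that minimize the weighted Gini of the two children:

$$\mathrm{Gini}(D, a, v) = \frac{|D^{\le v}|}{|D|}\,\mathrm{Gini}(D^{\le v}) + \frac{|D^{> v}|}{|D|}\,\mathrm{Gini}(D^{> v})$$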
# coding:utf-8
"""
Random forest classification (RandomForestClassification) on the wine
data set, using classes 1 and 2 only (binary classification).
"""
import copy
import math
import random

import pandas as pd
# When the last remaining feature still cannot fully separate the samples,
# the most frequent label is chosen as the final class.
def majorClass(classList):
    classDict = {}
    for cls in classList:
        classDict[cls] = classDict.get(cls, 0) + 1
    sortClass = sorted(classDict.items(), key=lambda item: item[1])
    return sortClass[-1][0]
# Compute the Gini index of a data set
def calcGini(dataSet):
    labelCounts = {}
    # count the occurrences of every class label
    for dt in dataSet:
        currentLabel = dt[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    Gini = 1
    for key in labelCounts:
        prob = labelCounts[key] / len(dataSet)
        Gini -= prob * prob
    return Gini
# Split the data set on a continuous feature at a given threshold
def splitDataSet(dataSet, featIndex, value):
    leftData, rightData = [], []
    for dt in dataSet:
        if dt[featIndex] <= value:
            leftData.append(dt)
        else:
            rightData.append(dt)
    return leftData, rightData
# Choose the best split: the feature/threshold pair with the lowest
# weighted Gini index of the two children
def chooseBestFeature(dataSet):
    bestGini = 1
    bestFeatureIndex = -1
    bestSplitValue = None
    # iterate over the i-th feature
    for i in range(len(dataSet[0]) - 1):
        featList = [dt[i] for dt in dataSet]
        # candidate thresholds: midpoints between adjacent distinct values
        sortfeatList = sorted(set(featList))
        splitList = []
        for j in range(len(sortfeatList) - 1):
            splitList.append((sortfeatList[j] + sortfeatList[j + 1]) / 2)
        # evaluate each candidate threshold, keeping the best one
        for splitValue in splitList:
            newGini = 0
            subDataSet0, subDataSet1 = splitDataSet(dataSet, i, splitValue)
            newGini += len(subDataSet0) / len(dataSet) * calcGini(subDataSet0)
            newGini += len(subDataSet1) / len(dataSet) * calcGini(subDataSet1)
            if newGini < bestGini:
                bestGini = newGini
                bestFeatureIndex = i
                bestSplitValue = splitValue
    return bestFeatureIndex, bestSplitValue
# Remove the i-th feature and split the data set on it
def splitData(dataSet, featIndex, features, value):
    newFeatures = copy.deepcopy(features)
    newFeatures.remove(features[featIndex])
    leftData, rightData = [], []
    for dt in dataSet:
        temp = []
        temp.extend(dt[:featIndex])
        temp.extend(dt[featIndex + 1:])
        if dt[featIndex] <= value:
            leftData.append(temp)
        else:
            rightData.append(temp)
    return newFeatures, leftData, rightData
# Build the decision tree
def createTree(dataSet, features):
    classList = [dt[-1] for dt in dataSet]
    # all labels identical: the node is pure
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # down to the last feature and still not pure: take the majority label
    if len(features) == 1:
        return majorClass(classList)
    bestFeatureIndex, bestSplitValue = chooseBestFeature(dataSet)
    # no valid split exists (all remaining feature values identical)
    if bestFeatureIndex == -1:
        return majorClass(classList)
    bestFeature = features[bestFeatureIndex]
    # new data sets with bestFeature removed
    newFeatures, leftData, rightData = splitData(dataSet, bestFeatureIndex, features, bestSplitValue)
    # two subtrees: left holds values <= threshold, right holds values > threshold
    myTree = {bestFeature: {'<' + str(bestSplitValue): {}, '>' + str(bestSplitValue): {}}}
    myTree[bestFeature]['<' + str(bestSplitValue)] = createTree(leftData, newFeatures)
    myTree[bestFeature]['>' + str(bestSplitValue)] = createTree(rightData, newFeatures)
    return myTree
# Classify a test sample with the generated decision tree
def treeClassify(decisionTree, featureLabel, testDataSet):
    firstFeature = next(iter(decisionTree))
    secondFeatDict = decisionTree[firstFeature]
    # both keys share the numeric part, e.g. '<2.5' and '>2.5'
    splitValue = float(next(iter(secondFeatDict))[1:])
    featureIndex = featureLabel.index(firstFeature)
    if testDataSet[featureIndex] <= splitValue:
        valueOfFeat = secondFeatDict['<' + str(splitValue)]
    else:
        valueOfFeat = secondFeatDict['>' + str(splitValue)]
    if isinstance(valueOfFeat, dict):
        pred_label = treeClassify(valueOfFeat, featureLabel, testDataSet)
    else:
        pred_label = valueOfFeat
    return pred_label
# Draw a bootstrap sample of the same size as the original training set,
# restricted to a random subset of sqrt(m - 1) of the features
def baggingDataSet(dataSet):
    n, m = dataSet.shape
    features = random.sample(list(dataSet.columns.values[:-1]), int(math.sqrt(m - 1)))
    features.append(dataSet.columns.values[-1])
    rows = [random.randint(0, n - 1) for _ in range(n)]
    trainData = dataSet.iloc[rows][features]
    return trainData.values.tolist(), features
def testWine():
    df = pd.read_csv('wine.txt', header=None)
    labels = df.columns.values.tolist()
    # keep classes 1 and 2 only (binary classification)
    df = df[df[labels[-1]] != 3]
    # grow the decision trees and collect them in a list
    treeCounts = 10
    treeList = []
    for i in range(treeCounts):
        baggingData, bagginglabels = baggingDataSet(df)
        decisionTree = createTree(baggingData, bagginglabels)
        treeList.append(decisionTree)
    print(treeList)
    # classify the test sample with every tree
    labelPred = []
    for tree in treeList:
        testData = [12, 0.92, 2, 19, 86, 2.42, 2.26, 0.3, 1.43, 2.5, 1.38, 3.12, 278]
        label = treeClassify(tree, labels[:-1], testData)
        labelPred.append(label)
    # vote for the final class
    labelDict = {}
    for label in labelPred:
        labelDict[label] = labelDict.get(label, 0) + 1
    sortClass = sorted(labelDict.items(), key=lambda item: item[1])
    print("The predicted label is: {}".format(sortClass[-1][0]))

testWine()
Running the script prints the ten generated trees, followed by the voted prediction for the test sample.
Below is the Python implementation of RandomForestRegression:
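The regression trees below choose each split by minimizing the total within-child sum of squared errors (the R2 variable in chooseBestFeature): for feature a and threshold v,

$$\mathrm{SSE}(a, v) = \sum_{y \in D^{\le v}} \left(y - \bar{y}^{\le v}\right)^2 + \sum_{y \in D^{> v}} \left(y - \bar{y}^{> v}\right)^2$$

where $\bar{y}^{\le v}$ and $\bar{y}^{> v}$ are the mean target values of the left and right children.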
# coding:utf-8
"""
Random forest regression (Random Forest, RF) on the housing data set.
"""
import copy
import math
import random

import numpy as np
import pandas as pd
# Split the data set on a continuous feature; only the target column
# (the last column) of each side is returned
def splitDataSet(dataSet, featIndex, value):
    leftData, rightData = [], []
    for dt in dataSet:
        if dt[featIndex] <= value:
            leftData.append(dt[-1])
        else:
            rightData.append(dt[-1])
    return leftData, rightData
# Choose the best split: the feature/threshold pair that minimizes the
# total within-child sum of squared errors (stored in R2 below)
def chooseBestFeature(dataSet):
    bestR2 = float('inf')
    bestFeatureIndex = -1
    bestSplitValue = None
    # iterate over the i-th feature
    for i in range(len(dataSet[0]) - 1):
        featList = [dt[i] for dt in dataSet]
        # candidate thresholds: midpoints between adjacent distinct values
        sortfeatList = sorted(set(featList))
        splitList = []
        # a single distinct value yields no midpoint, so use the value itself
        if len(sortfeatList) == 1:
            splitList.append(sortfeatList[0])
        else:
            for j in range(len(sortfeatList) - 1):
                splitList.append((sortfeatList[j] + sortfeatList[j + 1]) / 2)
        # evaluate each candidate threshold, keeping the best one
        for splitValue in splitList:
            subDataSet0, subDataSet1 = splitDataSet(dataSet, i, splitValue)
            lenLeft, lenRight = len(subDataSet0), len(subDataSet1)
            # guard against an empty side, whose mean is undefined
            if lenLeft == 0 and lenRight != 0:
                rightMean = np.mean(subDataSet1)
                R2 = sum([(x - rightMean) ** 2 for x in subDataSet1])
            elif lenLeft != 0 and lenRight == 0:
                leftMean = np.mean(subDataSet0)
                R2 = sum([(x - leftMean) ** 2 for x in subDataSet0])
            else:
                leftMean, rightMean = np.mean(subDataSet0), np.mean(subDataSet1)
                leftR2 = sum([(x - leftMean) ** 2 for x in subDataSet0])
                rightR2 = sum([(x - rightMean) ** 2 for x in subDataSet1])
                R2 = leftR2 + rightR2
            if R2 < bestR2:
                bestR2 = R2
                bestFeatureIndex = i
                bestSplitValue = splitValue
    return bestFeatureIndex, bestSplitValue
# Remove the i-th feature and split the data set on it
def splitData(dataSet, featIndex, features, value):
    newFeatures = copy.deepcopy(features)
    newFeatures.remove(features[featIndex])
    leftData, rightData = [], []
    for dt in dataSet:
        temp = []
        temp.extend(dt[:featIndex])
        temp.extend(dt[featIndex + 1:])
        if dt[featIndex] <= value:
            leftData.append(temp)
        else:
            rightData.append(temp)
    return newFeatures, leftData, rightData
# Build the regression tree
def regressionTree(dataSet, features):
    classList = [dt[-1] for dt in dataSet]
    # all target values identical: the node is pure
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # down to the last feature and still not pure: return the mean
    if len(features) == 1:
        return np.mean(classList)
    bestFeatureIndex, bestSplitValue = chooseBestFeature(dataSet)
    bestFeature = features[bestFeatureIndex]
    # new data sets with the chosen root feature removed
    newFeatures, leftData, rightData = splitData(dataSet, bestFeatureIndex, features, bestSplitValue)
    # if either subtree is empty, return the mean of this node's samples
    if len(leftData) == 0 or len(rightData) == 0:
        return np.mean([dt[-1] for dt in leftData] + [dt[-1] for dt in rightData])
    else:
        # both subtrees non-empty: keep splitting
        myTree = {bestFeature: {'<' + str(bestSplitValue): {}, '>' + str(bestSplitValue): {}}}
        myTree[bestFeature]['<' + str(bestSplitValue)] = regressionTree(leftData, newFeatures)
        myTree[bestFeature]['>' + str(bestSplitValue)] = regressionTree(rightData, newFeatures)
        return myTree
# Run a test sample through the generated regression tree
def treeClassify(decisionTree, featureLabel, testDataSet):
    firstFeature = next(iter(decisionTree))
    secondFeatDict = decisionTree[firstFeature]
    # both keys share the numeric part, e.g. '<2.5' and '>2.5'
    splitValue = float(next(iter(secondFeatDict))[1:])
    featureIndex = featureLabel.index(firstFeature)
    if testDataSet[featureIndex] <= splitValue:
        valueOfFeat = secondFeatDict['<' + str(splitValue)]
    else:
        valueOfFeat = secondFeatDict['>' + str(splitValue)]
    if isinstance(valueOfFeat, dict):
        pred_label = treeClassify(valueOfFeat, featureLabel, testDataSet)
    else:
        pred_label = valueOfFeat
    return pred_label
# Draw a bootstrap sample of the same size as the original training set,
# restricted to a random subset of sqrt(m - 1) of the features
def baggingDataSet(dataSet):
    n, m = dataSet.shape
    features = random.sample(list(dataSet.columns.values[:-1]), int(math.sqrt(m - 1)))
    features.append(dataSet.columns.values[-1])
    rows = [random.randint(0, n - 1) for _ in range(n)]
    trainData = dataSet.iloc[rows][features]
    return trainData.values.tolist(), features
def testHousing():
    df = pd.read_csv('housing.txt')
    labels = df.columns.values.tolist()
    # grow the regression trees and collect them in a list
    treeCounts = 10
    treeList = []
    for i in range(treeCounts):
        baggingData, bagginglabels = baggingDataSet(df)
        decisionTree = regressionTree(baggingData, bagginglabels)
        treeList.append(decisionTree)
    print(treeList)
    # predict the test sample with every tree
    labelPred = []
    for tree in treeList:
        testData = [0.38735, 0, 25.65, 0, 0.581, 5.613, 95.6, 1.7572, 2, 188, 19.1, 359.29, 27.26]
        label = treeClassify(tree, labels[:-1], testData)
        labelPred.append(label)
    # average the trees' predictions for the final value
    print("The predicted value is: {}".format(np.mean(labelPred)))

testHousing()
Running the script prints the ten generated trees, followed by the averaged prediction for the test sample.
The code and data sets can be downloaded from the cloud drive: https://pan.baidu.com/s/10ER6UuA1DOPuDWdKCRRd3g (extraction code: 47qx).
Questions about the code or the underlying theory are welcome in the comments; let's learn and improve together.