Random forests are, so far, by far the simplest topic here: the easiest to understand, and probably the easiest to implement!
First dig the hole, then fill it in bit by bit.
"Random forest" is a name that truly captures the core of the algorithm: it is both random and a forest!
If I ever need to name a cat, a dog, or a child, I should take naming lessons from the random forest.
Let's unpack the algorithm straight from its name.
First, "random": random sampling of the rows plus random selection of the features.
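(A note on the sampling: in the textbook random forest, the row sampling is a bootstrap sample, drawn with replacement, so one tree can see the same row more than once. The implementation below samples without replacement, which also works. A minimal sketch of both randomizations, with made-up row counts and feature names:)

import random

rows = list(range(8))                     # pretend the dataset has 8 rows
feats = ['f1', 'f2', 'f3', 'f4']          # made-up feature names

boot = random.choices(rows, k=len(rows))  # bootstrap: with replacement, duplicates allowed
subset = random.sample(feats, 2)          # feature subset: without replacement
print(boot, subset)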
Next, "forest": why a forest?
Because the basic unit of the algorithm is a single decision tree.
A random forest classifies by running many decision trees. Each tree produces its own predicted class, and the final answer is decided by majority vote.
That is: with three classes A, B, and C, if most trees predict A and only a few predict B or C, the final prediction is A.
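The vote itself is a one-liner; here is a toy check of the A/B/C example above, using collections.Counter:

from collections import Counter

votes = ['A', 'A', 'B', 'A', 'C', 'A']      # one predicted class per tree
print(Counter(votes).most_common(1)[0][0])  # -> 'A'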
I already built a decision tree in an earlier post, so only minor code changes are needed on top of it.
The decision tree is a class, and every tree in the forest is one instance of that class.
Each tree object holds its own randomly drawn sample of the data and its own prediction results.
The overall procedure (implemented in full below):

Loop once per tree to be built:
    randomly select a subset of the features
    randomly select a subset of the samples
    create a decision tree object
    build that object's tree
    use the tree to predict the class of every row in the dataset
Collect the predictions of all the trees and take a majority vote
import random

import numpy as np
import pandas as pd

# Load the data
datas = pd.read_excel('./datas1.xlsx')
important_features = ['推荐类型', '推荐分值', '回复速度']
datas_1 = datas[important_features]
Y = datas_1['推荐类型']
X = datas_1.drop('推荐类型', axis=1)
Y_feature = '推荐类型'
# A single node of the tree
class Node_1():
    def __init__(self, value):
        self.value = value        # the subset of rows that reached this node
        self.select_feat = None   # feature this node splits on (None for a leaf)
        self.sons = {}            # children, keyed by the split feature's values
# The decision tree built out of those nodes
class Tree():
    def __init__(self, datas_arg):
        self.root = None
        self.datas = datas_arg
        self.Y_predict = []
        self.X = datas_arg.drop(Y_feature, axis=1)

    def get_value_1(self, datas_arg, node_arg=None):
        # Determine the current node (create the root on the first call)
        node = node_arg
        if self.root is None:
            node = Node_1(datas_arg)
            self.root = node
        # Pick the split feature: compute the information gain of every
        # candidate split and keep the feature with the largest gain
        gain_dicts = {}
        for i in self.X.columns:
            groups = datas_arg.groupby(i)
            groups = [groups.get_group(j) for j in set(datas_arg[i])]
            if len(groups) > 1:  # the feature can still split this node
                gain_dicts[i] = self.get_gain(datas_arg, groups, Y_feature)
        # Stop splitting when no feature is splittable or the best gain is 0
        if (not gain_dicts) or max(gain_dicts.values()) == 0:
            return
        select_feat = max(gain_dicts, key=lambda x: gain_dicts[x])
        node.select_feat = select_feat
        group_feat = datas_arg.groupby(select_feat)
        for j in set(datas_arg[select_feat]):
            node_son_value = group_feat.get_group(j)
            node_son = Node_1(node_son_value)
            node.sons[j] = node_son
        for key, node_single in node.sons.items():
            self.get_value_1(node_single.value, node_single)
    # Entropy of a column: H = -sum(p * log2(p))
    def get_ent(self, datas, feature):
        p_values = datas[feature].value_counts(normalize=True)
        ent = -(p_values * np.log2(p_values)).sum()
        return ent
    # Conditional entropy: each group's entropy, weighted by its share of the rows
    def get_condition_ent(self, datas_list, feature):
        proportions = [len(i) for i in datas_list]
        proportions = [i / sum(proportions) for i in proportions]
        ents = [self.get_ent(i, feature) for i in datas_list]
        condition_ent = np.multiply(ents, proportions).sum()
        return condition_ent

    # Information gain = entropy before the split - conditional entropy after it
    def get_gain(self, datas_all, datas_group, feature):
        condition_ent = self.get_condition_ent(datas_group, feature)
        ent_all = self.get_ent(datas_all, feature)
        gain = ent_all - condition_ent
        return gain
    # Walk the tree to classify one row
    def predict(self, data, root):
        if not root.select_feat:
            # Leaf: predict the majority class of the rows that reached it
            p_values = root.value[Y_feature].value_counts(normalize=True)
            self.Y_predict.append(p_values.idxmax())
            return
        feat = root.select_feat
        if data[feat] not in root.sons:
            # A feature value never seen while building this tree: no vote
            self.Y_predict.append(None)
            return
        self.predict(data, root.sons[data[feat]])
    # Pre-order traversal that prints each node's split feature (debugging aid)
    def pre_print(self, root):
        if root is None:
            return
        print(root.select_feat)
        for key, node_son in root.sons.items():
            self.pre_print(node_son)

    # Adapter so DataFrame.apply can feed rows into predict
    def func(self, data):
        self.predict(data, self.root)
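Before wiring the tree into the forest, a quick sanity check of get_ent on a made-up DataFrame (the four toy labels are invented just for this test): a 50/50 class split should come out to exactly 1 bit of entropy.

toy = pd.DataFrame({'推荐类型': ['A', 'A', 'B', 'B']})
print(Tree(toy).get_ent(toy, '推荐类型'))  # -> 1.0, since -(2 * 0.5 * log2(0.5)) == 1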
max_tree_num = 10   # number of trees in the forest
max_feat_num = 3    # sample max_feat_num - 1 features per tree
max_data_num = 100  # sample max_data_num rows per tree
Y_feature = '推荐类型'
# Candidate features must exclude the label column itself
feature_cols = [f for f in important_features if f != Y_feature]
data_index_list = list(range(len(datas_1)))
tree_list = []
all_Y_predict = []
# Loop once per tree: randomly select features and samples, then build
for i in range(max_tree_num):
    data_index = random.sample(data_index_list, max_data_num)
    temp_feat = random.sample(feature_cols, max_feat_num - 1)
    # Keep the label column alongside the sampled features
    temp_datas = datas_1.iloc[data_index][temp_feat + [Y_feature]]
    # Create and build one tree
    tree = Tree(temp_datas)
    tree.get_value_1(tree.datas)
    tree_list.append(tree)
    # Have the tree classify every row of the full dataset
    datas_1.apply(tree.func, axis=1)
    all_Y_predict.append(tree.Y_predict)
# Collect every tree's predictions and take a column-wise majority vote
all_Y_predict = pd.DataFrame(all_Y_predict)  # one row per tree, one column per sample
result = all_Y_predict.apply(pd.Series.value_counts)
Y_predict = result.idxmax()                  # the most-voted class for each sample
accuracy = sum(Y_predict == Y) / len(Y)
print(f"Classification accuracy: {accuracy * 100}%")