Machine Learning: Random Forest [Hand-Coded]

Random forest is, so far, just about the simplest algorithm we've covered: the easiest to understand and probably the easiest to implement!

(Opening this post as a stub for now; I'll fill it in bit by bit.)

"Random forest" is a name that truly captures the core of this algorithm: it is both random and a forest!
Honestly, next time I have to name a cat, a dog, or a newborn, I should take a page from random forest's naming philosophy.

Let's unpack the algorithm's essence from its name.

First, "random": random sampling of the rows, plus random selection of the features.
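In code, both kinds of randomness come down to `random.sample` from the standard library. A minimal sketch, using made-up row indices and feature names rather than the data set used below:

```python
import random

random.seed(0)  # fixed seed so the demo is reproducible

row_indices = list(range(10))            # pretend the data set has 10 rows
features = ["score", "speed", "length"]  # hypothetical feature names

sampled_rows = random.sample(row_indices, 6)   # 6 distinct random rows
sampled_feats = random.sample(features, 2)     # 2 distinct random features

print(sampled_rows)
print(sampled_feats)
```

Note that `random.sample` draws without replacement; classic bagging draws the rows *with* replacement (`random.choices` would do that), but sampling without replacement is what the code below uses.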

Next, "forest": why a forest?

Nicely enough, because the basic unit of the algorithm is a decision tree.

A random forest classifies by letting many decision trees vote: each tree produces its own predicted class, and the final label follows majority rule.

That is, given three classes A, B, and C, if most trees predict A and only a few predict B or C, the sample is assigned to class A.
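The majority vote itself is a one-liner with `collections.Counter`. A stdlib sketch, with the seven tree votes invented for illustration:

```python
from collections import Counter

# Hypothetical predictions from 7 trees for a single sample
tree_votes = ["A", "A", "B", "A", "C", "A", "B"]

# most_common(1) returns the (label, count) pair with the most votes
winner, count = Counter(tree_votes).most_common(1)[0]
print(winner, count)  # A 4
```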

We already built a decision tree in an earlier post, so only minor changes to that code are needed.

First, the decision tree is a class, so growing a tree means creating one object of that class.

Each tree object holds its own randomly drawn sample of the data and its own prediction results.

Loop a fixed number of times (one iteration per tree to build):
    randomly pick a subset of the features
    randomly pick a subset of the rows
    create a decision-tree object
    build that object's tree
    use the tree to predict the class of every row in the data set
Aggregate the predictions of all trees and take a majority vote
import math
import numpy as np
import pandas as pd
import random
# Load the data we need
datas = pd.read_excel('./datas1.xlsx')
important_features = ['推荐类型','推荐分值', '回复速度']
datas_1 = datas[important_features]
Y = datas_1['推荐类型']
X = datas_1.drop('推荐类型',axis=1)
Y_feature = "推荐类型"

# A single tree node
class Node_1():
    def __init__(self,value):
        self.value = value          # the rows of data held at this node
        self.select_feat = None     # feature this node splits on (None for leaves)
        self.sons = {}              # child nodes, keyed by feature value
# The tree built from those nodes
class Tree():
    def __init__(self,datas_arg):
        self.root = None
        self.datas = datas_arg
        self.Y_predict = []
        self.X = datas_arg.drop('推荐类型', axis=1)
    def get_value_1(self,datas_arg,node_arg=None):
        # Identify the current node's data
        node = node_arg
        if self.root is None:
            node = Node_1(datas_arg)
            self.root = node
        # Pick the split feature and children: compute the information gain of
        # each candidate split and keep the feature with the largest gain
        gain_dicts = {}
        for i in self.X.columns:
            groups = datas_arg.groupby(i)
            groups = [groups.get_group(j) for j in set(datas_arg[i])]
            if len(groups) > 1:  # the feature still splits the data
                gain_dicts[i] = self.get_gain(datas_arg,groups,Y_feature)
        # Stopping condition: no splittable feature left, or the best gain is 0
        if (not gain_dicts) or max(gain_dicts.values()) == 0:
            return
        select_feat = max(gain_dicts,key=lambda x:gain_dicts[x])
        node.select_feat = select_feat
        group_feat = datas_arg.groupby(select_feat)
        for j in set(datas_arg[select_feat]):
            node_son_value = group_feat.get_group(j)
            node_son = Node_1(node_son_value)
            node.sons[j] = node_son
        for key,node_single in node.sons.items():
            self.get_value_1(node_single.value,node_single)
    # Entropy of one column
    def get_ent(self,datas,feature):
        p_values = datas[feature].value_counts(normalize=True)
        ent = -(p_values * np.log2(p_values)).sum()
        return ent
    # Conditional entropy after a split
    def get_condition_ent(self,datas_list,feature):
        proportions = [len(i) for i in datas_list]
        proportions = [i/sum(proportions) for i in proportions]
        ents = [self.get_ent(i,feature) for i in datas_list]
        condition_ent = np.multiply(ents,proportions).sum()
        return condition_ent
    # Information gain of a split
    def get_gain(self,datas_all,datas_group,feature):
        condition_ent = self.get_condition_ent(datas_group,feature)
        ent_all = self.get_ent(datas_all,feature)
        gain = ent_all - condition_ent
        return gain
    # Walk the tree to classify one row
    def predict(self,data,root):
        # Leaf node: predict the majority class of the rows stored there
        if not root.select_feat:
            p_values = root.value[Y_feature].value_counts(normalize=True)
            self.Y_predict.append(p_values.idxmax())
            return
        feat = root.select_feat
        # A feature value never seen during training has no branch to follow
        if data[feat] not in root.sons:
            self.Y_predict.append(None)
            return
        self.predict(data,root.sons[data[feat]])

    # Pre-order traversal, printing each node's split feature (debug helper)
    def pre_print(self, root):
        if root is None:
            return
        print(root.select_feat)
        for key,node_son in root.sons.items():
            self.pre_print(node_son)
    # Wrapper so that DataFrame.apply can call predict row by row
    def func(self,data):
        self.predict(data,self.root)

max_tree_num = 10   # number of trees in the forest
max_feat_num = 2    # features drawn per tree (only two non-target features here)
max_data_num = 100  # rows drawn per tree
Y_feature = "推荐类型"

data_index_list = list(range(len(datas_1)))
# Candidate split features: everything except the target column
feature_candidates = [f for f in important_features if f != Y_feature]

tree_list = []
all_Y_predict = []
# Loop once per tree to build:
#     randomly pick a subset of the features
#     randomly pick a subset of the rows
for i in range(max_tree_num):
    # random.sample draws without replacement; classic bagging would draw
    # with replacement, but this keeps the original approach
    data_index = random.sample(data_index_list, max_data_num)
    temp_feat = random.sample(feature_candidates,
                              min(max_feat_num, len(feature_candidates)))
    # Each tree's data must also carry the target column for the entropy math
    temp_datas = datas[temp_feat + [Y_feature]].iloc[data_index]
    # Build one tree
    tree = Tree(temp_datas)
    tree.get_value_1(tree.datas)
    tree_list.append(tree)
    # Use this tree to classify every row of the full data set
    datas_1.apply(tree.func,axis=1)
    all_Y_predict.append(tree.Y_predict)
all_Y_predict = pd.DataFrame(all_Y_predict)
# Per column (i.e. per row of the data set), count each class's votes
result = all_Y_predict.apply(pd.Series.value_counts)
Y_predict = result.idxmax()   # majority class for every row

accuracy = sum(Y_predict==Y)/len(Y)
print(f"Classification accuracy: {accuracy*100}%")
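The entropy and information-gain math inside the class can be sanity-checked with a plain-Python version on a toy label list. A stdlib-only sketch; the labels here are invented:

```python
import math

def entropy(labels):
    """Shannon entropy: sum of -p*log2(p) over the label frequencies."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(parent, groups):
    """Parent entropy minus the size-weighted entropy of the split groups."""
    n = len(parent)
    return entropy(parent) - sum(len(g) / n * entropy(g) for g in groups)

parent = ["A", "A", "B", "B"]             # perfectly mixed node: entropy 1.0
perfect_split = [["A", "A"], ["B", "B"]]  # each child pure: conditional entropy 0
print(entropy(parent))                 # 1.0
print(info_gain(parent, perfect_split))  # 1.0
```

A perfectly mixed two-class node has entropy 1 bit, and a split that separates the classes completely recovers all of it as gain, which is exactly the quantity `get_gain` maximizes when choosing a split feature.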
