Machine Learning Algorithm Implementations (5): A Hand-Written Random Forest

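[Series preface] I started this series because I had spent too long studying certain algorithms only on paper. However well you have memorized an algorithm's steps, you cannot truly understand its details until you have implemented it. The planned order for the series is logistic regression, decision tree (CART), AdaBoost, and GBDT. Other algorithms will be added as my studies continue.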

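This is the first of those added algorithms: random forest, a Bagging method. Like the Boosting algorithms covered earlier, it is an ensemble method, but where Boosting trains its learners one after another, Bagging trains each learner independently on a resampled copy of the data. As the name suggests, a random forest is random features plus a forest's worth of trees.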

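The implementation is simple too: once you have a working decision tree, writing a random forest takes only minutes. The code below uses scikit-learn's DecisionTreeClassifier as the base learner.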
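Before the full class, here is a minimal sketch of the two sources of randomness described above: bootstrap sampling of rows and a random subset of features. The helper name `sample_rows_and_features`, the seed, and the sizes are illustrative assumptions, not part of the implementation that follows.

```python
import numpy as np

def sample_rows_and_features(n_samples, n_features, rng):
    """Draw one bootstrap sample of row indices and one feature subset (hypothetical helper)."""
    # Rows are sampled with replacement, so some rows repeat and others are left out.
    row_idx = rng.choice(n_samples, size=n_samples, replace=True)
    # Features are sampled without replacement; sqrt(n_features) is a common heuristic.
    n_sub = max(1, int(np.sqrt(n_features)))
    feat_idx = rng.choice(n_features, size=n_sub, replace=False)
    return row_idx, feat_idx

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
row_idx, feat_idx = sample_rows_and_features(10_000, 16, rng)
unique_fraction = len(np.unique(row_idx)) / 10_000
print(f"distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"features kept for this tree: {np.sort(feat_idx)}")
```

For large samples the fraction of distinct rows should come out close to 1 - 1/e, roughly 0.63, which is why each tree ends up seeing a noticeably different slice of the training data.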

Classification:

  1. Initialization
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from scipy import stats

class RandomForestClassifier:
    def __init__(self, n_estimators=5, min_samples_split=5, min_samples_leaf=5, min_impurity_decrease=0.0):
        # Number of trees in the forest.
        self.n_estimators = n_estimators
        # The next three parameters are passed through to every DecisionTreeClassifier.
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_impurity_decrease = min_impurity_decrease
        # Fitted trees are collected here.
        self.estimators_ = []
  2. Random sampling of features and samples
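    def __RandomPatches(self, data):
        # data holds the features plus the label in the last column.
        n_samples, n_features = data.shape
        n_features -= 1  # exclude the label column
        # Pick sqrt(n_features) features for this tree, without replacement.
        random_f_idx = np.random.choice(
            n_features, size=int(np.sqrt(n_features)), replace=False
        )
        mask_f_idx = [i for i in range(n_features) if i not in random_f_idx]
        # Bootstrap the rows (with replacement); fancy indexing already returns a copy.
        random_data_idx = np.random.choice(n_samples, size=n_samples, replace=True)
        sub_data = data[random_data_idx]
        # Zero out the unselected features so they are constant and cannot be split on.
        sub_data[:, mask_f_idx] = 0
        return sub_data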
  3. Building the random forest
    def __RF_Clf(self, data):
        for _ in range(self.n_estimators):
            tree = DecisionTreeClassifier(min_samples_split=self.min_samples_split,
                                          min_samples_leaf=self.min_samples_leaf,
                                          min_impurity_decrease=self.min_impurity_decrease)
            # Each tree gets its own bootstrap sample and its own feature subset.
            sub_data = self.__RandomPatches(data)
            # The last column is the label; everything before it is a feature.
            tree.fit(sub_data[:, :-1], sub_data[:, -1])
            self.estimators_.append(tree)
  4. Fitting and prediction
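    def fit(self, X_train, y_train):
        # Reset so that calling fit() again does not keep trees from an earlier call.
        self.estimators_ = []
        # Append the labels as the last column so rows and labels are resampled together.
        data = np.c_[X_train, y_train]
        self.__RF_Clf(data)
        return self

    def predict(self, X_test):
        # Shape (n_samples, n_estimators): one row of votes per test sample.
        raw_pred = np.array([tree.predict(X_test) for tree in self.estimators_]).T
        # Majority vote per sample. np.atleast_1d covers both SciPy behaviours:
        # the mode returned as a length-1 array (older) or as a scalar (newer).
        return np.array([np.atleast_1d(stats.mode(y_pred)[0])[0] for y_pred in raw_pred])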
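To check the four steps end to end, the sketch below repeats the same technique (bootstrap rows, zero out unselected features, majority vote) as two standalone functions and runs it on the iris dataset. The function names `fit_forest` and `predict_forest`, the dataset, the seeds, and the choice of 25 trees are all assumptions made for illustration. The vote here uses `np.unique` instead of `scipy.stats.mode`, which is a substitution, not the code above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_estimators=25, seed=0):
    """Train n_estimators trees, each on bootstrapped rows and a masked feature set."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_sub = max(1, int(np.sqrt(n_features)))
    trees = []
    for _ in range(n_estimators):
        rows = rng.choice(n_samples, size=n_samples, replace=True)
        keep = rng.choice(n_features, size=n_sub, replace=False)
        X_sub = X[rows].copy()
        masked = np.setdiff1d(np.arange(n_features), keep)
        X_sub[:, masked] = 0  # constant columns offer no valid split
        trees.append(DecisionTreeClassifier(random_state=0).fit(X_sub, y[rows]))
    return trees

def predict_forest(trees, X):
    """Majority vote across trees, one vote per tree per sample."""
    votes = np.array([t.predict(X) for t in trees]).T  # (n_samples, n_trees)
    winners = []
    for row in votes:
        labels, counts = np.unique(row, return_counts=True)
        winners.append(labels[np.argmax(counts)])
    return np.array(winners)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
trees = fit_forest(X_train, y_train)
y_pred = predict_forest(trees, X_test)
accuracy = float(np.mean(y_pred == y_test))
print(f"test accuracy with {len(trees)} trees: {accuracy:.3f}")
```

One design point worth noting: because the unselected features are zeroed out and not dropped, every tree still expects the full feature matrix at prediction time, so the test data can be passed in unchanged.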

Regression:

To be continued...
