

  • 1 Introduction to Multivariate Decision Trees
  • 2 Implementation Approach
  • 3 Functions in the Code
    • 3.1 class TreeNode
    • 3.2 trainLinear
    • 3.3 binaryTrainSet
    • 3.4 score
    • 3.5 treeGenerate
    • 3.6 predict
    • 3.7 evaluate
  • 4 Complete Code
  • 5 Results

1 Introduction to Multivariate Decision Trees

Figure 1-1 Multivariate decision tree

2 Implementation Approach

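  The implementation uses the breast cancer dataset that ships with sklearn. Starting from the root, every node is examined in turn. When node $n$ is examined, a linear regression model $linear$ is first fitted on the dataset $D$ held by $n$ (the root holds the entire training set). The regression model effectively acts as a classifier here; other models, such as logistic regression, could be used instead.
  Next, $D$ is split into two classes according to the predicted output of $linear$ (negative and positive; in the code below this corresponds to an output less than or equal to $0$ versus greater than $0$). This yields two subsets, $D^-$ and $D^+$.
  Consider $D^-$ and evaluate $linear$ on it. If the accuracy is greater than or equal to a preset threshold, or $D^-$ is empty, the left child of $n$ becomes a leaf labeled with the negative class. If the accuracy is below the threshold, $D^-$ is passed to the left child of $n$, and the left child is examined recursively.
  Likewise, consider $D^+$ and evaluate $linear$ on it. If the accuracy is greater than or equal to the preset threshold, or $D^+$ is empty, the right child of $n$ becomes a leaf labeled with the positive class. If the accuracy is below the threshold, $D^+$ is passed to the right child of $n$, and the right child is examined recursively.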
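The per-node decision described above can be sketched in plain Python. This is only an illustration under stated assumptions: the weights, bias, samples, and the helper names `split_by_sign` and `side_accuracy` are hypothetical and do not come from the original code, which fits the model with sklearn instead of using fixed weights.

```python
def split_by_sign(weights, bias, samples, labels):
    """Partition samples by the sign of the linear score w.x + b."""
    negative, positive = [], []
    for x, y in zip(samples, labels):
        s = sum(w * v for w, v in zip(weights, x)) + bias
        (negative if s <= 0 else positive).append((x, y))
    return negative, positive


def side_accuracy(subset, label):
    """Fraction of samples in `subset` whose true label equals `label`."""
    if not subset:
        return 1.0  # an empty side is turned into a leaf directly
    return sum(1 for _, y in subset if y == label) / len(subset)


# Hypothetical node model: score = x1 - x2
weights, bias = [1.0, -1.0], 0.0
samples = [[0.0, 1.0], [1.0, 3.0], [2.0, 1.0],
           [3.0, 0.5], [0.5, 2.0], [4.0, 1.0]]
labels = [-1, -1, 1, 1, 1, 1]
threshold = 0.96

neg, pos = split_by_sign(weights, bias, samples, labels)
for name, subset, label in (("D-", neg, -1), ("D+", pos, 1)):
    acc = side_accuracy(subset, label)
    action = "make leaf" if acc >= threshold else "recurse"
    print(name, len(subset), round(acc, 2), action)
```

In this toy data the positive side is pure and becomes a leaf, while the negative side contains one positive sample and would be handed to a child node.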

3 Functions in the Code

3.1 class TreeNode

class TreeNode(object):
    def __init__(self, model=None, C=None, left=None, right=None):
        self.model = model
        self.C = C
        self.left = left
        self.right = right

  Defines the node structure, which has four fields. $model$ is the node's linear model; $C$ is the node's class label, which is meaningful only for leaf nodes and is $None$ for internal nodes; $left$ is the left child; $right$ is the right child.
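To make the leaf versus internal-node convention concrete, here is a small hand-built tree. The helpers `count_leaves` and `depth` are hypothetical additions for illustration, and the strings stored in `model` are placeholders standing in for fitted linear models.

```python
class TreeNode(object):
    def __init__(self, model=None, C=None, left=None, right=None):
        self.model = model
        self.C = C
        self.left = left
        self.right = right


def count_leaves(node):
    """A node is a leaf exactly when its class label C is set."""
    if node.C is not None:
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


def depth(node):
    """Number of internal nodes on the longest root-to-leaf path."""
    if node.C is not None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))


# Placeholder strings stand in for fitted linear models.
inner = TreeNode(model="linear model B",
                 left=TreeNode(C=-1), right=TreeNode(C=1))
root = TreeNode(model="linear model A",
                left=inner, right=TreeNode(C=1))

print(count_leaves(root), depth(root))
```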

3.2 trainLinear

def trainLinear(linear, x, y):
    linear.fit(x, y)
    return linear

  Trains a linear model $linear$ on data $x$ and labels $y$ using sklearn's least squares implementation, and returns the fitted model.
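For intuition about why a least squares fit on labels of $-1$ and $1$ can serve as a classifier, the following sketch works through the one-feature closed form by hand. It is an assumption-laden toy: `fit_least_squares_1d` is a hypothetical helper, and the original code relies on sklearn's `LinearRegression` for the general multi-feature case.

```python
def fit_least_squares_1d(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [-1, -1, 1, 1]

slope, intercept = fit_least_squares_1d(xs, ys)
# Classify by the sign of the fitted value, as the tree does.
predicted = [-1 if slope * x + intercept <= 0 else 1 for x in xs]
print(slope, intercept, predicted)
```

The fitted line crosses zero between the two groups of points, so thresholding its output at $0$ separates them.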

3.3 binaryTrainSet

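def binaryTrainSet(linear, x, y):
    x0 = []  # samples predicted as negative (output <= 0)
    x1 = []  # samples predicted as positive (output > 0)
    y0 = []  # true labels of the samples in x0
    y1 = []  # true labels of the samples in x1
    p = linear.predict(x)
    for i in range(p.shape[0]):
        if p[i] <= 0:
            x0.append(x[i])
            y0.append(y[i])
        else:
            x1.append(x[i])
            y1.append(y[i])
    return np.array(x0), np.array(x1), np.array(y0), np.array(y1)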

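  Splits the dataset according to the class predicted by the linear model $linear$, returning the samples and true labels of the negative side followed by those of the positive side.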

3.4 score

def score(linear, x, y):
    right = 0
    p = linear.predict(x)
    for i in range(p.shape[0]):
        if p[i]<=0 and y[i]==-1 or p[i]>0 and y[i]==1:
            right += 1
    return right / x.shape[0]

  Computes the accuracy of the linear model $linear$ on dataset $x$ with labels $y$, returning a float in the interval $[0,1]$.
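The condition in score mixes `and` and `or` without parentheses. Since `and` binds more tightly than `or` in Python, it reads as "(negative output and label $-1$) or (positive output and label $1$)". The sketch below checks that reading against a more explicit formulation; both function names are hypothetical.

```python
def is_correct_original(p, y):
    # Same expression as in score; `and` is evaluated before `or`.
    return p <= 0 and y == -1 or p > 0 and y == 1


def is_correct_explicit(p, y):
    # Convert the raw output to a class first, then compare.
    predicted = -1 if p <= 0 else 1
    return predicted == y


cases = [(p, y) for p in (-2.0, 0.0, 0.5) for y in (-1, 1)]
agree = all(bool(is_correct_original(p, y)) == is_correct_explicit(p, y)
            for p, y in cases)
print(agree)
```

An output of exactly $0$ counts as the negative class in both versions, matching the split used by binaryTrainSet.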

3.5 treeGenerate

def treeGenerate(root, x, y, precision):
    # Fit this node's linear model on the data it holds
    root.model = LinearRegression()
    root.model = trainLinear(root.model, x, y)
    # Split the data by the sign of the model output
    x0, x1, y0, y1 = binaryTrainSet(root.model, x, y)
    if len(x0)==0 or score(root.model, x0, y0)>= precision:
        # Negative side is empty or accurate enough: make a leaf
        root.left = TreeNode(C=-1)
    else:
        root.left = TreeNode()
        treeGenerate(root.left, x0, y0, precision)
    if len(x1)==0 or score(root.model, x1, y1) >= precision:
        # Positive side is empty or accurate enough: make a leaf
        root.right = TreeNode(C=1)
    else:
        root.right = TreeNode()
        treeGenerate(root.right, x1, y1, precision)

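  Recursively builds the multivariate decision tree; this function is the core of the code. $root$ is the root node of the (sub)tree, $x$ and $y$ are the training data and labels, and $precision$ is the preset accuracy threshold.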
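The recursion relies on each split eventually producing subsets that meet the threshold. As a hedged illustration of how one might guard against a split that makes no progress, the sketch below rebuilds the same idea for a single feature in plain Python and adds two stopping rules that are not part of the original code: a `max_depth` limit and a majority-vote leaf when a child would receive exactly the parent's data. `Node`, `fit_line`, and `build` are hypothetical names.

```python
class Node:
    """Same fields as TreeNode: model, class label C, left and right."""
    def __init__(self, model=None, C=None, left=None, right=None):
        self.model, self.C, self.left, self.right = model, C, left, right


def fit_line(xs, ys):
    """Least-squares line for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def build(node, xs, ys, precision, depth=0, max_depth=5):
    """Recursive construction with two extra stopping rules (assumptions)."""
    slope, intercept = fit_line(xs, ys)
    node.model = (slope, intercept)
    pairs = list(zip(xs, ys))
    neg = [(x, y) for x, y in pairs if slope * x + intercept <= 0]
    pos = [(x, y) for x, y in pairs if slope * x + intercept > 0]
    for side, subset, label in (("left", neg, -1), ("right", pos, 1)):
        sub_y = [y for _, y in subset]
        acc = sub_y.count(label) / len(sub_y) if sub_y else 1.0
        if acc >= precision:
            child = Node(C=label)
        elif depth + 1 >= max_depth or len(subset) == len(pairs):
            # Depth limit reached, or the split made no progress:
            # fall back to a majority-vote leaf instead of recursing.
            child = Node(C=1 if sum(sub_y) > 0 else -1)
        else:
            child = build(Node(), [x for x, _ in subset], sub_y,
                          precision, depth + 1, max_depth)
        setattr(node, side, child)
    return node


def predict(node, x):
    """Follow the sign of each node's linear output down to a leaf."""
    while node.C is None:
        slope, intercept = node.model
        node = node.left if slope * x + intercept <= 0 else node.right
    return node.C


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [-1, -1, -1, 1, -1, 1, 1, 1]
root = build(Node(), xs, ys, precision=0.9)
print([predict(root, x) for x in xs])
```

Whether such guards are needed for the breast cancer data is not established by the original post; they are shown only as one possible safety net.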

3.6 predict

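def predict(root, xs):
    # A node with a class label is a leaf: return its label
    if root.C is not None:
        return root.C
    # Otherwise route the sample by the sign of the node's model output
    if root.model.predict(np.expand_dims(xs, axis=0))[0] <= 0:
        return predict(root.left, xs)
    else:
        return predict(root.right, xs)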

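  Predicts a single sample with the finished multivariate decision tree. $root$ is the root node of the tree and $xs$ is the sample's feature vector, a one-dimensional $numpy$ array; the function returns the sample's class.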

3.7 evaluate

def evaluate(root, x, y):
    right = 0
    for i in range(x.shape[0]):
        if predict(root, x[i]) == y[i]:
            right += 1
    return right / x.shape[0]

  Computes the accuracy on dataset $x$ of the multivariate decision tree rooted at $root$; $y$ holds the labels corresponding to the sample features in $x$.

4 Complete Code

# -*- coding: utf-8 -*-
"""
Created on Tue Nov 24 17:13:46 2020

@author: qiqi
"""

import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

class TreeNode(object):
    def __init__(self, model=None, C=None, left=None, right=None):
        self.model = model
        self.C = C
        self.left = left
        self.right = right

def trainLinear(linear, x, y):
    linear.fit(x, y)
    return linear

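def binaryTrainSet(linear, x, y):
    x0 = []  # samples predicted as negative (output <= 0)
    x1 = []  # samples predicted as positive (output > 0)
    y0 = []  # true labels of the samples in x0
    y1 = []  # true labels of the samples in x1
    p = linear.predict(x)
    for i in range(p.shape[0]):
        if p[i] <= 0:
            x0.append(x[i])
            y0.append(y[i])
        else:
            x1.append(x[i])
            y1.append(y[i])
    return np.array(x0), np.array(x1), np.array(y0), np.array(y1)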

def score(linear, x, y):
    right = 0
    p = linear.predict(x)
    for i in range(p.shape[0]):
        if p[i]<=0 and y[i]==-1 or p[i]>0 and y[i]==1:
            right += 1
    return right / x.shape[0]

def treeGenerate(root, x, y, precision):
    # Fit this node's linear model on the data it holds
    root.model = LinearRegression()
    root.model = trainLinear(root.model, x, y)
    # Split the data by the sign of the model output
    x0, x1, y0, y1 = binaryTrainSet(root.model, x, y)
    if len(x0)==0 or score(root.model, x0, y0)>= precision:
        # Negative side is empty or accurate enough: make a leaf
        root.left = TreeNode(C=-1)
    else:
        root.left = TreeNode()
        treeGenerate(root.left, x0, y0, precision)
    if len(x1)==0 or score(root.model, x1, y1) >= precision:
        # Positive side is empty or accurate enough: make a leaf
        root.right = TreeNode(C=1)
    else:
        root.right = TreeNode()
        treeGenerate(root.right, x1, y1, precision)

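def predict(root, xs):
    # A node with a class label is a leaf: return its label
    if root.C is not None:
        return root.C
    # Otherwise route the sample by the sign of the node's model output
    if root.model.predict(np.expand_dims(xs, axis=0))[0] <= 0:
        return predict(root.left, xs)
    else:
        return predict(root.right, xs)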

def evaluate(root, x, y):
    right = 0
    for i in range(x.shape[0]):
        if predict(root, x[i]) == y[i]:
            right += 1
    return right / x.shape[0]

if __name__ == '__main__':
    cancer = load_breast_cancer()

    X_train, X_test, y_train, y_test = train_test_split(cancer['data'],cancer['target'], test_size=0.33, random_state=42)
    y_train[y_train == 0] = -1
    y_test[y_test == 0] = -1

    X_train = preprocessing.scale(X_train)
    X_test = preprocessing.scale(X_test)
    root = TreeNode()
    treeGenerate(root, X_train, y_train, 0.96)
    scoreTrain = evaluate(root, X_train, y_train)
    scoreTest = evaluate(root, X_test, y_test)
    print('Training accuracy:', round(scoreTrain, 4))
    print('Test accuracy:', round(scoreTest, 4))

5 Results

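  The resulting multivariate decision tree reaches an accuracy of $0.979$ on the training set and $0.9628$ on the test set. Because the breast cancer dataset is relatively simple, the generated tree is very shallow, so its strong data-fitting capacity is not fully exercised.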
