Input: training set $D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)\}$, attribute set $A = \{a_1, a_2, a_3, \cdots, a_d\}$, threshold $\epsilon$
Output: decision tree $T$
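ID3 grows the tree by information gain and stops splitting a node once the best achievable gain drops below the threshold $\epsilon$. For reference, the standard definitions, consistent with the code below:

$$
H(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k, \qquad
\operatorname{Gain}(D, a) = H(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, H(D^v)
$$

where $p_k$ is the proportion of samples in $D$ belonging to class $k$, and $D^v$ is the subset of $D$ taking the $v$-th value on attribute $a$.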
Python code using Shannon entropy as the splitting criterion (hands-on machine-learning project code):
```python
from math import log
import operator

def cal_shannon_ent(data_set):
    """Compute the Shannon entropy of a data set whose last column is the class label."""
    num_entries = len(data_set)
    label_counts = {}
    # Count the occurrences of each class label
    for feat_vec in data_set:
        current_label = feat_vec[-1]
        if current_label not in label_counts:
            label_counts[current_label] = 0
        label_counts[current_label] += 1
    shannon_ent = 0.0
    for key in label_counts:
        prob = float(label_counts[key]) / num_entries
        shannon_ent -= prob * log(prob, 2)  # entropy in bits
    return shannon_ent
```
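As a quick sanity check, here is how `cal_shannon_ent` behaves on a small toy data set (the rows and feature values are illustrative only; the last column is the class label):

```python
# Toy data set: two binary features plus a class label in the last column.
my_data = [[1, 1, 'yes'],
           [1, 1, 'yes'],
           [1, 0, 'no'],
           [0, 1, 'no'],
           [0, 1, 'no']]

# 2 'yes' and 3 'no' labels: H = -(0.4*log2(0.4) + 0.6*log2(0.6)) ≈ 0.971
print(cal_shannon_ent(my_data))  # ≈ 0.9710
```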
```python
def split_data_set(data_set, axis, value):
    """Return the rows whose feature `axis` equals `value`, with that feature removed."""
    ret_data_set = []
    for feat_vec in data_set:
        if feat_vec[axis] == value:
            # Copy the row without column `axis`
            reduced_feat_vec = feat_vec[:axis]
            reduced_feat_vec.extend(feat_vec[axis + 1:])
            ret_data_set.append(reduced_feat_vec)
    return ret_data_set
```
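Continuing with the same toy data, splitting on feature 0 keeps only the matching rows and strips out that column:

```python
print(split_data_set(my_data, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_data_set(my_data, 0, 0))  # [[1, 'no'], [1, 'no']]
```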
```python
def choose_best_feature_to_split(data_set):
    """Pick the feature with the largest information gain (the ID3 criterion)."""
    num_features = len(data_set[0]) - 1  # last column is the label
    base_entropy = cal_shannon_ent(data_set)
    best_info_gain = 0.0
    best_feature = -1
    for i in range(num_features):
        feature_list = [example[i] for example in data_set]
        unique_values = set(feature_list)
        new_entropy = 0.0
        # Conditional entropy of the labels given feature i
        for value in unique_values:
            sub_data_set = split_data_set(data_set, i, value)
            prob = len(sub_data_set) / float(len(data_set))
            new_entropy += prob * cal_shannon_ent(sub_data_set)
        info_gain = base_entropy - new_entropy
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = i
    return best_feature
```
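On the toy data, feature 0 has the larger information gain (≈ 0.420 versus ≈ 0.171 for feature 1), so it is chosen first:

```python
print(choose_best_feature_to_split(my_data))  # 0
```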
```python
def majority_cnt(class_list):
    """Return the most common class label (majority vote)."""
    class_count = {}
    for vote in class_list:
        if vote not in class_count:
            class_count[vote] = 0
        class_count[vote] += 1
    # `iteritems()` is Python 2 only; `items()` works in Python 3
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
```
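A design note: in Python 3 the same majority vote can be written more compactly with `collections.Counter`; a minimal equivalent sketch:

```python
from collections import Counter

def majority_cnt_counter(class_list):
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(class_list).most_common(1)[0][0]
```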
```python
def create_tree(data_set, labels):
    """Recursively build an ID3 decision tree as nested dicts."""
    class_list = [example[-1] for example in data_set]
    # Stop if all samples share one class
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]
    # Stop if no features remain; fall back to majority vote
    if len(data_set[0]) == 1:
        return majority_cnt(class_list)
    best_feature = choose_best_feature_to_split(data_set)
    best_feature_label = labels[best_feature]
    decision_tree = {best_feature_label: {}}
    del labels[best_feature]  # note: mutates the caller's label list
    feature_values = [example[best_feature] for example in data_set]
    unique_values = set(feature_values)
    for value in unique_values:
        sub_labels = labels[:]  # copy so recursion does not clobber sibling branches
        decision_tree[best_feature_label][value] = create_tree(
            split_data_set(data_set, best_feature, value), sub_labels)
    return decision_tree
```
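Putting it together on the toy data (the feature names are illustrative; note that `create_tree` mutates the labels list, so we pass a copy):

```python
feature_names = ['no surfacing', 'flippers']
tree = create_tree(my_data, feature_names[:])
print(tree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```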
scikit-learn ships a ready-made implementation, `sklearn.tree.DecisionTreeClassifier`. The signature below is from an older 0.2x release; `presort` and `min_impurity_split` have since been removed:

```python
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None,
    random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
    class_weight=None, presort=False)
```
| Parameter | Type | Description |
|---|---|---|
| criterion | string, optional (default='gini') | The function to measure the quality of a split. Supported criteria are 'gini' for Gini impurity and 'entropy' for information gain. |
| splitter | string, optional (default='best') | The strategy used to choose the split at each node. Supported strategies are 'best' to choose the best split and 'random' to choose the best random split. |
| max_depth | int or None, optional (default=None) | The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. |
| min_samples_split | int, float, optional (default=2) | The minimum number of samples required to split an internal node: if int, min_samples_split is the minimum count; if float, it is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split. |
| min_samples_leaf | int, float, optional (default=1) | The minimum number of samples required at a leaf node: if int, min_samples_leaf is the minimum count; if float, it is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node. |
| min_weight_fraction_leaf | float, optional (default=0.) | The minimum weighted fraction of the sum of weights (of all input samples) required at a leaf node. Samples have equal weight when sample_weight is not provided. |
| max_features | int, float, string or None, optional (default=None) | The number of features to consider when looking for the best split: if int, consider max_features features at each split; if float, max_features is a fraction and int(max_features * n_features) features are considered at each split; if 'auto', max_features=sqrt(n_features); if 'sqrt', max_features=sqrt(n_features); if 'log2', max_features=log2(n_features); if None, max_features=n_features. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features. |
| random_state | int, RandomState instance or None, optional (default=None) | If int, random_state is the seed used by the random number generator; if a RandomState instance, it is the random number generator; if None, the generator is the RandomState instance used by np.random. |
| max_leaf_nodes | int or None, optional (default=None) | Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by the relative reduction in impurity. If None, the number of leaf nodes is unlimited. |
| min_impurity_decrease | float, optional (default=0.) | A node will be split if this split induces a decrease of the impurity greater than or equal to this value. |
| min_impurity_split | float | Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf. |
| class_weight | dict, list of dicts, 'balanced' or None (default=None) | Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)). |
| presort | bool, optional (default=False) | Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to True may slow down the training process. When using a smaller dataset or restricted depth, this may speed up the training. |
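A minimal usage sketch of the scikit-learn classifier on the built-in iris data, assuming a recent sklearn release (where presort and min_impurity_split no longer exist); the hyperparameter values here are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion='entropy' matches the information-gain criterion used in the code above
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=2, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out split
```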