Decision Trees: Handling Overfitting

This content is compiled from Coursera; discussion and reposting are welcome.

1 Overfitting Recap

  When do we consider that overfitting has occurred?
  When the training error keeps getting smaller while the true error first decreases and then increases, we say the model is overfitting.
  [Figure 1: training error vs. true error as the tree grows more complex]

2 Principle: Prefer a Simple Tree

  When two trees have similar classification error on the validation set, choose the one with lower complexity.
  So how do we obtain a low-complexity decision tree? There are two approaches, early stopping and pruning, described in turn below.
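  As a minimal sketch of this selection rule (the error values, leaf counts, and tolerance are invented for illustration):

# Prefer the simpler tree when validation errors are essentially tied.
candidates = [
    {'name': 'deep tree',    'validation_error': 0.148, 'num_leaves': 40},
    {'name': 'shallow tree', 'validation_error': 0.151, 'num_leaves': 8},
]
tolerance = 0.005  # "similar" means within half a percentage point here
best_error = min(c['validation_error'] for c in candidates)
similar = [c for c in candidates if c['validation_error'] <= best_error + tolerance]
chosen = min(similar, key=lambda c: c['num_leaves'])
print("Chosen model: " + chosen['name'])  # -> shallow tree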

3 How to Obtain a Low-Complexity Tree

3.1 Early Stopping

  • Depth limit: stop once the preset maximum depth is reached;
  • Classification-error criterion: stop when further splitting no longer yields an appreciable reduction in classification error;
  • Minimum node size: stop when a node contains too few data points.
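  Section 4 below adds these three checks by hand. Purely as a point of reference (not part of the course code), the same knobs exist as standard hyperparameters in scikit-learn, where min_samples_split roughly plays the role of the minimum node size and min_impurity_decrease is an impurity-based rather than error-based gain threshold:

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=10,              # early stopping 1: depth limit
    min_impurity_decrease=0.0, # early stopping 2: require a minimum gain before splitting
    min_samples_split=2,       # early stopping 3: do not split nodes that are too small
)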

3.2 Pruning

  A drawback of early stopping, illustrated with XOR: only by splitting all the way down do we reach the ideal tree, yet early stopping terminates too soon.
  [Figure 2: XOR example, y = x[1] XOR x[2]]
  Splitting further on x[1] or x[2] yields no reduction in error, so we stop right there...
  [Figure 3: the resulting single-node tree]
  This is clearly inappropriate, because we have essentially made no decision at all!
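  A tiny self-contained sketch of this failure case (the data, feature names, and the mistakes() helper are illustrative, not the course code): with four XOR-labeled points, the majority classifier already misclassifies half of them, and splitting on either single feature leaves the error unchanged, so any minimum-error-reduction rule with a threshold of 0 or more stops at the root.

# Four points, y = X1 XOR X2.
data = [
    {'X1': 0, 'X2': 0, 'y': 0},
    {'X1': 0, 'X2': 1, 'y': 1},
    {'X1': 1, 'X2': 0, 'y': 1},
    {'X1': 1, 'X2': 1, 'y': 0},
]

def mistakes(points):
    # Errors made by predicting the majority label of `points`.
    ones = sum(p['y'] for p in points)
    return min(ones, len(points) - ones)

error_before = mistakes(data) / float(len(data))  # 0.5
for feature in ('X1', 'X2'):
    left = [p for p in data if p[feature] == 0]
    right = [p for p in data if p[feature] == 1]
    error_after = (mistakes(left) + mistakes(right)) / float(len(data))  # also 0.5
    print("splitting on %s reduces the error by %.2f" % (feature, error_before - error_after))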
  Before introducing pruning, a few definitions:
   L(T): the number of leaf nodes of tree T, which measures the tree's complexity;
   Error(T): the classification error of T.
  The total cost of a tree is then
   C(T) = λ·L(T) + Error(T)
  With λ = 0 this reduces to the ordinary decision tree of the previous post. With λ = ∞ it degenerates to the plain majority rule (if the data contain more 1s than 0s, predict 1 for everything), which amounts to making no decision at all. A λ in between yields a reasonably balanced result.
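  To see how λ arbitrates between the two extremes, here is a small numeric illustration (the error values and leaf counts are invented for the example):

# Compare the total cost of a simple and a complex tree for several values of lambda.
def cost(error, num_leaves, lam):
    return error + lam * num_leaves

simple_tree  = {'error': 0.25, 'leaves': 3}
complex_tree = {'error': 0.20, 'leaves': 12}

for lam in (0.0, 0.01, 1.0):
    c_simple  = cost(simple_tree['error'],  simple_tree['leaves'],  lam)
    c_complex = cost(complex_tree['error'], complex_tree['leaves'], lam)
    winner = 'simple' if c_simple <= c_complex else 'complex'
    print("lambda=%.2f  simple=%.2f  complex=%.2f  -> keep the %s tree" % (lam, c_simple, c_complex, winner))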
Finally, the pruning algorithm itself:
[Figure 4: the pruning algorithm]
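The general idea behind this kind of cost-based pruning is a bottom-up pass over the trained tree: at every internal node, compare the cost C of keeping the subtree with the cost of collapsing it into a single majority-vote leaf, and prune whenever the leaf is no more costly. The sketch below applies this idea to the node dictionaries built in Section 4; count_leaves, subtree_error, and prune are my own illustrative helpers rather than the course's exact algorithm, and it assumes the classify() and create_leaf() functions from the previous post as well as SFrame-style data whose rows iterate as dicts.

def count_leaves(tree):
    # Number of leaves in a subtree; used as the complexity term L(T).
    if tree['is_leaf']:
        return 1
    return count_leaves(tree['left']) + count_leaves(tree['right'])

def subtree_error(tree, data, target):
    # Fraction of rows misclassified by `tree` (classify() is from the previous post).
    num_mistakes = sum(1 for row in data if classify(tree, row) != row[target])
    return num_mistakes / float(len(data))

def prune(tree, data, target, lam):
    if tree['is_leaf'] or len(data) == 0:
        return tree
    # Prune the children first (bottom-up).
    feature = tree['splitting_feature']
    left = prune(tree['left'], data[data[feature] == 0], target, lam)
    right = prune(tree['right'], data[data[feature] == 1], target, lam)
    tree = dict(tree, left=left, right=right)

    # Cost of keeping the (already pruned) subtree vs. collapsing it into one leaf.
    keep_cost = subtree_error(tree, data, target) + lam * count_leaves(tree)
    leaf = create_leaf(data[target])  # create_leaf() is from the previous post
    leaf_cost = subtree_error(leaf, data, target) + lam * 1
    return leaf if leaf_cost <= keep_cost else tree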

4 Code Implementation

Data and code can be downloaded here.

# The only change from the previous post is the addition of the early-stopping conditions to prevent overfitting; the rest of the code is identical.
def decision_tree_create(data, features, target, current_depth = 0, 
                         max_depth = 10, min_node_size=1, 
                         min_error_reduction=0.0):

    remaining_features = features[:] # Make a copy of the features.

    target_values = data[target]
    print "--------------------------------------------------------------------"
    print "Subtree, depth = %s (%s data points)." % (current_depth, len(target_values))


    # Stopping condition 1: All nodes are of the same type.
    if intermediate_node_num_mistakes(target_values) == 0:
        print "Stopping condition 1 reached. All data points have the same target value."                
        return create_leaf(target_values)

    # Stopping condition 2: No more features to split on.
    if remaining_features == []:
        print "Stopping condition 2 reached. No remaining features."                
        return create_leaf(target_values)    

    # Early stopping condition 1: Reached max depth limit.
    if current_depth >= max_depth:
        print "Early stopping condition 1 reached. Reached maximum depth."
        return create_leaf(target_values)

    # Early stopping condition 2: Reached the minimum node size.
    # If the number of data points is less than or equal to the minimum size, return a leaf.
    if reached_minimum_node_size(data, min_node_size):
        print "Early stopping condition 2 reached. Reached minimum node size."
        return create_leaf(target_values)

    # Find the best splitting feature
    splitting_feature = best_splitting_feature(data, features, target)

    # Split on the best feature that we found. 
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]

    # Early stopping condition 3: Minimum error reduction
    # Calculate the error before splitting (number of misclassified examples 
    # divided by the total number of examples)
    error_before_split = intermediate_node_num_mistakes(target_values) / float(len(data))

    # Calculate the error after splitting (number of misclassified examples 
    # in both groups divided by the total number of examples)
    left_mistakes = intermediate_node_num_mistakes(left_split)
    right_mistakes = intermediate_node_num_mistakes(right_split)
    error_after_split = (left_mistakes + right_mistakes) / float(len(data))

    # If the error reduction is LESS THAN OR EQUAL TO min_error_reduction, return a leaf.
    if error_reduction(error_before_split, error_after_split) <= min_error_reduction:
        print "Early stopping condition 3 reached. Minimum error reduction."
        return create_leaf(target_values)


    remaining_features.remove(splitting_feature)
    print "Split on feature %s. (%s, %s)" % (\
                      splitting_feature, len(left_split), len(right_split))


    # Repeat (recurse) on left and right subtrees
    left_tree = decision_tree_create(left_split, remaining_features, target, 
                                     current_depth + 1, max_depth, min_node_size, min_error_reduction)        

    right_tree = decision_tree_create(right_split, remaining_features, target, 
                                      current_depth + 1, max_depth, min_node_size, min_error_reduction)


    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}
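
A hypothetical call, assuming train_data, the features list, and the 'safe_loans' target column were prepared exactly as in the previous post (the hyperparameter values are only an example):

my_decision_tree = decision_tree_create(train_data, features, 'safe_loans',
                                        max_depth = 6, min_node_size = 100,
                                        min_error_reduction = 0.0)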
