Kaggle Learning: Learn Machine Learning 6. Underfitting, Overfitting and Model Optimization


This article is part of the Kaggle self-study series; to return to the table of contents, click here.

This tutorial is part of the series Learn Machine Learning. At the end of this step, you will understand the concepts of underfitting and overfitting, and you will be able to apply these ideas to improve your model's accuracy.

 

Experimenting With Different Models

Now that you have a trustworthy way to measure model accuracy, you can experiment with alternative models and see which gives the best predictions. But what alternatives do you have for models?

 


You can see in scikit-learn's documentation that the decision tree model has many options (more than you'll want or need for a long time). The most important options determine the tree's depth. Recall from page 2 that a tree's depth is a measure of how many splits it makes before coming to a prediction. This is a relatively shallow tree (depth 2). (Translator's note: each internal node is a decision, and the leaf nodes hold the predicted values.)

[Figure 1: a relatively shallow decision tree of depth 2]
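As a small illustration (not from the original tutorial), here is a minimal sketch of fitting such a shallow tree with scikit-learn, assuming the train_X and train_y variables from the earlier steps of this series; max_depth=2 is chosen only to mirror the figure.

from sklearn.tree import DecisionTreeRegressor

# Limit the tree to two levels of splits so it stays deliberately shallow
shallow_model = DecisionTreeRegressor(max_depth=2, random_state=0)
shallow_model.fit(train_X, train_y)

# A depth-2 binary tree has at most 4 leaves
print("depth:", shallow_model.get_depth())
print("leaves:", shallow_model.get_n_leaves())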

In practice, it's not uncommon for a tree to have 10 splits between the top level (all houses) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree only had 1 split, it divides the data into 2 groups. If each group is split again, we would get 4 groups of houses. Splitting each of those again would create 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of houses by the time we get to the 10th level. That's 1024 leaves.
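To make the doubling concrete, here is a quick back-of-the-envelope check (purely illustrative, not part of the original tutorial):

# Each level of binary splits doubles the number of groups: 2, 4, 8, ...
for level in range(1, 11):
    print("level %d: up to %d groups" % (level, 2 ** level))
# level 10 prints 1024, i.e. 2^10 leaves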

 

When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses). (Translator's note: my own reading is that the more finely you partition, the more the model memorizes; if you register yourself as "a person", a friend who comes along later is judged not to be a person, because only someone exactly like you qualifies. That is overfitting.)

 

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide up the houses into very distinct groups.

 

At an extreme, if a tree divides houses into only 2 or 4 groups, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.

 


Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.

[Figure 2: training and (red) validation error versus tree depth, with the validation curve's low point between underfitting and overfitting]

 

Example

There are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes (an unbalanced tree). But the max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area in the above graph toward the overfitting area. (Translator's note: a deeper tree hugs the training data more closely, so the goal is to find a size at which the two are balanced.)

 

We can use a utility function to help compare MAE scores from different values of max_leaf_nodes:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    # Fit a decision tree limited to the given number of leaf nodes
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    # Evaluate it on the held-out validation data
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

 

The data is loaded into train_X, val_X, train_y and val_y using the code you've already seen (and which you've already written). (Translator's note: if you've forgotten it, from now on each step's code will also be packaged separately in a txt file.)
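If you need a refresher, here is a minimal sketch of how those four variables might be produced; the file path and predictor columns below are assumptions for illustration and may differ from the ones used earlier in this series.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical path to the Iowa (Ames) training data
iowa_data = pd.read_csv('train.csv')

# Assumed target and predictor columns
y = iowa_data.SalePrice
predictor_cols = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
                  'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = iowa_data[predictor_cols]

# Hold out part of the data for validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)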

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" % (max_leaf_nodes, my_mae))

(Translator's note: the get_mae function defined here takes the maximum number of leaf nodes, the training data and the validation data, and returns the mean absolute error for that leaf count. Note how the DecisionTreeRegressor parameters are set. Its purpose is to find the tree size at which the model is balanced. PS: on my data the best value came out at around 56-58.)

 

Of the options listed, 500 is the optimal number of leaves. Apply the function to your Iowa data to find the best decision tree. (Translator's note: 500 is the result on the author's data; on ours the best value is around 56-58.)

 

Conclusion

Here's the takeaway: models can suffer from either:

1. Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or

2. Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.

 

But we're still using Decision Tree models, which are not very sophisticated by modern machine learning standards.

 

Your Turn

 

In the near future, you'll be writing functions like get_mae yourself. For now, just copy it over to your work area. Then use a for loop that tries different values of max_leaf_nodes and calls the get_mae function on each to find the ideal number of leaves for your Iowa data.
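One possible shape for that loop (a sketch, assuming get_mae and the train/validation variables above are already defined) is to score each candidate value and keep the one with the lowest validation MAE:

# Candidate leaf counts to try; the exact values are up to you
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# Map each candidate to its validation MAE
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
          for leaf_size in candidate_max_leaf_nodes}

# The best size is the one with the smallest error
best_tree_size = min(scores, key=scores.get)
print("Best max_leaf_nodes:", best_tree_size)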

 

You should see that the ideal number of leaves for the Iowa data is less than the ideal number of leaves for the Melbourne data. Remember that a lower MAE is better. (Translator's note: MAE measures error, so smaller is better.)

 

Continue

Click here to learn your first sophisticated Machine Learning model, the Random Forest. It is a clever extrapolation of the decision tree model that consistently leads to more accurate predictions.

This article is part of the Kaggle self-study series; to return to the table of contents, click here.

