Understanding the Bias-Variance Tradeoff

1.Bias and Variance

Understanding how different sources of error lead to bias and variance helps us improve the data fitting process resulting in more accurate models. We define bias and variance in three ways: conceptually, graphically and mathematically.

1.1Conceptual Definition

  • Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Of course you only have one model so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis creating a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off in general these models' predictions are from the correct value.
  • Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.
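
The thought experiment of repeating the whole model building process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the true relationship `true_f`, the noise level, the sample size and the choice of a straight-line model are all hypothetical and not part of the original text. Each repetition draws a fresh data set, fits a new model, and records its prediction at one point `x0`.

```python
import random
import statistics

def true_f(x):
    # Assumed ground-truth relationship (hypothetical, for illustration only).
    return x * x

def sample_dataset(rng, n=30, noise_sd=0.1):
    # Gather a new data set: random inputs plus noisy observations of true_f.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # Ordinary least squares for the straight-line model y = a + b * x.
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def bias_and_variance_at(x0, n_repeats=2000, seed=0):
    # Repeat the model building process and collect the predictions at x0.
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        xs, ys = sample_dataset(rng)
        model = fit_line(xs, ys)
        preds.append(model(x0))
    # Bias: average prediction minus the correct value.
    bias = statistics.fmean(preds) - true_f(x0)
    # Variance: spread of the predictions across realizations of the model.
    variance = statistics.pvariance(preds)
    return bias, variance

bias, variance = bias_and_variance_at(x0=0.5)
print(f"bias at x0=0.5: {bias:+.4f}, variance: {variance:.5f}")
```

Because a straight line cannot follow the curved `true_f`, the average prediction stays away from the correct value no matter how many data sets are drawn. That gap is the bias, while the variance only reflects how much individual fits wobble around their own average.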

1.2Graphical Definition

      We can create a graphical visualization of bias and variance using a bulls-eye diagram. Imagine that the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. Imagine we can repeat our entire model building process to get a number of separate hits on the target. Each hit represents an individual realization of our model, given the chance variability in the training data we gather. Sometimes we will get a good distribution of training data so we predict very well and we are close to the bulls-eye, while sometimes our training data might be full of outliers or non-standard values resulting in poorer predictions. These different realizations result in a scatter of hits on the target.

We can plot four different cases representing combinations of both high and low bias and variance.
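
The four cases can also be mimicked numerically. The sketch below is an assumption-laden toy, not the procedure behind the original figure: hits are drawn from a 2D Gaussian, `offset` stands in for bias (how far the cloud of hits is shifted from the bulls-eye at the origin), and `spread` stands in for variance. All names and values are hypothetical.

```python
import math
import random

def simulate_hits(offset, spread, n_hits=500, seed=0):
    # Draw 2D hits around a centre shifted `offset` to the right of the bulls-eye at (0, 0).
    rng = random.Random(seed)
    return [(offset + rng.gauss(0.0, spread), rng.gauss(0.0, spread))
            for _ in range(n_hits)]

def summarize(hits):
    # Centre of the cloud of hits.
    cx = sum(x for x, _ in hits) / len(hits)
    cy = sum(y for _, y in hits) / len(hits)
    # Distance of that centre from the bulls-eye (plays the role of bias).
    centre_distance = math.hypot(cx, cy)
    # Root-mean-square distance of the hits from their own centre (plays the role of variance).
    scatter = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in hits) / len(hits))
    return centre_distance, scatter

# (offset, spread) pairs chosen arbitrarily for illustration.
cases = {
    "low bias, low variance": (0.0, 0.1),
    "low bias, high variance": (0.0, 0.5),
    "high bias, low variance": (1.0, 0.1),
    "high bias, high variance": (1.0, 0.5),
}

results = {name: summarize(simulate_hits(offset, spread))
           for name, (offset, spread) in cases.items()}

for name, (centre_distance, scatter) in results.items():
    print(f"{name:26s} centre distance = {centre_distance:.3f}, scatter = {scatter:.3f}")
```

Feeding the hits to any plotting library would reproduce the familiar four-panel picture; the two summary numbers are enough to tell the cases apart without a plot.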

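[Figure 1: bulls-eye diagrams for the four combinations of low and high bias and variance]

[Figure 2: decomposition of the prediction error into bias, variance and noise terms]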

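1.3Mathematical Definition

If we denote the variable we are trying to predict as Y and our covariates as X, we may assume that there is a relationship between the two such as Y = f(X) + e, where the error term e is normally distributed with mean zero.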

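We can estimate a model \hat{f}(X) of f(X) using linear regression or another modeling technique. In this case, the expected squared prediction error at a point x is:

                                                 Err(x) = E[(Y - \hat{f}(x))^2]

This error can be decomposed into bias and variance components:

                                                 Err(x) = (E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + \sigma_e^2

                                                 Err(x) = Bias^2 + Variance + Irreducible Error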
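
For readers who want to see where the three terms come from, a short derivation follows. It assumes that the noise e has mean zero and variance \sigma_e^2 and is independent of the fitted model \hat{f}; the symbol \sigma_e^2 is notation introduced here for the noise variance.

```latex
\begin{aligned}
\mathrm{Err}(x)
  &= E\big[(Y - \hat{f}(x))^2\big] \\
  &= E\big[(f(x) + e - \hat{f}(x))^2\big] \\
  &= E\big[(f(x) - \hat{f}(x))^2\big] + 2\,E[e]\,E\big[f(x) - \hat{f}(x)\big] + E[e^2] \\
  &= E\big[(f(x) - \hat{f}(x))^2\big] + \sigma_e^2 \\
  &= \big(E[\hat{f}(x)] - f(x)\big)^2
     + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]
     + \sigma_e^2
\end{aligned}
```

The last step adds and subtracts E[\hat{f}(x)] inside the square; the cross term vanishes because \hat{f}(x) - E[\hat{f}(x)] has expectation zero and E[\hat{f}(x)] - f(x) is a constant.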

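The third term, the irreducible error, is the noise term in the true relationship, which no model can remove. Given the true model and infinite data to calibrate it, we could reduce both the bias and variance terms to 0. In the real world, however, we rarely have the exact model or unlimited data, so there is a tradeoff between minimizing the bias and minimizing the variance.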
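
The decomposition and the tradeoff can be checked numerically. The sketch below is a toy experiment under assumptions that are not in the original text: the true function `true_f`, the noise level `NOISE_SD`, the evaluation point, and the use of k-nearest-neighbour regression are all hypothetical choices made so the example runs with the standard library alone. Small k gives a flexible model, large k a rigid one.

```python
import math
import random
import statistics

NOISE_SD = 0.3  # assumed standard deviation of the noise term e

def true_f(x):
    # Assumed ground-truth relationship (hypothetical, for illustration only).
    return math.sin(2.0 * math.pi * x)

def knn_predict(xs, ys, x0, k):
    # k-nearest-neighbour regression: average the targets of the k closest inputs.
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return statistics.fmean([y for _, y in nearest])

def decompose(k, x0=0.25, n_train=30, n_repeats=4000, seed=0):
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_repeats):
        # New training data and a new model for every repetition.
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, NOISE_SD) for x in xs]
        pred = knn_predict(xs, ys, x0, k)
        # A fresh noisy observation Y at x0 to measure the prediction error.
        y_new = true_f(x0) + rng.gauss(0.0, NOISE_SD)
        preds.append(pred)
        sq_errors.append((y_new - pred) ** 2)
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    err = statistics.fmean(sq_errors)
    return bias_sq, variance, err

results = {k: decompose(k) for k in (1, 5, 30)}

for k, (bias_sq, variance, err) in results.items():
    total = bias_sq + variance + NOISE_SD ** 2
    print(f"k={k:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"bias^2+variance+noise={total:.4f}  measured Err={err:.4f}")
```

In this setup the sum of the three terms should track the measured error up to simulation noise. With k=1 the model follows the data closely, so bias is small and variance is large; with k=30, which here is simply the average of all training targets, the situation reverses. Neither extreme removes the noise term.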


