[Andrew Ng Machine Learning] Week 3 Condensed Course Notes: Logistic Regression and Regularization

Logistic Regression

1. Classification and Representation

(1) Classification

To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn’t work well because classification is not actually a linear function.

The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.)
(Figure 1)



(2) Hypothesis Representation

We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn’t make sense for $h_\theta(x)$ to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, let’s change the form of our hypothesis $h_\theta(x)$ to satisfy $0 \le h_\theta(x) \le 1$. This is accomplished by plugging $\theta^T x$ into the Logistic Function.

Our new form uses the “Sigmoid Function,” also called the “Logistic Function”:
$$\begin{aligned} h_\theta(x) &= g(\theta^T x) \\ g(z) &= \frac{1}{1+e^{-z}} \\ z &= \theta^T x \end{aligned}$$

The following image shows us what the sigmoid function looks like:
(Figure 2)
The function g(z), shown here, maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification.
$$\begin{aligned} &h_\theta(x) = P(y=1 \mid x;\theta) = 1 - P(y=0 \mid x;\theta) \\ &P(y=0 \mid x;\theta) + P(y=1 \mid x;\theta) = 1 \end{aligned}$$

$h_\theta(x)$ will give us the probability that our output is 1. For example, $h_\theta(x) = 0.7$ gives us a probability of 70% that our output is 1. Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if the probability that it is 1 is 70%, then the probability that it is 0 is 30%).
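As a quick numerical illustration of the hypothesis above, here is a minimal NumPy sketch (my own example, not from the course materials; the function names `sigmoid` and `hypothesis` and the toy values of `theta` and `x` are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """The logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), interpreted as P(y = 1 | x; theta)."""
    return sigmoid(np.dot(theta, x))

# Toy example; x includes the intercept term x_0 = 1.
theta = np.array([-1.0, 0.5])
x = np.array([1.0, 4.0])
print(hypothesis(theta, x))  # ~0.73, i.e. roughly a 73% chance that y = 1
```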



(3) Decision Boundary

1) Logistic regression

(Figure 4)
Remember:
$$z = 0,\ e^{0} = 1 \Rightarrow g(z) = \tfrac{1}{2}$$
$$z \to \infty,\ e^{-z} \to 0 \Rightarrow g(z) \to 1$$
$$z \to -\infty,\ e^{-z} \to \infty \Rightarrow g(z) \to 0$$

2) Decision Boundary

(Figure 5)

3) Non-Linear decision boundaries

(Figure 6)
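Putting the threshold facts above together: since $g(z) \ge 0.5$ exactly when $z = \theta^T x \ge 0$, a prediction only needs the sign of $\theta^T x$. The sketch below is my own illustration (not course code); it uses $\theta = [-3, 1, 1]$, the kind of values used in the lecture example, whose decision boundary is the line $x_1 + x_2 = 3$. Feeding in polynomial features such as $x_1^2$ and $x_2^2$ in the same way would produce a non-linear boundary like the one in the last figure.

```python
import numpy as np

def predict(theta, x):
    """Predict y = 1 iff h_theta(x) >= 0.5, which is equivalent to theta^T x >= 0."""
    return 1 if np.dot(theta, x) >= 0 else 0

# theta = [-3, 1, 1] gives the linear decision boundary x1 + x2 = 3.
theta = np.array([-3.0, 1.0, 1.0])
print(predict(theta, np.array([1.0, 1.0, 1.0])))  # 0: the point (1, 1) lies below the boundary
print(predict(theta, np.array([1.0, 2.5, 2.5])))  # 1: the point (2.5, 2.5) lies above it
```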

2. Logistic Regression Model

(1) Cost Function

(Figure 7)

We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.
(Figure 8)
Instead, our cost function for logistic regression looks like:

$$\begin{aligned} &J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) \\ &\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \qquad\quad\ \ \text{if } y = 1 \\ &\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \qquad \text{if } y = 0 \end{aligned}$$

When y = 1, we get the following plot for $J(\theta)$ vs $h_\theta(x)$:
(Figure 9)
If our correct answer ‘y’ is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.

Similarly, when y = 0, we get the following plot for $J(\theta)$ vs $h_\theta(x)$:
(Figure 10)

If our correct answer ‘y’ is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
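To make the shape of these two cost curves concrete, here is a tiny sketch of the per-example cost (my own illustration; `cost_single` is not a name used in the course):

```python
import numpy as np

def cost_single(h, y):
    """Cost(h_theta(x), y): -log(h) when y = 1, and -log(1 - h) when y = 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

print(cost_single(0.99, 1))  # ~0.01: confident and correct, so almost no cost
print(cost_single(0.01, 1))  # ~4.61: confident and wrong, so a very large cost
print(cost_single(0.99, 0))  # ~4.61: the mirror image for y = 0
```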



(2) Simplified Cost Function and Gradient Descent

We can compress our cost function’s two conditional cases into one case:
$$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$$
We can fully write out our entire cost function as follows:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
A vectorized implementation is:
$$\begin{aligned} h &= g(X\theta) \\ J(\theta) &= \frac{1}{m}\left(-y^T\log(h) - (1-y)^T\log(1-h)\right) \end{aligned}$$
We can work out the derivative part using calculus to get:
$$\begin{aligned} &\text{Repeat}\ \{ \\ &\quad \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \\ &\} \end{aligned}$$

Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.

A vectorized implementation is:
$$\theta := \theta - \frac{\alpha}{m}X^T\left(g(X\theta) - \vec{y}\right)$$

This cost function can be derived from statistics using the principle of maximum likelihood estimation, which is an idea in statistics for how to efficiently find parameters θ for different models. It also has the nice property of being convex.
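Putting the vectorized cost and the gradient-descent update together, a minimal NumPy sketch might look as follows (assumptions of this example: `X` is the m×(n+1) design matrix whose first column is all ones, `y` is a 0/1 vector, and the toy data, learning rate, and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h)), with h = g(X theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Simultaneously update every theta_j: theta := theta - (alpha/m) * X^T (g(X theta) - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

# Toy data: one feature plus the intercept column of ones.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))  # the cost drops well below its initial value log(2) ~ 0.69
```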



(3) Advanced Optimization

“Conjugate gradient”, “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent.

Advantages:
No need to manually pick $\alpha$; these methods have a clever inner loop that chooses the step size automatically.
They are often faster than gradient descent.

Disadvantages:
More complex.

We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they’re already tested and highly optimized. Octave provides them.
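In the course these optimizers are used through Octave's fminunc. As a rough Python analogue only (my own sketch, not part of the original notes), SciPy's `minimize` with the BFGS method accepts a function that returns both the cost and its gradient, and chooses the step size itself:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient, playing the role of fminunc's costFunction."""
    m = len(y)
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)  # clip to avoid log(0) during line searches
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Small non-separable toy set so that the optimum is finite.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
res = minimize(cost_and_grad, x0=np.zeros(X.shape[1]),
               args=(X, y), method='BFGS', jac=True)
print(res.x)  # optimized theta; note that no learning rate was chosen by hand
```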



3. Multiclass Classification: One-vs-all

Now we will approach the classification of data when we have more than two categories. Instead of y = {0,1} we will expand our definition so that y = {0,1…n}.
(Figure 15)
Since y = {0,1…n}, we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that ‘y’ is a member of one of our classes.
(Figure 16)
One-vs-all: Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes the classifier output:
$$\max_i\, h_\theta^{(i)}(x)$$
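A compact sketch of one-vs-all built on plain gradient descent (my own illustration; the helper names `train_binary` and `one_vs_all` and the three-class toy data are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.3, iters=2000):
    """Gradient descent for a single binary logistic regression classifier."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= (alpha / len(y)) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

def one_vs_all(X, y, num_classes):
    """Train one classifier per class i, relabeling y == i as 1 and everything else as 0."""
    return np.array([train_binary(X, (y == i).astype(float)) for i in range(num_classes)])

def predict(all_theta, x):
    """Pick the class i whose classifier h_theta^(i)(x) gives the highest probability."""
    return int(np.argmax(sigmoid(all_theta @ x)))

# Toy data: 3 classes along a single feature (plus the intercept column).
X = np.array([[1.0, 0.2], [1.0, 0.4], [1.0, 2.0], [1.0, 2.2], [1.0, 4.0], [1.0, 4.3]])
y = np.array([0, 0, 1, 1, 2, 2])
all_theta = one_vs_all(X, y, num_classes=3)
print(predict(all_theta, np.array([1.0, 4.2])))  # a point near the class-2 examples; should print 2
```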



4. Solving the Problem of Overfitting

(1) The Problem of Overfitting

(Figure 19)
Without formally defining what these terms mean, we’ll say the figure on the left shows an instance of underfitting—in which the data clearly shows structure not captured by the model—and the figure on the right is an example of overfitting.
(Figure 20)
Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. At the other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:

  1. Reduce the number of features:
  • Manually select which features to keep.
  • Use a model selection algorithm (studied later in the course).
  2. Regularization
  • Keep all the features, but reduce the magnitude of the parameters $\theta_j$.
  • Regularization works well when we have a lot of slightly useful features.


(2) Cost Function

If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.

Say we wanted to make the following function more quadratic:
$$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$
(Figure 23)

We’ll want to eliminate the influence of $\theta_3 x^3$ and $\theta_4 x^4$. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
$$\min_\theta\ \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2$$

We’ve added two extra terms at the end to inflate the cost of $\theta_3$ and $\theta_4$. Now, in order for the cost function to get close to zero, we will have to reduce the values of $\theta_3$ and $\theta_4$ to near zero. This will in turn greatly reduce the values of $\theta_3 x^3$ and $\theta_4 x^4$ in our hypothesis function. As a result, we see that the new hypothesis (depicted by the pink curve) looks like a quadratic function but fits the data better due to the extra small terms $\theta_3 x^3$ and $\theta_4 x^4$.
(Figure 24)
We could also regularize all of our theta parameters in a single summation as:
$$\min_\theta\ \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.

Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.
(Figure 25)
If $\theta_1 \approx 0$, $\theta_2 \approx 0$, $\theta_3 \approx 0$, and $\theta_4 \approx 0$, then $h_\theta(x)$ is approximately equal to $\theta_0$, which is a flat horizontal line.
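A small sketch of this regularized cost for linear regression (my own illustration; note that, matching the update rules in the next subsection, the penalty skips $\theta_0$):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/(2m)) * [ sum((X theta - y)^2) + lam * sum(theta_j^2 for j >= 1) ]."""
    m = len(y)
    err = X @ theta - y
    reg = lam * np.sum(theta[1:] ** 2)      # theta_0 is not penalized
    return (err @ err + reg) / (2 * m)

# Toy data that the unregularized model fits exactly with theta = [0, 1].
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])
print(regularized_cost(theta, X, y, lam=0.0))   # 0.0: perfect fit and no penalty
print(regularized_cost(theta, X, y, lam=10.0))  # ~1.67: the penalty on theta_1 now dominates
```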



(3) Regularized Linear Regression

Gradient Descent
We will modify our gradient descent function to separate out $\theta_0$ from the rest of the parameters, because we do not want to penalize $\theta_0$.
$$\begin{aligned} &\text{Repeat}\ \{ \\ &\quad \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)} \\ &\quad \theta_j := \theta_j - \alpha\left[\left(\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\right) + \frac{\lambda}{m}\theta_j\right] \qquad j \in \{1, 2, \dots, n\} \\ &\} \end{aligned}$$

The term $\frac{\lambda}{m}\theta_j$ performs our regularization. With some manipulation our update rule can also be represented as:
$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

The first term in the above equation, $1 - \alpha\frac{\lambda}{m}$, will always be less than 1. Intuitively you can see it as reducing the value of $\theta_j$ by some amount on every update. Notice that the second term is now exactly the same as it was before.
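The same update can be written as a short NumPy routine (a sketch under my own naming; the toy data, learning rate, and λ are arbitrary). The $(1 - \alpha\frac{\lambda}{m})$ shrinkage shows up as slightly smaller fitted parameters:

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One simultaneous regularized update of every theta_j; theta_0 is not shrunk."""
    m = len(y)
    err = X @ theta - y                  # h_theta(x^(i)) - y^(i) for every example
    grad = (X.T @ err) / m               # unregularized gradient
    grad[1:] += (lam / m) * theta[1:]    # add (lambda/m) * theta_j for j >= 1
    return theta - alpha * grad

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.1, lam=0.1)
print(theta)  # roughly [0.10, 0.95]; with lam = 0 it would converge to [0, 1]
```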

Normal Equation
To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
$$\theta = \left(X^TX + \lambda\cdot L\right)^{-1}X^Ty, \quad \text{where } L = \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}$$
L is a matrix with 0 at the top left and 1’s down the diagonal, with 0’s everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including $x_0$), multiplied by a single real number λ.

Recall that if m < n, then $X^TX$ is non-invertible. However, when we add the term $\lambda\cdot L$, the matrix $X^TX + \lambda\cdot L$ becomes invertible (it is no longer singular).
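A direct transcription of the regularized normal equation (again a sketch; `np.linalg.solve` is used instead of forming the inverse explicitly, which is the usual numerical practice):

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """theta = (X^T X + lam * L)^(-1) X^T y, where L is the identity with L[0, 0] = 0."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                        # do not regularize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(normal_equation_reg(X, y, lam=0.0))  # [0, 1]: plain least squares
print(normal_equation_reg(X, y, lam=0.1))  # roughly [0.10, 0.95]: theta_1 is shrunk
```

This agrees with the gradient-descent result above, as expected.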



(4) Regularized Logistic Regression

We can regularize logistic regression in a similar way that we regularize linear regression. As a result, we can avoid overfitting. The following image shows how the regularized function, displayed by the pink line, is less likely to overfit than the non-regularized function represented by the blue line:
(Figure 29)
Cost Function
Recall that our cost function for logistic regression was:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$

We can regularize this equation by adding a term to the end:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

The second sum, $\sum_{j=1}^{n}\theta_j^2$, means to explicitly exclude the bias term $\theta_0$. I.e., the θ vector is indexed from 0 to n (holding n+1 values, $\theta_0$ through $\theta_n$), and this sum explicitly skips $\theta_0$ by running from 1 to n, skipping 0. Thus, when computing the equation, we should continuously update the two following equations:
(Figure 30)
Advanced optimization
(Figure 31)
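For reference, here is a sketch of the regularized logistic cost and gradient that could be handed to the advanced optimizers mentioned above (my own function names; the bias term $\theta_0$ is again left out of the penalty):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """Regularized logistic cost; the (lambda/(2m)) * sum theta_j^2 term skips theta_0."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    return J + (lam / (2 * m)) * np.sum(theta[1:] ** 2)

def grad_reg(theta, X, y, lam):
    """Gradient of the regularized cost; theta_0 gets no (lambda/m) * theta_j term."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return grad

# At theta = 0 the cost is log(2) ~ 0.69 and the penalty contributes nothing yet.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
print(cost_reg(theta, X, y, lam=1.0), grad_reg(theta, X, y, lam=1.0))
```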



Exercise 2: Logistic Regression and Regularization

[Andrew Ng Machine Learning] Week 3 Programming Assignment: Logistic Regression and Regularization
