Andrew Ng Machine Learning notes - Course1 Week1

Course 1: Supervised Machine Learning: Regression and Classification

Week 1: Introduction to Machine Learning

supervised learning vs. unsupervised learning

supervised learning:

algorithms that learn a mapping from input x to output y. You give your learning algorithm examples to learn from that include the “right answers” (output labels).
e.g.

| input (X) | output (Y) | application |
| --- | --- | --- |
| email | spam? (0/1) | spam filtering |
| audio | text transcript | speech recognition |
| English | Spanish | machine translation |
| ad, user info | click? (0/1) | online advertising |
| image, radar info | position of other cars | self-driving car |
| image of phone | defect? (0/1) | visual inspection |

Regression: predict a number from infinitely many possible outputs
Classification: predict categories from a small number of possible outputs

unsupervised learning:

given data that isn’t associated with any output label y, find some structure, pattern, or anything interesting in the unlabeled data

Clustering: group similar data points together. e.g. Google news, DNA microarray, grouping customers
Anomaly Detection: find unusual data points. e.g. fraud detection
Dimensionality Reduction: compress data using fewer numbers

Regression model

Linear Regression with one variable

Notation:
$x$ = “input” variable, feature
$y$ = “output” variable, “target” variable
$m$ = number of training examples
$(x, y)$ = a single training example
$(x^{(i)}, y^{(i)})$ = the $i$-th training example

Univariate linear regression: linear regression with one variable, $f_{w,b}(x) = wx + b$
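A minimal sketch of this model in NumPy (the variable names and example numbers are illustrative assumptions, not from the course):

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear regression model f_{w,b}(x) = w*x + b.
    Works for a scalar x or a NumPy array of inputs."""
    return w * x + b

# example usage with made-up numbers
x_train = np.array([1.0, 2.0, 3.0])
print(predict(x_train, w=2.0, b=0.5))   # -> [2.5 4.5 6.5]
```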

Cost Function:
squared-error cost function
$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$
where $\hat{y}^{(i)} = f_{w,b}(x^{(i)})$
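A sketch of the cost computation under the same assumptions, reusing the `predict` helper defined above:

```python
def compute_cost(x, y, w, b):
    """Squared-error cost J(w,b) = (1/2m) * sum_i (f_{w,b}(x^(i)) - y^(i))^2."""
    m = x.shape[0]
    errors = predict(x, w, b) - y        # y_hat^(i) - y^(i) for each example
    return np.sum(errors ** 2) / (2 * m)

# example: a perfect fit on data from y = 2x + 0.5 gives zero cost
y_train = np.array([2.5, 4.5, 6.5])
print(compute_cost(x_train, y_train, w=2.0, b=0.5))   # -> 0.0
```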

$J(w,b)$ is bowl-shaped for the squared-error cost function.

Train the model with gradient descent

Gradient Descent:
repeat until convergence:
$w = w - \alpha \frac{\partial}{\partial w} J(w,b)$
$b = b - \alpha \frac{\partial}{\partial b} J(w,b)$
where $\alpha$ is the learning rate
Note: update $w$ and $b$ simultaneously, i.e., compute the partial derivatives for all the parameters before updating any of them (see the sketch below).
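A tiny sketch of one simultaneous update step; the gradient functions here are made-up placeholders just to show the update order (the real formulas for linear regression follow below):

```python
# Placeholder gradients, only to illustrate the update order (not the real J).
def dj_dw(w, b):
    return 2 * w - 1

def dj_db(w, b):
    return 2 * b + 1

alpha = 0.1
w, b = 0.0, 0.0

# Correct: compute both partial derivatives before updating either parameter.
tmp_dw = dj_dw(w, b)
tmp_db = dj_db(w, b)
w = w - alpha * tmp_dw
b = b - alpha * tmp_db

# Incorrect (not simultaneous): dj_db would see the already-updated w.
# w = w - alpha * dj_dw(w, b)
# b = b - alpha * dj_db(w, b)
```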


Choosing a different starting point (even one just a few steps away from the original starting point) may lead gradient descent to a different local minimum.

Learning Rate:
if $\alpha$ is too small, gradient descent will still work, but it may be slow.
if $\alpha$ is too large, gradient descent may overshoot and never reach the minimum; it may fail to converge, or even diverge.

If already at a local minimum, gradient descent leaves $w$ unchanged (since the slope = 0).

Gradient descent can reach a local minimum with a fixed learning rate, because as we get nearer to a local minimum the derivative automatically gets smaller, so gradient descent automatically takes smaller steps.
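A tiny 1-D illustration (with a made-up cost $J(w) = w^2$, so $\frac{dJ}{dw} = 2w$) of how the step size shrinks near the minimum even though $\alpha$ stays fixed:

```python
# Made-up 1-D example: J(w) = w^2, so dJ/dw = 2w.
alpha = 0.1
w = 4.0
for i in range(5):
    step = alpha * 2 * w     # the step shrinks as the derivative shrinks
    w = w - step
    print(f"iteration {i}: step = {step:.4f}, w = {w:.4f}")
# steps: 0.8000, 0.6400, 0.5120, ... each smaller than the last
```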

Gradient Descent for Linear Regression:
$w = w - \alpha \frac{\partial}{\partial w} J(w,b)$
$b = b - \alpha \frac{\partial}{\partial b} J(w,b)$
where
$\frac{\partial}{\partial w} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x^{(i)}$
$\frac{\partial}{\partial b} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$
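Putting the pieces together, a minimal sketch of batch gradient descent for univariate linear regression; the training data, learning rate, and iteration count are made-up for illustration:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for f_{w,b}(x) = w*x + b with squared-error cost."""
    m = x.shape[0]
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        errors = w * x + b - y              # f_{w,b}(x^(i)) - y^(i)
        dj_dw = np.sum(errors * x) / m      # partial derivative w.r.t. w
        dj_db = np.sum(errors) / m          # partial derivative w.r.t. b
        w -= alpha * dj_dw                  # simultaneous update: both gradients
        b -= alpha * dj_db                  # were computed from the old w, b
    return w, b

# example: data generated from y = 2x + 1, so we expect w ≈ 2 and b ≈ 1
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
print(gradient_descent(x_train, y_train, alpha=0.05, num_iters=5000))
```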
The squared-error cost function is convex (bowl-shaped), so it has a single global minimum and no other local minima. As long as the learning rate is chosen appropriately, gradient descent will always converge to the global minimum.

“Batch” gradient descent: each step of gradient descent uses all the training examples.
