Algorithms that learn a mapping from input x to output y: you give your learning algorithm examples to learn from that include the "right answers" (the output labels).
e.g.
| input (X) | output (Y) | application |
|---|---|---|
| email | spam? (0/1) | spam filtering |
| audio | text transcript | speech recognition |
| English | Spanish | machine translation |
| ad, user info | click? (0/1) | online advertising |
| image, radar info | position of other cars | self-driving car |
| image of phone | defect? (0/1) | visual inspection |
Regression: predict a number from infinitely many possible outputs
Classification: predict categories from a small number of possible outputs
Given data that isn't associated with any output label y, find some structure, pattern, or something interesting in the unlabeled data.
Clustering: group similar data points together. e.g. Google news, DNA microarray, grouping customers
Anomaly Detection: find unusual data points. e.g. fraud detection
Dimensionality Reduction: compress data using fewer numbers
Notation:
$x$ = "input" variable, feature
$y$ = "output" variable, "target" variable
$m$ = number of training examples
$(x, y)$ = single training example
$(x^{(i)}, y^{(i)})$ = $i$-th training example
Univariate linear regression: linear regression with one variable, $f_{w,b}(x) = wx + b$
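A minimal sketch of this model in Python/NumPy (the function and variable names here are just illustrative, not from the notes):

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear model f_{w,b}(x) = w*x + b; works on scalars or NumPy arrays."""
    return w * x + b

# example: three training inputs, with w = 2 and b = 0.5
x_train = np.array([1.0, 2.0, 3.0])
print(predict(x_train, w=2.0, b=0.5))   # [2.5 4.5 6.5]
```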
Cost Function:
squared-error cost function
$$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
where $\hat{y}^{(i)} = f_{w,b}(x^{(i)})$
$J(w,b)$ is bowl-shaped for the squared-error cost function.
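A sketch of the squared-error cost in Python, assuming `x` and `y` are NumPy arrays holding the $m$ training inputs and targets (names are illustrative):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w,b) = (1/(2m)) * sum_i (f_{w,b}(x^(i)) - y^(i))^2."""
    m = x.shape[0]
    y_hat = w * x + b                         # predictions f_{w,b}(x^(i)) for all i
    return np.sum((y_hat - y) ** 2) / (2 * m)

# example: data on the line y = 2x, so the cost is 0 at w = 2, b = 0
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(x, y, w=2.0, b=0.0))       # 0.0
```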
Gradient Descent:
repeat until convergence:
$$w = w - \alpha \frac{\partial}{\partial w} J(w,b)$$
$$b = b - \alpha \frac{\partial}{\partial b} J(w,b)$$
where $\alpha$ is the learning rate
Note: simultaneously update $w$ and $b$. "Simultaneously" means that you calculate the partial derivatives for all the parameters before updating any of the parameters.
Choosing a different starting point (even one just a few steps away from the original) may lead gradient descent to a different local minimum.
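A sketch of the update loop in Python, showing the simultaneous update: both partial derivatives are computed at the current $(w, b)$ before either parameter is overwritten. The `compute_gradients` callback and the fixed iteration count are assumptions for the sketch, not part of the notes:

```python
def gradient_descent(w, b, alpha, compute_gradients, num_iters=1000):
    """Generic gradient descent; compute_gradients(w, b) is assumed to return (dJ/dw, dJ/db)."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(w, b)   # both derivatives use the *current* w, b
        w = w - alpha * dj_dw                    # only now are the parameters updated
        b = b - alpha * dj_db
    return w, b
```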
Learning Rate:
If $\alpha$ is too small, gradient descent will still work, but it may be slow.
If $\alpha$ is too large, gradient descent may overshoot and never reach the minimum; it may fail to converge, or even diverge.
If already at a local minimum, gradient descent leaves $w$ unchanged (since the slope is 0).
Gradient descent can reach a local minimum with a fixed learning rate: as we get nearer to a local minimum, the derivative automatically gets smaller, so gradient descent automatically takes smaller steps.
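A tiny numeric illustration of the last point, using $J(w) = w^2$ as a stand-in cost (an assumption for illustration, not from the notes): as $w$ approaches the minimum at 0, the derivative $2w$ shrinks, so the step $\alpha \cdot 2w$ shrinks too, even though $\alpha$ is fixed.

```python
w, alpha = 4.0, 0.1
for _ in range(5):
    grad = 2 * w                                  # derivative of J(w) = w^2
    print(f"w = {w:.4f}, step size = {alpha * grad:.4f}")
    w = w - alpha * grad
# the printed step sizes shrink every iteration with a fixed learning rate
```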
Gradient Descent for Linear Regression:
$$w = w - \alpha \frac{\partial}{\partial w} J(w,b)$$
$$b = b - \alpha \frac{\partial}{\partial b} J(w,b)$$
where
$$\frac{\partial}{\partial w} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x^{(i)}$$
$$\frac{\partial}{\partial b} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
The squared-error cost function is convex: because of its bowl shape it has a single global minimum and no other local minima. So as long as the learning rate is chosen appropriately, gradient descent will always converge to the global minimum.
“Batch” gradient descent: each step of gradient descent uses all the training examples.
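Putting the pieces together, a sketch of batch gradient descent for univariate linear regression in Python/NumPy; each iteration uses all $m$ training examples, and the derivatives follow the formulas above (data and hyperparameters are illustrative):

```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of the squared-error cost for f_{w,b}(x) = w*x + b."""
    m = x.shape[0]
    err = (w * x + b) - y                          # f_{w,b}(x^(i)) - y^(i) for all i
    dj_dw = np.sum(err * x) / m
    dj_db = np.sum(err) / m
    return dj_dw, dj_db

def batch_gradient_descent(x, y, w=0.0, b=0.0, alpha=0.01, num_iters=10000):
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(x, y, w, b)   # sums over all training examples
        w, b = w - alpha * dj_dw, b - alpha * dj_db    # simultaneous update
    return w, b

# example: data generated from y = 2x + 1, so w and b should approach 2 and 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(batch_gradient_descent(x, y))
```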