Machine Learning - Stanford University

Course syllabus

Week 1:

  • Introduction
  • Model and Cost Function
  • Gradient Descent
  • Linear regression with one variable
  • Linear algebra review

Week 2:

  • Linear regression with multiple variables
  • Octave/Matlab Tutorial

Week 3:

  • Classification and Hypothesis Representation
  • Multiclass Classification
  • Logistic regression
  • Regularization

Week 4:

  • Neural networks: representation

Week 5:

  • Neural networks: learning
  • Cost Function and Backpropagation
  • Gradient Checking and Random Initialization

Week 6:

  • Advice for applying machine learning
  • Evaluating a learning algorithm
  • Bias vs. Variance
  • Handling Skewed Data

Week 7:

  • Support vector machines
  • Large Margin Classification
  • Kernels

Week 8:

  • Unsupervised learning
  • K-Means
  • Dimensionality reduction

Week 9:

  • Density Estimation
  • Anomaly detection
  • Recommender systems

Week 10:

  • Gradient Descent with Large Datasets

Week 11:

  • Application example: photo OCR

Software

  • Octave/MatLab

Week 1

Course content

  • Machine learning definition ( Field of study that gives computers the ability to learn without being explicitly programmed. )
  • Machine learning algorithms: Supervised learning and unsupervised learning, reinforcement learning, recommender systems.
  • Supervised learning (e.g., regression, classification, naive bayesian model, decision tree, random forest model, neural networks, support vector machines)
  • Unsupervised learning (clustering)
  • Model representation
    m = Number of training examples
    x = "input" variable/features
    y = "output" variable/"target" variable
    h = maps from x's to y's
  • Cost function

  • Bowl shape in 2D
  • Bowl shape in 3D
    Capture.PNG
  • Gradient descent algorithm
    Repeat until convergence:

    is called learning rate. A smaller would result in a smaller step and a larger results in a larger step. The direction in which the step is taken is determined by the partial derivative of . Depending on where one starts on the graph, one could end up at different points.
  • If is too small, gradient descent can be slow, while if is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
  • As we approach a local minimum, gradient descent will automatically take smaller steps. So no need to decrease over time.
  • Repeat until convergence
    \theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m}{(h_{\theta}(x^{(i)}) - y^{(i)})} \\ \theta_{1} := \theta_{1} - \alpha \frac{1}{m} \sum_{i=1}^{m}{((h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)})}
  • "Batch": Each step of gradient descent uses all the training examples.
  • Vector is an n 1 matrix
  • Matrices are usually denoted by uppercase names while vectors are lowercase.
  • "Scalar" means that an object is a single value, not a vector or matrix.

Week 2

Course content

  • Multiple features
    - value of feature j in the training example
    - the input (features) of the training example
    - the number of training examples
    - the number of features
    Hypothesis(or called Prediction):
  • Cost function

    Vectorized implementation:
  • Gradient descent
    Repeat until convergence

    simultaneously update for
    Gradient descent implementation steps

    Vectorized implementation:
  • Feature scaling (get every feature into approximately a range)
    *Feature scaling speeds up gradient descent by making it require fewer iterations to get to a good solution. *
  • Mean normalization (Replace with to make features have approximately zero mean (Do not apply to ))

    With:
    - averaged value of x in training set
    - range(max-min) or standard deviation
  • Normal equation


    Normal equation example and vectorized implementation
  • Comparison of gradient descent and normal equation
Gradient Descent Normal equation
Need to choose No need to choose
Needs many iterations Don't need to iterate
Works well even when n (features) is large Slow if n (features) is very large
  • What if is non-invertible?
  1. Redundant features (linearly dependent)
  2. Too many features (e.g., ), then delete some features, or use regularisation.
  • Unvectorized implementation:
  • Vectorized implementation:

Programming Assignment: Linear regression


Week 3

Course content

  • Classification (Logistic regression):

  • Hypothesis representation
    logistic function (or sigmoid function):


  • Decision boundary

    Predict if or ;
    Predict if or .
  • Cost function
    Convex problem has one global optimum, while non convex problem has multiple local optimum and it is hard to find which one is global optimum.

    if
    if
    When , we get the following plot for vs

    When , we get the following plot for vs

    if
    if and
    if and
    Note that writing the cost function in this way guarantees that is convex for logistic regression.
  • Simplified logistic cost function:

    To fit parameter :

    To make a prediction given new :
    output
    Gradient descent with vectorized implementation is:
    Vectorized implementation:
  • Advanced optimization algorithms:
    1. Gradient descent
    2. Conjugate gradient
    3. BFGS
    4. L-BFGS
      With algorithms No.2,3 and 4, doesn't need to be picked manually, and they are often faster than gradient descent, but they are more complex.
function [J_value, gradient] = costFunction(theta)
  J_value = [code to compute cost function]
  gradient (1) = [code to compute partial derivative of cost function with respect to theta 0]
  .
  .
  .
  gradient (n+1) = [code to compute partial derivative of cost function with respect to theta n]
  • Multiclass classification
    one-vs-all (one-vs_rest):
    We are basically choosing one class and lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returns the highest value as our prediction.
  • Overfitting
    1. underfitting, high bias
    2. just right
    3. overfitting, high variance
      Definition of overfitting: if we have too many features, the learned hypothesis may fit the training set very well (), but fail to generalize to new examples.

Note: If we have a lot of features, and , very little training data, then, overfitting can become a problem.

  • How to address overfitting:
    1. Reduce number of features
      • Manually select which features to keep
      • Model selection algorithm
    2. Regularization
      • Keep all the features, but reduce magnitude/values of parameters
      • Works well when we have a lot of features, each of which contributes a bit to predicting .
  • In regularized linear regression, we choose to minimize:

    If is set to an extremely large value, algorithm results in underfitting (fails to fit even the training set)
  • Regularized linear regression
    gradient descent:

    where will always be less than 1.
    Normal equation:

    where

    L is a matrix with 0 at the top left and 1 is down the diagonal with 0's everywhere else. It should have dimension
    Recall that is then is non-invertible. However, when we add the term , then becomes invertible.
  • Regularized logistic regression- Gradient descent
    Repeat until convergence

    simultaneously update for

Programming Assignment: Logistic regression

  • Feature mapping
    One way to fit the data better is to create more features from each data point. In the provided function mapFeature.m, we will map the features into all polynomial terms of and up to the sixth power.
  • Different regularization parameter
  1. With a small , you should find that the classifier gets almost every training example correct, but draws a very complicated boundary, thus overfitting the data.
  2. If is set to too high a value, you will not get a good fit and the decision boundary will not follow the data so well, thus underfitting the data.

Week 4

Course content

  • Suppose that you are learning to recognize cars from 2×2 pixel images (grayscale, not RGB). Let the features be pixel intensity values. If you train logistic regression including all the quadratic terms as features. you will have 10 features. If there are more pixels, then there will be millions of features. Therefore, you need neural network to perform.
  • Model representation
    Neuron model: logistic unit
    sigmoid(logistic) activation function:
    weights means parameters in
    = "activation" of unit i in layer j
    = matrix of weights controlling function mapping from layer j to layer j+1
    if network has units in layer , units in layer , then will be of dimension
  • Forward propagation vectorized implementation



  • Examples and intuitions
    A simple example of applying neural networks is by predicting AND
    XNOR operator using a hidden layer with two nodes
  • Multiclass classification

Programming Assignment: Multi-class Classification and Neural Networks

  • You will be using multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes, you will need to train 10 separate logistic regression classifiers.
  • Regularized logistic regression
    Cost function:

    Partial derivative of regularized logistic regression cost for is defined as:


    For vectorized implementation, please check Matlab code uploaded in github.
  • One-vs-all classification
    In this part of the exercise, you will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the classes in our dataset.
  • One-vs-all prediction
    *logistic regression cannot form more complex hypotheses as it is only a linear classier. (You could add more features such as polynomial features to logistic regression, but that can be very expensive to train.). The neural network will be able to represent complex models that form non-linear hypotheses. *
  • Your goal is to implement the feedforward propagation algorithm to use our trained weights for prediction. In next week's exercise, you will write the backpropagation algorithm for learning the neural network parameters.

Week 5

Course content

  • Neural network (classification)
    - total no. of layers in network
    - no. of units (not counting bias unit) in layer
    - number of output units/classes
  • Cost function of neural networks
    J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}[y_k^{(i)}log \ (h_{\theta}(x^{(i)}))_k + (1-y_k^{(i)})log(1 - (h_{\theta}(x^{(i)}))_k)] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2
  • backpropagation:
    "Backpropagation" is to minimizing cost function:
    backpropagation algorithm

    Note that is wrong in picture above, it should be as follows

    backpropagation algorithm.png
  • Unrolling parameters
    If the dimensions of Theta 1 is , Theta 2 is and Theta 3 is , then we "unroll" all the elements and put them into one long vecotr:
thetaVector = [Theta1(:); Theta2(:); Theta3(:);]
deltaVector = [D1(:); D2(:); D3(:)]

We can also get back our original matrices from the "unrolled" versions as follows:

Theta1 = reshape(thetaVector(1:100), 10, 11)
Theta2 = reshape(thetaVector(111:220), 10, 11)
Theta3 = reshape(thetaVector(221:231), 1, 11)
  • Gradient checking

    A small value for {\epsilon}ϵ (epsilon) such as , guarantees that the math works out properly.
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;

Once we compute gradApprox vector, we can check that gradApprox deltaVector.

  • Random initialization (symmetry breaking)
    Initializing each to a random value between .
If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
  • Training a neural network
  1. Randomly initialize the weights
  2. Implement forward propagation to get for any
  3. Implement the cost function
  4. Implement backpropagation to compute partial derivatives
  5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
  6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

Programming Assignment: Neural Networks Learning

  • Feedforward and regularized cost function
  • Backpropagation with random initialization
  • Adding regularization to gradient
  • Gradient checking
  • Learning a good set of parameters (Theta1, Theta2...) using fmincg()
  • Optional exercise: To see how the performance of the neural network varies with the regularization parameter and number of training steps (the MaxIter option when using fmincg)

Week 6

Course content

  • Debugging a learning algorithm
    1. Get more training examples
    2. Try smaller sets of features
    3. Try getting additional features
    4. Try adding polynomial features ( etc.)
    5. Try decreasing
    6. Try increasing
  • Typical split of dataset into training set and test set is 70% and 30% respectively.
  • The procedure using training set and test set:
    1. Learn and minimize using the training set
    2. Compute the test set error
  • The test set error
    1. For linear regression mean square error stuff.
    2. For classification 0/1 misclassification error
  • Model selection (e.g., degree of polynomial)
  • Train/(Cross) validation/Test sets (typically 60%/20%/20%)
  • We can now calculate three separate error values for the three different sets using the following method:
    1. Optimize the parameters in Θ using the training set for each polynomial degree.
    2. Find the polynomial degree d with the least error using the cross validation set.
    3. Estimate the generalization error using the test set with , (d = theta from polynomial with lower error);
  • High bias problem (underfitting problem) or high variance problem (overfitting problem)


    The relationship between degree of the polynomial d and the underfitting or overfitting of our hypothesis
  • Choosing the regularization parameter
    Screenshot 2020-12-08 at 20.31.32.png
  • In order to choose the model and the regularization term λ, we need to
    1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
    2. Create a set of models with different degrees or any other variants.
    3. Iterate through the s and for each go through all the models to learn some .
    4. Compute the cross validation error using the learned Θ (computed with λ) on the without regularization or λ = 0.
    5. Select the best combo that produces the lowest error on the cross validation set.
    6. Using the best combo Θ and λ, apply it on to see if it has a good generalization of the problem.
  • Learning curve
    1. If a learning algorithm is suffering from high bias (underfitting), adding polynomial features, or obtaining additional features will help, but getting more training data will not (by itself) help much.


      Typical learning curve for high bias
    2. If a learning algorithm is suffering from high variance (overfitting), reducing feature set or increasing the regularization parameter or getting more training data is likely to help.


      Typical learning curve for high variance
  • "Small" neural network is prone to underfitting, while "large" neural network is more prone to overfitting (use regularization, e.g., increase to address the overfitting)
  • How to improve the accuracy of the classifier:
    1. Collect lots of data (for example "honeypot" project but doesn't always work)
    2. Develop sophisticated features (for example: using email header data in spam emails)
    3. Develop algorithms to process your input in different ways (recognizing misspellings in spam).
  • Error analysis
  • The recommended approach to solving machine learning problems is to:
    1. Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
    2. Plot learning curves to decide if more data, more features, etc. are likely to help.
    3. Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
  • Error Metrics for Skewed Classes
    1. Precision/Recall


      Precision/Recall
  • Trading off precision and recall
  • F score

Programming Assignment: Regularized linear regression and bias vs. variance

  • Regularized linear regression
    1. Visualizing the dataset
    2. Regularized linear regression cost function
    3. Regularized linear regression gradient
    4. Fitting linear regression
  • Bias-variance
    1. Learning curve
  • Polynomial regression
    1. Learning polynomial regression
    2. Selecting lambda using a cross validation set
  • Large data rationale
    1. Use a learning algorithm with many parameters (e.g., logistic regression/linear regression with many features; neural network with many hidden units). - low bias algorithms
    2. Using a very large training set (unlikely to overfit) - low variance
      A combination of both hopefully will deliver a small
      Remember to normalizing polynomial features before using them
      If you are using cost function (linearRegCostFunction) to compute the training, cross validation and test error, you should call the function with the lambda argument set to 0. Do note that you will still need to use lambda when running the training to obtain the theta parameters.
      Add an error_metrics.m function to calculate error metrics.

Week 7

Course content

  • SVM hypothesis

    Hypothesis:
    if
    otherwise
  • SVM Decision Boundary
    C*0
    s.t. if
    if
  • Large margin intuition
    C is
  • Kernels and similarity:
    Gaussian kernel function:

    If :

    If is far from :
  • SVM with Kernels
  • SVM parameters
    Large C(): lower bias, high variance
    Small C(): higher bias, low variance
    Large : Features vary more smoothly, higher bias, lower variance
    Small : Feature vary less smoothly, lower bias, higher variance
  • Use SVM software package (e.g., liblinear, libsvm)
    Need to specify:
    • Choice of parameter C
    • choice of kernel (similarity function)
  • Kernel (similarity) function
    Do perform feature scaling before using the Gaussian kernel
  • Many off-the-shelf kernels available
    • Polynomial kernel
    • String kernel
  • n=number of features, m=number of training examples
    If n is large, use logistic regression or SVM without a kernel ("linear kernel")
    If n is small, m is intermediate, use SVM with Gaussian kernel
    If n is small, m is large, create/add more features, then use logistic regression or SVM without a kernel.
    Neural network likely to work well for most of these settings, but may be slower to train.

Programming Assignment: Support Vector Machines

  1. Part 1: using a Gaussian kernel with SVMs, 2D datasets
    The C parameter is a positive value that controls the penalty for misclassified training examples.
    Implementation Note:Most SVM software packages (including svmTrain.m) automatically add the extra feature for you and automatically take care of learning the intercept term . So when passing your training data to the SVM software, there is no need to add this extra feature yourself.
    In the last section of first part, Optimal and parameter is found by using the cross validation set.
    Recall that for classification, the error is defined as the fraction of the cross validation examples that were classified incorrectly. In MATLAB, you can compute this error using mean(double(predictions ~= yval)), where predictions is a vector containing all the predictions from the SVM, and yval are the true labels from the cross validation set.
  2. Part 2: spam classification

Week 8

Course content

  • Unsupervised learning
    While supervised learning algorithms need labeled examples (x,y), unsupervised learning algorithms need only the input (x).

  • K-means algorithm
    input:

    • K (number of clusters)
    • Training set
      := index (from 1 to K) of cluster centroid closest to
      := average (mean) of points assigned to cluster
  • Optimization objective
    J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K) = \frac{1}{m} \sum_{i=1}^m ||x^{(i)}-\mu_{c^{(i)}}||^2 \\ \min_{c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K} J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K)

  • K-means algorithm
    Randomly initialize cluster centroids
    Repeat{
    for i=1 to m
    index (from 1 to K) of cluster centroid closest to .
    for k=1 to K
    average (mean) of points assigned to cluster k.
    }

  • Random initialization
    Should have m is the number of samples
    Randomly pick training examples
    Set equal to these examples.
    Try multiple random initialization to avoid that k-mean is stuck in local optimum.

  • Choosing the number of clusters
    Elbow method
    Evaluate K-means based on a metric for how well it performs for that later purpose.

  • Dimensionality reduction

    1. Data compression (Maybe for batteries, this is not necessary, since there is lower dimensional data)
    2. Visualization
      After dimensionality reduction, k=2 or 3 since we can plot 2D or 3D data but don't have ways to visualize higher dimensional data.
  • Principal Component Analysis (PCA) problem formulation
    Reduce from n-dimension to k-dimension: Find k vectors () onto which to project the data so as to minimize the projection error.
    PCA is not linear regression

  • Principal Component Analysis algorithm
    Reduce data from n-dimensions to k-dimensions

    1. Compute "covariance matrix":
    2. Compute "eigen vectors" of matrix

      svd - singular value decomposition

    Obtaining

  • Reconstruction from compressed representation

  • Choosing the number of principal components k
    Typically, choose k to be smallest value so that

    or

    "99% of variance is retained"

  • Advice for applying PCA

    1. Supervised learning speedup
      Bad use of PCA: To prevent overfitting. Use regularization instead
  • Before implementing PCA, first try running whatever you want to do with the original/raw data . Only if that doesn't do what you want, then implement PCA and consider using .

Programming Assignment: K-Means clustering and principal component analysis

  1. K-Means clustering
    K-means algorithm is as follows:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end

1.1 Implementing K-Means
1.1.1 finding closest centroids
1.1.2 Computing centroid means


Week 9

Course content

  • Anomaly detection
    Check which have
  • Density estimation
    Training set:
    Each example is
  • Anomaly detection algorithm
  1. Choose features that you think might be indicative of anomalous examples, each example is .
  2. Fit parameters
  3. Given new example , compute :

    Anomaly if
  • Developing and evaluating an anomaly detection system.


    Comparison of anomaly detection and supervised learning
  • For non-gaussian features, some transformation on original data is needed in order to fit gaussian to it:
  1. log(x)
  2. log(x+c)
  • Choosing features that might take on unusually large or small values in the event of an anomaly
  • Multivariate Gaussian (Normal) distribution
    , don't model etc. separately. Model all in one go.
    Parameters:
Model expression Original model Multivariate Gaussian
expression p(x; \mu, \Sigma)
Correlation Manually create features to capture anomalies where take unusual combinations of values Automatically captures correlations between features
Computation power cheaper expensive
Application OK even if m(training set size) is small must have or else is non-invertible

Note: If is non-invertible, check 1. , 2. if there are redundant features.

  • Recommender systems
  • Collaborative filtering
  • vectorization: low rank matrix factorization
  • Mean normalization

This week's content is so similar to sensor fusion course.

Programming Assignment: Anomaly detection and recommender system

  1. Anomaly detection
    1.1 Gaussian distribution
    1.2 Estimating parameters for a Gaussian
    1.3 Selecting the threshold,
    1.4 High dimensional dataset
  2. Recommender systems
    2.1 Collaborative filtering cost
    2.2 Collaborative filtering gradient
    2.3 Regularized cost
    2.4 Regularized gradient

Week 10

Course content

  • Learning with large datasets
    "It is not who has the best algorithm that wins, it is who has the most data."
  • Stochastic gradient descent
    1. Randomly shuffle dataset
    2. Repeat{
      for {

      (for every
      }
      }
      Note:
  1. When the training set size is very large, stochastic gradient descent can be much faster than gradient descent.
  2. The cost function should go down with every literation of batch gradient descent but not necessarily with stochastic gradient descent.
  • Mini-batch gradient descent
    Batch gradient descent: Use all examples in each iteration.
    Stochastic gradient descent: Use 1 example in each iteration.
    Mini-batch gradient descent: Use examples in each iteration.
  • Checking for convergence
    Plot , averaged over the last 1000 examples. If curve is too noisy, try to increase number of examples or reducing learning rate .
    Learning rate is typically held constant can slowly decrease over time if we want to converge. (E.g., )
  • Online learning for large scale data
    Learn , which is called likelihood in sensor fusion course. stands for features here.
  • Map reduce and data parallelism
    Map-reduce

    Multi-core machines

    This week may be used in my project later in the future when we have gigabytes data available from cloud.

Quiz

  • As for linear regression with stochastic gradient descent, there is no need to apply map reduce techniques to them again, since computational power has already been decreased.
  • suppose that you are training a logistic regression classifier using stochastic gradient descent. It is found that the cost(, averaged over the last 500 examples), plotted as a function of the number of iterations, is slowly increasing over time. Then you should try decrease learning rate
  • Before running stochastic gradient descent, you should randomly shuffle the training set.

Week 11

Course content

  • Photo (Optical character recognition) OCR pipeline
    1. Text detection
    2. Character segmentation
    3. Character recognition
  • Sliding windows for text detection
  • Make sure you have a low bias classifier before getting more data.
  • Estimating the errors due to each component (ceiling analysis)
    What part of the pipeline should you spend the most time trying to improve?


    Ceiling analysis example

Questions arise from this course:

  1. Why don't go directly to deep learning methods instead of machine learning.
    Because feature matters more especially for batteries with less amount of data!
  2. Possible research ideas inspired by this course.

你可能感兴趣的:(Machine Learning - Stanford University)