Outline of PhD Course of Machine Learning in CMU

Introduction to Machine Learning - 10-701/15-781

Prof Alex Smola

Carnegie Mellon University 

Course URL: http://alex.smola.org/teaching/cmu2013-10-701/index.html

Overview

Machine learning studies the question “how can we build computer programs that automatically improve their performance through experience?” This includes learning to perform many types of tasks based on many types of experience. For example, it includes robots learning to better navigate based on experience gained by roaming their environments, medical decision aids that learn to predict which therapies work best for which diseases based on data mining of historical health records, and speech recognition systems that learn to better understand your speech based on experience listening to you.

This course is designed to give PhD students a thorough grounding in the methods, theory, mathematics and algorithms needed to do research and applications in machine learning. The topics of the course draw from machine learning, classical statistics, data mining, Bayesian statistics and information theory. Students entering the class with a pre-existing working knowledge of probability, statistics and algorithms will be at an advantage, but the class has been designed so that anyone with a strong numerate background can catch up and fully participate. If you are interested in this topic, but are not a PhD student, or are a PhD student not specializing in machine learning, you might consider Roni Rosenfeld's master's level course on Machine Learning, 10-601.

Syllabus

1. Introduction to Machine Learning

Applications

Optical character recognition

Bio informatics

Computational advertising

Self-driving cars

Network security

Programming with data

Classification, Regression, Annotation

Finding unusual events

Data

Supervised

Unsupervised

Semi-supervised and transductive inference

Basic tools

Linear classification, regression

Feature maps

Trees

Instance based classifiers

Challenges

Model selection, underfitting, overfitting

Validation, confidence

Explore exploit reactive environment

2. Basic Statistics

Probabilities

Dependence, independence, conditional probabilities

Bayes rule and Chain rule

Paradoxes in measure theory

Parameter estimation

Maximum Likelihood Estimation (MLE)

Maximum a Posterior Estimation (MAP)

Application I - Naive Bayes for spam filtering

Discrete features in Naive Bayes

Estimating parameters

Finite sample size problems

Application II - Naive Bayes for fMRI data processing

Continuous features in Naive Bayes

3. Instance Based Learning

Parzen windows

Basic idea (smoothing over empirical average)

Kernels

Model selection

Over-fitting and Under-fitting

Cross validation and leave-one-out estimation

Bias-variance trade off

Curse of dimensionality

Watson-Nadaraya estimators

Regression

Classification

Nearest neighbor estimator

Limit case via Parzen

Fast lookup

4. Perceptron

Application - Hebbian learning

Perceptron

Algorithm

Convergence proof

Properties

Kernel trick

Basic idea

Kernel Perceptron

Kernel expansion

Kernel examples

5. Support Vector Classification

Application - Optical Character Recognition

Support Vector Machines

Large Margin Classification

Optimization Problem

Dual Problem

Properties

Support Vectors

Support Vector expansion

Soft Margin Classifier

Noise tolerance

Optimization problem

Dual problem

6. Kernels

Kernel properties

Hilbert spaces

Convex cone, distances, norms

Mercer's theorem

Representer theorem

Kernel expansion

Kernel trick in general

Regularization and features

Shotgun approach

Regularization

Explicit feature map

Hash kernels

Application - spam filtering

Hashing trick

7. Kernels

Application - jet engine failure detection

Support Vector Novelty Detection

Density estimation

Thresholding and scaling

Optimization problem and dual

Online setting

Support Vector Regression

Loss functions

Optimization problem and dual

Examples

Kernel PCA

Principal Component Analysis

Kernelization

8. Stochastic Convergence

Convergence properties

Weak convergence

Convergence in probability

Strong (almost surely)

Guarantees

Law of large numbers

9. Tail bounds

Central limit theorem

Tail bounds (Markov, Chebyshev, Hoeffding, Bernstein, McDiarmid inequality)

Application - A/B testing for page layout

10. Risk Minimization

Risk and loss

Loss functions

Risk

Empirical risk vs True risk

Empirical Risk minimization

Underfitting and Overfitting

Examples

Classification

Regression

11. Learning Theory

Application of McDiarmid's inequality

Infinite many hypothesis

Uniform convergence bounds

PAC bound for estimation error

Structural risk minimzation

Vapnik Chervonenkis dimension

Shattering

Growth Function

Sauer's lemma

Vapnik-Chervonenkis inequality

Vapnik-Chervonenkis theorem

12. Gaussian Processes

Motivation - estimating height & weight

Gaussian Process

Mean function

Covariance function

Relation to kernels

Gaussian Process regression

Jointly normal distribution

Adding noise

Schur complement

Conditional Exponential Models

Logistic regression

Connection to SVMs

Poisson regression

13. Exponential Families

Definition

Properties (expectations, covariance)

Multinomial, Gauss, Binary, Poisson

Principal Component Analysis

Inference

Conjugate Priors

Norm-based priors

Maximum likelihood and Maximum-a-posteriori inference

Examples - Beta, Dirichlet, Gamma, Wishart

14. Principal Component Analysis

Definition

PCA algorithms

Applications

Eigen Faces

Image Compression

Image Denosing

15. Directed Graphical Models

Application: Sprinkler, earthquake and burglar

Conditional probabilities

Joint distribution

Parameter estimation for fully observed models

Statistical model

Conditional independence

Effects of observing variables (explaining away etc.)

Causal implications

16. Dynamic Programming

Application - Route Planning

Dynamic programming

Problem

Basic idea

Examples

Chains

Trees

Hypergraphs

Loops

Generalized Distributive Law

Semirings

Examples

17. Latent Variable Models

Application - clustering users for a web service

Variational inference

Problem

Variational inequality

Mixtures of Gaussians

Statistical model

Variational inequality

Expectation Maximization algorithm

k-means algorithm

Limit case of mixture of Gaussians

Simple algorithm

Hidden Markov Models

Statistical model

Variational inequality

Dynamic programming (Viterbi and Baum-Welch)

18. Sampling

Motivation

Intractable distributions

Application - stock market, parameter inference

Algorithms

Rejection sampling

Gibbs sampling

Metropolis Hastings sampler

Markov Chain Monte Carlo (just basic idea)

Markov Chains

19. Information Theory

Entropy

Basic properties, decomposition

Kullback-Leibler Divergence (with properties)

Mutual Information

Examples

Exponential families

Gaussian mutual information as special case

Application - cocktail party problem

Independent Component Analysis

JADE, Radical, InfoMax, and FastICA (time permitting)

20. Decision Trees

Application - an engineer discovers machine learning

Trees

Rules (if then else)

Objective functions & criteria for splitting

Pruning

Fast implementations

Acceleration by Chernoff bounds

Data distribution

21. Neural Networks

Application - finding cats on the internet without trying

Nonlinear basis functions

Optimizing weights

Optimizing function classes

Multiple layers

Hidden layers

Backprop

Architectures

Autoencoder

Boltzmann machine

Convolutional network

22. Boosting

Application - combining experts’ advice

Boosting problem

Weak learners

Multiplicative updates

Convergence bounds

Properties

Noise sensitiviy

Modifications for robustness (e.g. rho-boost)

Applications - Viola-Jones face detection

23. Kalman Filter

Application - robot navigation

Linear Dynamical systems

Linear dependence

Linear observation model

Forward smoothing

Inference

Rauch-Tung-Striebel smoother

Parameter inference

Gauss elimination

24. Reinforcement Learning

Application - operating a robot

Temporal Difference Learning

Q Learning

Value function

Partially Observed Markov Decision Process

Policy methods

Policy evaluation

Policy iteration

Policy gradient

25. Scalability

Application - building a startup

Infrastructure

Computers and Clouds

Storage

Fault tolerance

Processing paradigms

Map Reduce

Streaming

(key,value) store

Algorithmic templates

Consistent hashing

Distributed hash table

Machine learning

Distributed optimization

Distributed sampling

你可能感兴趣的:(Outline of PhD Course of Machine Learning in CMU)