Hinton's lectures (NN for ML) from lecture 5 to lecture 9

NN

  • NN
      • soft-max loss function
      • 5a Things that make it hard to recognize objects
      • 5b how to achieve viewpoint invariance
      • 5c cnn for hand-written digit recognition
        • BP for CNNs
        • what does replicating the feature detectors achieve
        • pooling
        • LeNet
      • 5d CNNs for object recognition
      • 6a overview of mini-batch gradient descent
      • 6b tricks of stochastic gradient descent
        • Four ways to speed up mini-batch learning
      • 6c The momentum method
        • A better type of momentum
      • 6d A separate adaptive learning rate for each connection
        • One way to determine the individual learning rates
      • 7a Modeling sequences A brief overview
      • 7b Training RNNs with backpropagation
      • 7c A toy example of training an RNN
      • 7d Why it is difficult to train RNNs
        • Four effective ways to learn an RNN
      • 8a HF Optimization
      • 8b modeling character strings with multiplicative connections
        • why model character strings
      • 9a overview of ways to improve generalization
        • how to prevent overfitting
        • how to limit the capacity of a NN
        • cross-validation
        • early stopping
      • 9b limiting the size of weights
      • 9c Using noise as a regularizer
        • add noise to the inputs
        • add noise to the weights
        • Using noise in the activities as a regularizer
      • 9d introduction to Bayesian Approach
      • 9e the Bayesian interpretation of weight decay
        • MAP: maximum a posteriori
      • 9f MacKay's quick and dirty method of fixing weight costs
      • 10a why it helps to combine models

perceptron

pretty limited: it has no means of dealing with linearly non-separable data, so hidden units are needed.

soft-max loss function

$$\frac{\partial C}{\partial z_i} = y_i - t_i$$

where

$$C = -\sum_j t_j \log y_j$$

where $t_j$ is the target value and $\sum_j t_j = 1$.

$$\frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j}\,\frac{\partial y_j}{\partial z_i}$$

$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

so, when $i = j$,

$$\frac{\partial y_j}{\partial z_i} = y_i(1 - y_i)$$

else

$$\frac{\partial y_j}{\partial z_i} = -\,y_i y_j$$

Besides,

$$\frac{\partial C}{\partial y_j} = -\frac{t_j}{y_j}$$

thus

$$\sum_j \frac{\partial C}{\partial y_j}\,\frac{\partial y_j}{\partial z_i}
= -t_i(1 - y_i) + \sum_{j \neq i} \frac{t_j}{y_j}\, y_i y_j
= -t_i(1 - y_i) + \sum_{j \neq i} t_j y_i
= -t_i + t_i y_i + \sum_{j \neq i} t_j y_i
= -t_i + y_i \sum_j t_j
= y_i - t_i$$
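A minimal numpy sketch (my own illustration, not from the lecture) that checks the result $\frac{\partial C}{\partial z_i} = y_i - t_i$ against finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, t):
    return -np.sum(t * np.log(y))    # C = -sum_j t_j log y_j

z = np.array([1.0, 2.0, 0.5])
t = np.array([0.0, 1.0, 0.0])        # one-hot target

y = softmax(z)
analytic_grad = y - t                # the result derived above: dC/dz_i = y_i - t_i

# finite-difference check of dC/dz_i
eps = 1e-6
numeric_grad = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric_grad[i] = (cross_entropy(softmax(zp), t) - cross_entropy(softmax(zm), t)) / (2 * eps)

print(analytic_grad)
print(numeric_grad)                  # the two should agree closely
```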


5a Things that make it hard to recognize objects

  • segmentation: real scenes are cluttered with other objects
    • it is hard to tell which pieces go together as parts of the same object
    • parts of an object can be hidden behind other objects
  • lighting: the intensities of the pixels are determined as much by the lighting as by the objects.
  • deformation: objects can deform in a variety of non-affine ways
  • affordances: object classes are often defined by how they are used.
  • viewpoint: changes in viewpoint cause changes in images that standard learning methods cannot cope with.

5b how to achieve viewpoint invariance

  • use redundant invariant features
    • but for recognition, we must avoid forming features from parts of different objects
  • put a box around the object and use normalized pixels
    • but choosing such a box is very difficult and we need to recognize the shape to get the box right!
    • the brute force normalization approach: try all possible boxes in a range of positions and scales
  • use replicated features with pooling (CNNs)
  • use a hierarchy of parts that have explicit poses relative to the camera

5c cnn for hand-written digit recognition

BP for CNNs

If we need to keep $w_1 = w_2$ (because of weight sharing), we need $\Delta w_1 = \Delta w_2$. So we compute $\frac{\partial E}{\partial w_1}$ and $\frac{\partial E}{\partial w_2}$ separately and use $\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}$ as the gradient for both $w_1$ and $w_2$.
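A toy sketch of this rule under my own assumptions (a single shared weight used in two places of a squared-error loss):

```python
# Toy loss that uses the same shared weight in two places:
# E = (w1 * x1 + w2 * x2 - t)^2   with the constraint w1 == w2.
x1, x2, t = 2.0, 3.0, 1.0
w = 0.5          # the single stored value standing in for both w1 and w2
lr = 0.01

for step in range(5):
    y = w * x1 + w * x2
    dE_dw1 = 2 * (y - t) * x1    # gradient w.r.t. the first copy
    dE_dw2 = 2 * (y - t) * x2    # gradient w.r.t. the second copy
    # use the SUM of the two gradients for the shared weight,
    # so both copies receive the same update and stay equal
    w -= lr * (dE_dw1 + dE_dw2)
    print(step, round(w, 4), round((y - t) ** 2, 4))
```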

what does replicating the feature detectors achieve?

  • equivariant activities
  • invariant knowledge

pooling

  • translational invariance
  • reduces the number of inputs to the next layer
  • problem: loses information about the precise position (a sketch follows below)
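A minimal numpy sketch of 2×2 max pooling (an illustration only; it assumes the feature map's sides are divisible by 2):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2, stride-2 max pooling over a 2-D feature map."""
    h, w = x.shape
    # group the map into non-overlapping 2x2 blocks and take the max of each
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 3]], dtype=float)

print(max_pool_2x2(fmap))
# [[4. 2.]
#  [2. 7.]]
# note how the exact position of each maximum inside its 2x2 block is lost
```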

LeNet

Here is the architecture of LeNet-5.


5d CNNs for object recognition

from hand-written digits to 3-D objects


6a overview of mini-batch gradient descent

  • online: update weights after each case; however, mini-batches are usually better than online.
  • stochastic gradient descent

6b tricks of stochastic gradient descent

  • initializing weights with small random values
  • shifting the inputs: e.g. the training cases (101, 101) → 2 and (101, 99) → 0 become (1, 1) → 2 and (1, −1) → 0 after subtracting the mean of each component, which gives a much less elongated error surface.
  • scaling the inputs: e.g. (0.1, 10) → 2 and (0.1, −10) → 0 become (1, 1) → 2 and (1, −1) → 0 after scaling each component to unit variance.
  • decorrelating the input components: PCA (Principal Components Analysis); a sketch of all three steps follows this list.
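A small numpy sketch (my own illustration) of the three preprocessing steps: shifting to zero mean, scaling to unit variance, and decorrelating with PCA. The synthetic data and mixing matrix are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic, correlated, badly scaled 3-D inputs (made up for the example)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(1000, 3)) @ mixing + 5.0

# 1) shift: subtract the mean of each input component
X = X - X.mean(axis=0)

# 2) scale: divide each component by its standard deviation
X = X / X.std(axis=0)

# 3) decorrelate: rotate onto the principal components (PCA)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
X = X @ eigvecs

# the covariance of the preprocessed data is now (approximately) diagonal
print(np.round(np.cov(X, rowvar=False), 3))
```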

Four ways to speed up mini-batch learning

  1. Use “momentum”
  2. Use separate adaptive learning rates for each parameter
  3. rmsprop
  4. Take a fancy method from the optimization literature that makes use of curvature information.

6c The momentum method

$$v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)$$

where $\alpha$ is slightly less than 1.

$$\Delta w(t) = v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t) = \alpha\, \Delta w(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)$$
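A minimal sketch of this update rule (my own illustration, minimizing the toy loss $E(w) = \frac{1}{2}w^2$):

```python
# classical momentum on E(w) = 0.5 * w**2, so dE/dw = w
alpha, eps = 0.9, 0.1    # momentum coefficient (slightly less than 1) and learning rate
w, v = 5.0, 0.0          # weight and velocity

for t in range(100):
    grad = w                      # dE/dw at the current location
    v = alpha * v - eps * grad    # v(t) = alpha * v(t-1) - eps * dE/dw(t)
    w = w + v                     # delta w(t) = v(t)

print(w)   # w ends up close to the minimum at 0
```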

A better type of momentum

Nesterov 1983

  • The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.
  • Ilya Sutskever(2012): First make a big jump in the direction of the previous accumulated gradient. Then measure the gradient where you end up and make a correction.

It’s better to correct a mistake after you have made it!
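A sketch of the Nesterov/Sutskever variant on the same toy loss (an illustration of the "jump first, then correct" idea, not anyone's exact code):

```python
# Nesterov-style momentum: the gradient is measured AFTER the momentum jump
alpha, eps = 0.9, 0.1
w, v = 5.0, 0.0

def grad(w):
    return w                              # dE/dw for E(w) = 0.5 * w**2

for t in range(100):
    w_ahead = w + alpha * v               # 1) big jump along the accumulated gradient
    v = alpha * v - eps * grad(w_ahead)   # 2) measure the gradient where you end up ...
    w = w + v                             # ... and make the corrected update

print(w)   # again converges toward the minimum at 0
```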


6d A separate, adaptive learning rate for each connection

Each connection in the NN should have its own adaptive learning rate, because the magnitudes of the gradients are often very different for different layers.

One way to determine the individual learning rates

  • Start with a local gain of 1 for every weight.
  • Increase the local gain if the gradient for that weight does not change sign

$$\Delta w_{ij} = -\varepsilon\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}$$

if

$$\frac{\partial E}{\partial w_{ij}}(t)\, \frac{\partial E}{\partial w_{ij}}(t-1) > 0$$

then

$$g_{ij}(t) = g_{ij}(t-1) + \delta$$

else

$$g_{ij}(t) = g_{ij}(t-1) \times (1 - \delta)$$

for example $\delta = 0.05$.
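A hedged numpy sketch of these additive-increase / multiplicative-decrease gains on a toy 2-D quadratic (the curvatures, learning rate, and gain limits are my own choices):

```python
import numpy as np

# adaptive gains on E(w) = 0.5 * w @ H @ w
H = np.diag([100.0, 1.0])          # very different curvature per dimension
w = np.array([1.0, 1.0])
gains = np.ones_like(w)            # start with a local gain of 1 for every weight
prev_grad = np.zeros_like(w)
eps, delta = 0.005, 0.05

for t in range(200):
    grad = H @ w
    same_sign = grad * prev_grad > 0
    # increase the gain additively if the gradient keeps its sign,
    # otherwise decrease it multiplicatively
    gains = np.where(same_sign, gains + delta, gains * (1 - delta))
    gains = np.clip(gains, 0.1, 10.0)   # common practice: keep gains in a sensible range
    w = w - eps * gains * grad
    prev_grad = grad

print(w, gains)   # w is near 0; the low-curvature dimension ends up with the larger gain
```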


7a Modeling sequences: A brief overview

targets

  • turn an input sequence into an output sequence that lives in a different domain.
  • predict the next term in the input sequence.
  • memoryless models for sequences

    • Autoregressive models
    • Feed-forward neural nets: generalizing autoregressive models by using one or more layers of non-linear hidden units.
  • Linear Dynamical Systems: have a real-valued hidden state that stores information

  • Hidden Markov Models(HMM): have a discrete one-of-N hidden state. Transitions between states are stochastic and controlled by a transition matrix. The outputs produced by a state are stochastic. More detailed information about HMM
  • Recurrent neural networks (I will give a clearer picture of RNNs in a later blog post)

7b Training RNNs with backpropagation


7c A toy example of training an RNN


7d Why is it difficult to train RNNs?

  • The backward pass is linear: the error derivatives get multiplied by the same recurrent weights at every time step
  • The problem of exploding or vanishing gradients (see the sketch below)
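A tiny numpy sketch (my own illustration) of why the linear backward pass causes this: the gradient is multiplied by the transposed recurrent weights at every time step, so its norm shrinks or grows geometrically.

```python
import numpy as np

def backward_through_time(spectral_radius, steps=50):
    # a 2x2 recurrent weight matrix: a rotation scaled to the chosen spectral radius
    W = spectral_radius * np.array([[0.6, 0.8],
                                    [-0.8, 0.6]])
    g = np.ones(2)                    # error derivative at the last time step
    for _ in range(steps):
        g = W.T @ g                   # the linear backward pass multiplies by W^T each step
    return np.linalg.norm(g)

print(backward_through_time(0.9))     # < 1: the gradient vanishes (shrinks like 0.9**50)
print(backward_through_time(1.1))     # > 1: the gradient explodes (grows like 1.1**50)
```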

Four effective ways to learn an RNN

  • Long Short Term Memory
    • Hochreiter & Schmidhuber (1997)
  • Hessian Free Optimization
  • Echo State Networks
  • Good initialization with momentum

8a HF Optimization

I will come back later.


8b modeling character strings with multiplicative connections

why model character strings

  • The web is composed of character strings.
  • pre-processing text to get words is a big hassle.

9a overview of ways to improve generalization

  • overfitting: the model cannot figure out which regularities are real and which are caused by sampling error.

how to prevent overfitting

  • more data
  • use a model with the right capacity
  • average many different models
  • use a single NN architecture, but average the predictions made by many different weight vectors

how to limit the capacity of a NN

  • architecture: limit the number of hidden layers and units per layer
  • early stopping
  • weight-decay
  • add noise to the weights or the activities.

cross-validation

  • training set
  • validation set
  • test set
    Note: the N error estimates obtained from N-fold cross-validation are not independent of one another.

early stopping

however, it’s hard to decide when performance is getting worse.
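One common heuristic (not from the lecture) is to keep the best validation error seen so far and stop when it has not improved for a fixed number of epochs; a minimal sketch, where `train_step` and `validation_error` are assumed to be supplied by the caller:

```python
def train_with_early_stopping(train_step, validation_error, max_epochs=1000, patience=10):
    """Stop training once the validation error has not improved for `patience` epochs.

    `train_step()` runs one epoch of training; `validation_error()` returns the
    current error on the validation set. Both are assumed to be provided by the caller.
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_step()
        err = validation_error()
        if err < best_error:
            best_error, best_epoch = err, epoch  # new best: remember it (and, in practice, the weights)
        elif epoch - best_epoch >= patience:
            break                                # no improvement for `patience` epochs: stop early
    return best_error, best_epoch
```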


9b limiting the size of weights

The standard L2 weight penalty involves adding an extra term to the cost function that penalizes the squared weights.

This keeps the weights small unless they have big error derivatives, and it prevents the network from using weights that it doesn't need.

$$C = E + \frac{\lambda}{2} \sum_i w_i^2$$

$$\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i$$

when

$$\frac{\partial C}{\partial w_i} = 0,$$

$$w_i = -\frac{1}{\lambda}\, \frac{\partial E}{\partial w_i}$$
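A minimal sketch (my own illustration) of how the penalty enters a gradient step:

```python
import numpy as np

def l2_penalized_gradient_step(w, grad_E, lr=0.01, lam=0.1):
    """One gradient step on C = E + (lam / 2) * sum(w_i ** 2)."""
    grad_C = grad_E + lam * w          # dC/dw_i = dE/dw_i + lam * w_i
    return w - lr * grad_C

w = np.array([2.0, -3.0, 0.5])
grad_E = np.array([0.1, -0.2, 0.0])    # pretend these came from backprop
print(l2_penalized_gradient_step(w, grad_E))
# each weight is also pulled toward zero in proportion to its size ("weight decay")
```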


9c Using noise as a regularizer

add noise to the inputs

Suppose we add Gaussian noise to the inputs. Then each input becomes

$$x_i + N(0, \sigma_i^2)$$

and, for a linear output unit, each noise term is amplified by the squared weight, so the output becomes

$$y + N\!\left(0, \sum_i w_i^2 \sigma_i^2\right)$$

If we try to minimize the squared error, we therefore tend to minimize the squared weights as well.

how does it work?

$$y^{\text{noisy}} = \sum_i w_i x_i + \sum_i w_i \varepsilon_i$$

where $\varepsilon_i$ is sampled from $N(0, \sigma_i^2)$.

$$E\!\left[(y^{\text{noisy}} - t)^2\right]
= E\!\left[\Big(y + \sum_i w_i \varepsilon_i - t\Big)^2\right]
= E\!\left[\Big((y - t) + \sum_i w_i \varepsilon_i\Big)^2\right]
= (y - t)^2 + E\!\left[2(y - t)\sum_i w_i \varepsilon_i\right] + E\!\left[\Big(\sum_i w_i \varepsilon_i\Big)^2\right]
= (y - t)^2 + E\!\left[\sum_i w_i^2 \varepsilon_i^2\right]
= (y - t)^2 + \sum_i w_i^2 \sigma_i^2$$

because $\varepsilon_i$ has zero mean and is independent of $\varepsilon_j$ for $j \neq i$.

Thus adding input noise with variance $\sigma_i^2$ is equivalent to an L2 penalty on $w_i$ with coefficient $\sigma_i^2$.
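A quick Monte-Carlo check of this identity for a single linear unit (my own illustration; the weights, inputs, and noise levels are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, -1.5, 2.0])
x = np.array([1.0, 0.2, -0.7])
sigma = np.array([0.3, 0.1, 0.2])      # per-input noise standard deviations
t = 1.0

y = w @ x                              # noise-free output of the linear unit
noise = rng.normal(0.0, sigma, size=(200_000, 3))   # eps_i ~ N(0, sigma_i^2)
y_noisy = (x + noise) @ w

empirical = np.mean((y_noisy - t) ** 2)
predicted = (y - t) ** 2 + np.sum(w ** 2 * sigma ** 2)
print(empirical, predicted)            # the two values should match closely
```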

add noise to the weights

Adding Gaussian noise to the weights of a multilayer non-linear net is not exactly equivalent to an L2 penalty. However, it may work better, especially in RNNs.

A good example is Alex Graves's RNN that recognizes handwriting.

Using noise in the activities as a regularizer

Making the activities noisy (e.g. treating the hidden units as binary and stochastic on the forward pass) does worse on the training set and trains considerably slower. Nevertheless, it does significantly better on the test set!!! (~(≧▽≦)/~)


9d introduction to Bayesian Approach

Assumption: we always have a prior distribution for everything

  • Prior may be vague.
  • When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution
  • The posterior favors parameter settings that make the data likely

9e the Bayesian interpretation of weight decay

Explain what’s really going on when we use weight decay to control the NN’s capacity.

Supervised Maximum Likelihood Learning

output of the net:

$$y_c = f(\text{input}_c,\, W)$$

the probability density of the target value given the output plus Gaussian noise:

$$p(t_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(t_c - y_c)^2}{2\sigma^2}}$$

$$-\log p(t_c \mid y_c) = k + \frac{(t_c - y_c)^2}{2\sigma^2}$$

Thus, if we minimize the squared error, we maximize the log probability under a Gaussian.

Why use the log?
Because it turns products into sums.

MAP: maximum a posteriori

$$p(W \mid D) = \frac{p(W)\, p(D \mid W)}{p(D)}$$

$$\text{Cost} = -\log p(W \mid D) = -\log p(W) - \log p(D \mid W) + \log p(D)$$

where $\log p(D)$ is a constant. Thus,

$$C = \frac{1}{2\sigma_D^2} \sum_c (y_c - t_c)^2 + \frac{1}{2\sigma_W^2} \sum_i w_i^2$$

Multiplying through by $2\sigma_D^2$ gives

$$2\sigma_D^2\, C = E + \frac{\sigma_D^2}{\sigma_W^2} \sum_i w_i^2$$

This is the weight penalty: the weight cost is determined by the ratio of the two variances.


9f MacKay's quick and dirty method of fixing weight costs


10a why it helps to combine models
