NN
- NN
- soft-max loss function
- 5a Things that make it hard to recognize objects
- 5b how to achieve viewpoint invariance
- 5c cnn for hand-written digit recognition
- BP for CNNs
- what does the replicating the feature detectors achieve
- pooling
- LeNet
- 5d CNNs for object recognition
- 6a overview of mini-batch gradient descent
- 6b tricks of stochastic gradient descent
- Four ways to speed up mini-batch learning
- 6c The momentum method
- A better type of momentum
- 6d A separate adaptive learning rate for each connection
- One way to determine the individual learning rates
- 7a Modeling sequences A brief overview
- 7b Training RNNs with backpropagation
- 7c A toy example of training an RNN
- 7d Why it is difficult to train an RNN
- Four effective ways to learn an RNN
- 8a HF Optimization
- 8b modeling character strings with multiplicative connections
- why model character strings
- 9a overview of ways to improve generalization
- how to prevent overfitting
- how to limit the capacity of a NN
- cross-validation
- early stopping
- 9b limiting the size of weights
- 9c Using noise as a regularizer
- add noise to the inputs
- add noise to the weights
- Using noise in the activities as a regularizer
- 9d introduction to Bayesian Approach
- 9e the Bayesian interpretation of weight decay
- 9f MacKay's quick and dirty method of fixing weight costs
- 10a why it helps to combine models
perceptron
- pretty limited: no way to handle linearly non-separable data
- hidden units are needed
soft-max loss function
$$\frac{\partial C}{\partial z_i} = y_i - t_i$$

where

$$C = -\sum_j t_j \log y_j$$

where $t_j$ is the target value and $\sum_j t_j = 1$.

$$\frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j}\,\frac{\partial y_j}{\partial z_i}, \qquad y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

so when $i = j$, $\frac{\partial y_j}{\partial z_i} = y_i(1 - y_i)$; otherwise $\frac{\partial y_j}{\partial z_i} = -y_i y_j$.

Besides, $\frac{\partial C}{\partial y_j} = -\frac{t_j}{y_j}$.

Thus

$$\sum_j \frac{\partial C}{\partial y_j}\,\frac{\partial y_j}{\partial z_i}
= -\frac{t_i}{y_i}\, y_i (1 - y_i) + \sum_{j \neq i} \left(-\frac{t_j}{y_j}\right)(-y_i y_j)
= -t_i (1 - y_i) + \sum_{j \neq i} t_j y_i
= -t_i + t_i y_i + \sum_{j \neq i} t_j y_i
= -t_i + y_i \sum_j t_j
= y_i - t_i$$
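As a sanity check (my own, not part of the lecture), here is a small numpy sketch comparing the analytic gradient $y_i - t_i$ with a finite-difference estimate; the logits `z` and one-hot target `t` are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.3, -1.2, 2.0])
t = np.array([0.0, 1.0, 0.0])        # one-hot target, sum(t) = 1

analytic = softmax(z) - t            # dC/dz_i = y_i - t_i

# central finite-difference estimate of dC/dz_i
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], t) -
     cross_entropy(z - eps * np.eye(3)[i], t)) / (2 * eps)
    for i in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```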
5a Things that make it hard to recognize objects
- segmentation: real scenes are cluttered with other objects
- it is hard to tell which pieces go together as parts of the same object
- parts of an object can be hidden behind other objects
- lighting: the intensities of the pixels are determined as much by the lighting as by the objects.
- deformation: objects can deform in a variety of non-affine ways
- affordances: object classes are often defined by how they are used.
- viewpoint: changes in viewpoint cause changes in images that standard learning methods cannot cope with.
5b how to achieve viewpoint invariance
- use redundant invariant features
- but for recognition, we must avoid forming features from parts of different objects
- put a box around the object and use normalized pixels
- but choosing such a box is very difficult and we need to recognize the shape to get the box right!
- the brute force normalization approach: try all possible boxes in a range of positions and scales
- use replicated features with pooling (CNNs)
- use a hierarchy of parts that have explicit poses relative to the camera
5c cnn for hand-written digit recognition
BP for CNNs
If we need $w_1 = w_2$ (because of weight sharing), we need $\Delta w_1 = \Delta w_2$.
So we compute $\frac{\partial E}{\partial w_1}$ and $\frac{\partial E}{\partial w_2}$ separately and use $\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}$ to update both $w_1$ and $w_2$.
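A minimal sketch of this rule, assuming a toy linear unit $y = w_1 x_1 + w_2 x_2$ with the two weights tied; the data and learning rate are made up for illustration:

```python
# Tied weights: compute the two gradients separately, sum them,
# and apply the same update to both copies so they stay equal.
w = 0.5                      # the single shared value used for both w1 and w2
x1, x2, t = 1.0, 2.0, 3.0    # made-up training case
lr = 0.1

y = w * x1 + w * x2          # forward pass with tied weights
E = 0.5 * (y - t) ** 2       # squared error

dE_dw1 = (y - t) * x1        # gradient as if w1 were free
dE_dw2 = (y - t) * x2        # gradient as if w2 were free

w -= lr * (dE_dw1 + dE_dw2)  # one update shared by w1 and w2
```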
what does replicating the feature detectors achieve?
- equivariant activities
- invariant knowledge
pooling
- translational invariance
- reducing the number of inputs to the next layer
- problem: we lose information about the precise position (see the sketch below).
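A small illustration (my own, assuming 2×2 max pooling with stride 2): a one-pixel shift inside a pooling window leaves the pooled map unchanged, but the exact position is no longer recoverable.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D feature map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 1] = 9.0   # one active unit
b = np.zeros((4, 4)); b[0, 0] = 9.0   # same unit shifted by one pixel

# True: the small shift is absorbed by the pooling window ...
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))
# ... but the pooled map no longer tells us whether the 9 was at column 0 or column 1.
```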
LeNet
Here is the architecture of LeNet-5.
5d CNNs for object recognition
from hand-written digits to 3-D objects
6a overview of mini-batch gradient descent
- online: update weights after each case; however, mini-batches are usually better than online.
- stochastic gradient descent
6b tricks of stochastic gradient descent
- initializing weights with small random values
- shifting the inputs (each triple is input1, input2, target): (101, 101, 2) (101, 99, 0) → (1, 1, 2) (1, -1, 0)
- scaling the inputs: (0.1, 10, 2) (0.1, -10, 0) → (1, 1, 2) (1, -1, 0)
- decorrelating the input components: PCA (Principal Components Analysis); see the sketch after this list
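A rough numpy sketch of these three preprocessing steps on made-up two-dimensional inputs (shift to zero mean, scale to unit variance, decorrelate with PCA):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=[1.0, 10.0], size=(500, 2))   # made-up raw inputs

# Shift: subtract the mean so each component averages zero.
X_shifted = X - X.mean(axis=0)

# Scale: divide by the standard deviation so each component has unit variance.
X_scaled = X_shifted / X_shifted.std(axis=0)

# Decorrelate: rotate onto the principal components (PCA).
cov = np.cov(X_scaled, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_decorrelated = X_scaled @ eigvecs      # off-diagonal covariance is now ~0

print(np.round(np.cov(X_decorrelated, rowvar=False), 3))
```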
Four ways to speed up mini-batch learning
- Use “momentum”
- Use separate adaptive learning rates for each parameter
- rmsprop (a minimal sketch follows after this list)
- Take a fancy method from the optimization literature that makes use of curvature information.
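rmsprop is not derived in these notes; as a rough sketch, one common formulation keeps a moving average of the squared gradient and divides the gradient by its square root (the hyperparameter values here are just typical defaults, not from the lecture):

```python
import numpy as np

def rmsprop_step(w, grad, ms, lr=0.001, decay=0.9, eps=1e-8):
    """One rmsprop update: scale the step by a moving RMS of recent gradients."""
    ms = decay * ms + (1 - decay) * grad ** 2   # moving average of the squared gradient
    w = w - lr * grad / (np.sqrt(ms) + eps)     # per-parameter effective learning rate
    return w, ms

# usage: ms starts as np.zeros_like(w) and is carried from step to step
```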
6c The momentum method
$$v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)$$

where $\alpha$ is slightly less than 1.

$$\Delta w(t) = v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t) = \alpha\, \Delta w(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)$$
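A minimal sketch of this update rule on an assumed toy quadratic error surface (the matrix `A`, learning rate, and momentum value are made up):

```python
import numpy as np

def grad(w):
    # gradient of a toy quadratic error E(w) = 0.5 * w^T A w (illustration only)
    A = np.array([[10.0, 0.0], [0.0, 1.0]])
    return A @ w

w = np.array([1.0, 1.0])
v = np.zeros_like(w)
alpha, eps = 0.9, 0.01           # alpha slightly less than 1

for _ in range(100):
    v = alpha * v - eps * grad(w)    # v(t) = alpha*v(t-1) - eps*dE/dw(t)
    w = w + v                        # delta w(t) = v(t)
```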
A better type of momentum
Nesterov (1983)
- The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.
- Ilya Sutskever (2012): first make a big jump in the direction of the previous accumulated gradient, then measure the gradient where you end up and make a correction.
It's better to correct a mistake after you have made it!
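Continuing the same assumed toy quadratic, a sketch of the Nesterov-style variant: jump first with the accumulated velocity, then measure the gradient at the look-ahead point.

```python
import numpy as np

def grad(w):
    # same toy quadratic error surface as in the momentum sketch above
    A = np.array([[10.0, 0.0], [0.0, 1.0]])
    return A @ w

w = np.array([1.0, 1.0])
v = np.zeros_like(w)
alpha, eps = 0.9, 0.01

for _ in range(100):
    lookahead = w + alpha * v              # big jump along the previous accumulated gradient
    v = alpha * v - eps * grad(lookahead)  # measure the gradient where you end up, then correct
    w = w + v
```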
6d A separate, adaptive learning rate for each connection
each connection in the NN should have its own adaptive learning rate.
The magnitudes of the gradients are often very different for different layers
One way to determine the individual learning rates
- Start with a local gain of 1 for every weight.
- Increase the local gain if the gradient for that weight does not change sign
$$\Delta w_{ij} = -\varepsilon\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}$$

if

$$\frac{\partial E}{\partial w_{ij}}(t)\;\frac{\partial E}{\partial w_{ij}}(t-1) > 0$$

then

$$g_{ij}(t) = g_{ij}(t-1) + \delta$$

else

$$g_{ij}(t) = g_{ij}(t-1) \times (1 - \delta)$$

for example $\delta = 0.05$.
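A sketch of this gain-adaptation rule for weights stored in a numpy array (the clipping range is an extra assumption to keep the gains reasonable, not part of the formula above):

```python
import numpy as np

def adaptive_gain_step(w, grad_now, grad_prev, gains, eps=0.01, delta=0.05):
    """One step of the per-weight adaptive-gain rule sketched above."""
    agree = grad_now * grad_prev > 0              # did the gradient keep its sign for this weight?
    gains = np.where(agree, gains + delta,        # additive increase
                     gains * (1 - delta))         # multiplicative decrease
    gains = np.clip(gains, 0.1, 10.0)             # assumed sensible range for the gains
    w = w - eps * gains * grad_now
    return w, gains

# Start with a local gain of 1 for every weight:
w, gains = np.zeros(3), np.ones(3)
```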
7a Modeling sequences: A brief overview
targets
- turn an input sequence into an output sequence that lives in a different domain.
- predict the next term in the input sequence.
7b Training RNNs with backpropagation
7c A toy example of training an RNN
7d Why it is difficult to train an RNN
- The backward pass is linear
- The problem of exploding or vanishing gradients
Four effective ways to learn an RNN
- Long Short Term Memory
Hochreiter & Schmidhuber (1997)
- Hessian Free Optimization
- Echo State Networks
- Good initialization with momentum
8a HF Optimization
I will come back later.
8b modeling character strings with multiplicative connections
why model character strings?
- The web is composed of character strings.
- pre-processing text to get words is a big hassle.
9a overview of ways to improve generalization
- overfitting: the model cannot figure out which regularities are real and which are caused by sampling error.
how to prevent overfitting
- more data
- use a model with the right capacity
- average many different models
- use a single NN architecture, but average the predictions made by many different weight vectors
how to limit the capacity of a NN
- architecture: limit the number of hidden layers and units per layer
- early stopping
- weight-decay
- add noise to the weights or the activities.
cross-validation
- training set
- validation set
- test set
N-fold cross-validation
the N estimates of the error are not independent.
early stopping
however, it’s hard to decide when performance is getting worse.
9b limiting the size of weights
The standard L2 weight penalty involves adding an extra term to the cost function that penalizes the squared weights.
This keeps the weights small unless they have big error derivatives. It prevents the network from using weights that it doesn't need.
$$C = E + \frac{\lambda}{2}\sum_i w_i^2$$

$$\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i$$

when $\frac{\partial C}{\partial w_i} = 0$, $w_i = -\frac{1}{\lambda}\frac{\partial E}{\partial w_i}$.
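A one-step sketch of how the penalty enters the update (the weights, gradient, and hyperparameters are made up):

```python
import numpy as np

lam, eps = 1e-4, 0.01                 # assumed penalty strength and learning rate
w = np.array([0.5, -2.0, 0.1])        # made-up weights
grad_E = np.array([0.3, 0.0, -0.2])   # made-up dE/dw

# dC/dw_i = dE/dw_i + lambda * w_i, so every weight is pulled toward zero
# unless its error derivative pushes back.
w = w - eps * (grad_E + lam * w)
```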
9c Using noise as a regularizer
Suppose we add Gaussian noise to the inputs.
Then the input becomes

$$x_i + N(0, \sigma_i^2)$$

and, for a linear output unit, the output becomes

$$y + N\!\left(0, \sum_i w_i^2 \sigma_i^2\right)$$

So if we try to minimize the squared error, we also tend to minimize the squared weights.

How does it work?

$$y^{\text{noisy}} = \sum_i w_i x_i + \sum_i w_i \varepsilon_i$$

where $\varepsilon_i$ is sampled from $N(0, \sigma_i^2)$.

$$E\!\left[(y^{\text{noisy}} - t)^2\right]
= E\!\left[\left(y + \sum_i w_i \varepsilon_i - t\right)^2\right]
= E\!\left[\left((y - t) + \sum_i w_i \varepsilon_i\right)^2\right]
= (y - t)^2 + E\!\left[2(y - t)\sum_i w_i \varepsilon_i\right] + E\!\left[\left(\sum_i w_i \varepsilon_i\right)^2\right]
= (y - t)^2 + E\!\left[\sum_i w_i^2 \varepsilon_i^2\right]
= (y - t)^2 + \sum_i w_i^2 \sigma_i^2$$

because $\varepsilon_i$ has zero mean and is independent of $\varepsilon_j$ for $i \neq j$, so the cross terms vanish.
Thus, adding input noise with variance $\sigma_i^2$ is equivalent to an L2 penalty with coefficient $\sigma_i^2$.
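A quick Monte-Carlo check of this equivalence for a single linear unit (weights, inputs, noise levels, and target are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, -2.0, 1.5])
x = np.array([1.0, 0.3, -0.7])
sigma = np.array([0.1, 0.2, 0.05])      # per-input noise standard deviations
t = 0.4

y = w @ x
expected = (y - t) ** 2 + np.sum(w ** 2 * sigma ** 2)   # (y-t)^2 + sum_i w_i^2 sigma_i^2

# Monte-Carlo estimate of E[(y_noisy - t)^2]
noise = rng.normal(0.0, sigma, size=(200_000, 3))
y_noisy = (x + noise) @ w
print(expected, np.mean((y_noisy - t) ** 2))            # the two numbers agree closely
```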
add noise to the weights
Adding noise to the weights of a multilayer non-linear neural net is not exactly equivalent to an L2 penalty. However, it may work better, especially in RNNs.
Example: Alex Graves's RNN that recognizes handwriting.
Using noise in the activities as a regularizer
It does worse on the training set and trains considerably slower. Nevertheless, it does significantly better on the test set!!! (~(≧▽≦)/~)
9d introduction to Bayesian Approach
Assumption: we always have a prior distribution for everything
- Prior may be vague.
- When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution
- It favors parameter settings that make the data likely
9e the Bayesian interpretation of weight decay
Explain what’s really going on when we use weight decay to control the NN’s capacity.
Supervised Maximum Likelihood Learning
output of the net:
$$y_c = f(\text{input}_c, W)$$
the probability density of the target value given output + Gaussian noise:
$$p(t_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(t_c - y_c)^2}{2\sigma^2}}$$

$$-\log p(t_c \mid y_c) = k + \frac{(t_c - y_c)^2}{2\sigma^2}$$
Thus, if we minimize the squared error, we maximize the log probability under a Gaussian.
Why log? Because it turns products into sums.
MAP: maximum a posterior
$$p(W \mid D) = \frac{p(W)\, p(D \mid W)}{p(D)}$$

$$\text{Cost} = -\log p(W \mid D) = -\log p(W) - \log p(D \mid W) + \log p(D)$$

where $\log p(D)$ is constant. Thus, with a Gaussian likelihood (variance $\sigma_D^2$) and a zero-mean Gaussian prior on the weights (variance $\sigma_W^2$),

$$C^* = \frac{1}{2\sigma_D^2}\sum_c (y_c - t_c)^2 + \frac{1}{2\sigma_W^2}\sum_i w_i^2$$

Multiplying through by $2\sigma_D^2$ gives

$$C = E + \frac{\sigma_D^2}{\sigma_W^2}\sum_i w_i^2$$
This is the weight penalty.
9f MacKay's quick and dirty method of fixing weight costs
10a why it helps to combine models