1. Deep Learning Overview
•Train networks with many layers (vs. shallow nets with just a couple of layers)
•Multiple layers work to build an improved feature space
–First layer learns 1st order features (e.g. edges…)
–2nd layer learns higher order features (combinations of first layer features, combinations of edges, etc.)
–Early layers usually learn in an unsupervised mode and discover general features of the input space, which can serve multiple tasks on the same kind of data (image recognition, etc.)
–The final layer's features are then fed into supervised layer(s)
•The entire network is often subsequently fine-tuned with supervised training of the whole net, starting from the weights learned in the unsupervised phase
–Fully supervised versions are also possible (as in early backpropagation attempts)
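The pipeline above can be summarized structurally. The following is a minimal sketch with hypothetical layer widths and activation choices (PyTorch is used purely for illustration): a stack of feature layers intended to be pretrained without labels, feeding a final supervised layer.

```python
# Structural sketch only: hypothetical widths/activations, not a prescribed design.
import torch.nn as nn

feature_layers = nn.Sequential(
    nn.Linear(784, 256), nn.Sigmoid(),   # 1st-order features (e.g. edge-like)
    nn.Linear(256, 64), nn.Sigmoid(),    # higher-order combinations of those features
)
supervised_head = nn.Linear(64, 10)      # final supervised layer(s)
model = nn.Sequential(feature_layers, supervised_head)
```

The feature layers are the part that would be learned in the unsupervised phase; the head, and then the whole model, is trained with labels as detailed in the sections below.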
2. Why Deep Learning
•Biological Plausibility – e.g. Visual Cortex
•There are proofs that problems which can be represented with a polynomial number of nodes using k layers may require an exponential number of nodes with k-1 layers (e.g. parity; see the sketch after this list)
•Highly varying functions can be efficiently represented with deep architectures
–Fewer weights/parameters to update than in a less efficient shallow representation
•Sub-features created in deep architecture can potentially be shared between multiple tasks
–Type of Transfer/Multi-task learning
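As a rough illustration of the parity point above (a back-of-the-envelope size comparison, not the formal proof): n-bit parity built as a chain of two-input XOR units uses n-1 units arranged deeply, while a depth-two, sum-of-products style representation needs one unit per odd-parity input pattern, i.e. 2^(n-1). The helper functions below are hypothetical names for this counting exercise.

```python
# Count the units needed to represent n-bit parity in a deep (XOR-chain)
# vs. shallow (depth-2, one unit per odd-parity minterm) representation.
from itertools import product

def deep_units(n):
    # x1 XOR x2 XOR ... XOR xn as a chain of two-input XOR units: n - 1 units.
    return n - 1

def shallow_units(n):
    # Depth-2 representation: one unit per input pattern with odd parity.
    return sum(1 for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1)

for n in range(2, 11):
    print(f"n={n:2d}  deep={deep_units(n):2d}  shallow={shallow_units(n):4d}")
```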
3. Difficulties of Supervised Training of Deep Networks
–Early layers of MLP do not get trained well
•Diffusion of gradient – the error signal attenuates as it propagates back to earlier layers (illustrated in the sketch after this list)
•Leads to very slow training
•Exacerbated because the top couple of layers can usually learn any task "pretty well" on their own, so the error reaching earlier layers drops quickly as the top layers "mostly" solve the task; lower layers never get the opportunity to use their capacity to improve results
•Need a way for early layers to do effective work
–Often there is not enough labeled data available, while there may be lots of unlabeled data
•Can we use unsupervised/semi-supervised approaches to take advantage of the unlabeled data?
–Deep networks tend to have more local minima problems than shallow networks during supervised training
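A minimal NumPy sketch of the gradient-diffusion effect described above, using hypothetical layer widths, random data, and a small random initialization: it pushes a unit error signal back through a deep sigmoid stack and prints how its norm shrinks at the earlier layers.

```python
# Demonstration with assumed sizes/initialization: the back-propagated error
# signal gets smaller and smaller as it reaches earlier layers.
import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 10, 50
x = rng.normal(size=(width, 1))                     # random input (placeholder)
Ws = [rng.normal(scale=0.1, size=(width, width))    # small random weights
      for _ in range(n_layers)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass, keeping every layer's activation h_l = sigmoid(W_l h_{l-1}).
acts = [x]
for W in Ws:
    acts.append(sigmoid(W @ acts[-1]))

# Backward pass: start with a unit error signal dE/dh at the top and apply
# delta_l = dE/dz_l = (dE/dh_l) * h_l (1 - h_l),   dE/dh_{l-1} = W_l^T delta_l.
grad_h = np.ones((width, 1))
for l in range(n_layers, 0, -1):
    h = acts[l]
    delta = grad_h * h * (1 - h)
    print(f"layer {l:2d}: |dE/dz| = {np.linalg.norm(delta):.2e}")
    grad_h = Ws[l - 1].T @ delta
```

With these assumed settings the signal shrinks by a sizeable constant factor per layer, so after ten layers it is orders of magnitude smaller at the bottom than at the top.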
4. Greedy Layer-Wise Training
1. Train the first layer using unlabeled data
–Could do supervised or semi-supervised training, but usually we just use the larger amount of unlabeled data. Labeled data can also be used, with the labels simply left out.
2. Then freeze the first layer's parameters and train the second layer, using the output of the first layer as the unsupervised input to the second layer
3. Repeat this for as many layers as desired
–This builds our set of robust features
4. Use the outputs of the final layer as inputs to a supervised layer/model and train the last supervised layer(s), leaving the early weights frozen
5. Unfreeze all weights and fine-tune the full network with supervised training, starting from the pretrained weight settings
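Below is a minimal, hedged sketch of steps 1-5 using stacked single-layer autoencoders in PyTorch. The data, layer sizes, learning rates, and iteration counts are placeholders, and the autoencoder is just one possible choice for the unsupervised layer-wise step, not the only one.

```python
# Hedged sketch of greedy layer-wise pretraining: all data, sizes, and
# hyperparameters are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X_unlab = torch.rand(1000, 64)            # unlabeled data (placeholder)
X_lab = torch.rand(200, 64)               # labeled data (placeholder)
y_lab = torch.randint(0, 3, (200,))       # e.g. car / train / motorcycle

sizes = [64, 32, 16]                      # input width, then hidden widths
encoders = []

# Steps 1-3: train one autoencoder per layer on the frozen output of the
# layers below it, using only unlabeled data.
h = X_unlab
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = F.mse_loss(dec(torch.sigmoid(enc(h))), h)   # reconstruct the input
        loss.backward()
        opt.step()
    encoders.append(enc)
    with torch.no_grad():                 # freeze: the next layer sees fixed features
        h = torch.sigmoid(enc(h))

# Step 4: train a supervised head on top of the frozen feature stack.
stack = nn.Sequential(*[nn.Sequential(e, nn.Sigmoid()) for e in encoders])
head = nn.Linear(sizes[-1], 3)
for p in stack.parameters():
    p.requires_grad = False
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(head(stack(X_lab)), y_lab)
    loss.backward()
    opt.step()

# Step 5: unfreeze everything and fine-tune the whole network with labels.
for p in stack.parameters():
    p.requires_grad = True
opt = torch.optim.Adam(list(stack.parameters()) + list(head.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(head(stack(X_lab)), y_lab)
    loss.backward()
    opt.step()
```

The structural point is the same as in the list above: each layer is trained while everything below it is held fixed, and only at the end is the whole network trained jointly on the labeled data.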
•Greedy layer-wise training avoids many of the problems of trying to train a deep net in a supervised fashion
–Each layer gets full learning focus in its turn, since it is the only current "top" layer
–Can take advantage of the unlabeled data
–Once you finally start the supervised training portion, the network weights have already been adjusted so that you are in a good error basin and only need to fine-tune. This helps with the problems of:
•Ineffective early layer learning
•Deep network local minima
5. Unsupervised Learning
•When using unsupervised learning as a pre-processor to supervised learning, you are typically given examples from the same distribution that the later supervised instances will come from
–Assume the examples come from a defined set of possible output classes, but the labels are not available (e.g. images of cars vs. trains vs. motorcycles)
•In Self-Taught Learning we do not require that the later supervised instances come from the same distribution
–(e.g. do self-taught learning with any images, even though later you will do supervised learning with just cars, trains, and motorcycles)
–These types of distributions are more readily available than ones which contain just the classes of interest
–However, if distributions are very different…
•New tasks can share concepts/features learned from existing data, and many tasks can benefit from the statistical regularities in the input distribution
–Note similarities to supervised Multi-task and Transfer learning
•Both approaches are reasonable in deep learning models
Source: http://wenku.baidu.com/view/472ac70c3169a4517723a398.html