2018/11/22 Homework Seven (Grads Only)
https://elearning.mines.edu/courses/9621/assignments/41937

Homework Seven (Grads Only)

Due: Dec 3 by 3pm
Points: 100
Submitting: a file upload
Available: after Oct 31 at 12am

For this assignment, you will be asked to train a deep convolutional neural network from scratch.

We will be learning about many of the concepts needed in this homework assignment in the weeks of 11/12, 11/19, and 11/26. However, you are advised to begin teaching yourself about Neural Networks *now* and refine your work as we cover material. The MATLAB website has numerous tutorials on performing Deep Learning in MATLAB: https://www.mathworks.com/solutions/deep-learning/examples/training-a-model-from-scratch.html

In this project, you will train a deep convolutional network from scratch to recognize scenes. The starter code gives you a very simple network architecture which doesn't work that well and you will add jittering, normalization, regularization, and more layers to increase recognition accuracy to 50, 60, or perhaps 70%. Unfortunately, we only have 1,500 training examples so it doesn't seem possible to train a network from scratch which outperforms hand-crafted features.

Project materials including starter code, training and testing data, and html writeup template: hw7.zip.

You must separately download MatConvNet 1.0 beta 16. (direct link to beta 16). MAKE SURE YOU USE THE BETA 16 VERSION.

MatConvNet isn't precompiled like VLFeat, so we compiled it for you.

Make sure you have the MATLAB Parallel Computing Toolbox installed.

Starter Code Outline

The following is an outline of the stencil code:

hw7.m. The top level function for training a deep network from scratch for scene recognition. If you run this starter code unmodified it will train a simple network that achieves only 25% accuracy (about as good as the tiny images baseline in project 4).
hw7.m calls:
  hw7_setup_data.m. Loads the 15 scene database into MatConvNet imdb format.
  hw7_cnn_init.m. Initializes the convolutional network by specifying the various layers which perform convolution, max pooling, non-linearities, normalization, regularization, and the final loss layer.
  hw7_get_batch() (defined inside hw7.m). This operates on each batch of training images to be passed into the network during training. This is where you can jitter your training data.

The deep network training will be performed by cnn_train.m which in turn calls vl_simplenn.m (http://www.vlfeat.org/matconvnet/mfiles/simplenn/vl_simplenn/) but you will not need to modify those functions for this project.

Part 0

First install MatConvNet and make sure it is working, using the code below. Step through the following MatConvNet Quick start demo. You can simply copy and paste the commands into the Matlab command window.

    % install and compile MatConvNet
    % (you can skip this if you already installed MatConvNet beta 16 and the mex files)
    % untar('http://www.vlfeat.org/matconvnet/download/matconvnet-1.0-beta16.tar.gz') ;
    % cd matconvnet-1.0-beta16
    % run matlab/vl_compilenn

    % download a pre-trained CNN from the web (needed once)
    websave('imagenet-vgg-f.mat', 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-f.mat') ;
    % If you have problems downloading through Matlab,
    % you can simply put the url into a web browser and save the file.
    % Or potentially modify the save location (and adapt below) to save somewhere where you have permissions.

    % setup MatConvNet. Your path might be different.
    run ../../matconvnet-1.0-beta16/matlab/vl_setupnn

    % load the 233MB pre-trained CNN
    net = load('imagenet-vgg-f.mat') ;

    % Fix any compatibility issues with the network
    net = vl_simplenn_tidy(net) ;

    % load and preprocess an image
    im = imread('peppers.png') ;
    im_ = single(im) ; % note: 255 range
    im_ = imresize(im_, net.meta.normalization.imageSize(1:2)) ;
    im_ = im_ - net.meta.normalization.averageImage ;

    % run the CNN
    res = vl_simplenn(net, im_) ;

    % show the classification result
    scores = squeeze(gather(res(end).x)) ;
    [bestScore, best] = max(scores) ;
    figure(1) ; clf ; imagesc(im) ;
    title(sprintf('%s (%d), score %.3f',...
        net.meta.classes.description{best}, best, bestScore)) ;

Troubleshooting. If you encounter errors trying to run this demo, make sure: (1) You have MatConvNet 1.0 beta 16 (not a different version). (2) Your mex files are in the correct location [MatConvNetPath]/matlab/mex/. If you encounter errors about invalid mex files in Windows you may be missing Visual C++ Redistributable Packages (https://www.microsoft.com/en-us/download/details.aspx?id=40784). If you encounter an error about labindex being undefined you may be missing the Parallel Computing Toolbox for Matlab.

Before we start building our own deep convolutional networks, it might be useful to have a look at MatConvNet's tutorial (http://www.robots.ox.ac.uk/~vgg/practicals/cnn/index.html). In particular, you should be able to understand Part 1 of the tutorial. In addition to the examples shown in parts 3 and 4 of the tutorial, MatConvNet has example code (http://www.vlfeat.org/matconvnet/training/) for training networks to recognize the MNIST and CIFAR datasets. Your project follows the same outline as those examples. Feel free to take a look at that code for inspiration. You can run the example code to watch the training process for MNIST and CIFAR.
Training will take about 5 and 15 minutes for those datasets, respectively.

Compiling MatConvNet with GPU support is more complex and not needed for this project. If you're trying to do extra credit and find yourself frustrated with training times you can try, though.

Training a deep network from scratch

For the bits below, make sure you are running your code from the code directory of the hw7 directory; if you get errors, make sure your hw7 directory and /home/tom/Downloads/matconvnet-1.0-beta16 directory are in the same higher-level directory.

Run hw7.m and bask in the glory of deep learning. Gone are the days of hand designed features. Now we have end-to-end learning in which a highly non-linear representation is learned for our data to maximize our objective (in this case, 15-way classification accuracy). Instead of an anemic 70% accuracy we can now recognize scenes with... 25% accuracy. OK, that didn't work at all. What's going on?

First, let's take a look at the network architecture used in this experiment. Here is the code from hw7_cnn_init.m that specifies the network structure:

Let's make sure we understand what's going on here. This simple baseline network has 4 layers -- a convolutional layer, followed by a max pool layer, followed by a rectified linear layer, followed by another convolutional layer. This last convolutional layer might be called a "fully connected" or "fc" layer because its output has a spatial resolution of 1x1. Equivalently, every unit in the output of that layer is a function of the entire previous layer (thus "fully connected"). But mathematically there's not really any difference from convolutional layers so we specify them in the same way in MatConvNet.

Let's look at the first convolutional layer. The "weights" are the filters being learned. They are initialized with random numbers from a Gaussian distribution.
The inputs to randn(9,9,1,10) mean the filters have a 9x9 spatial resolution, span 1 filter depth (because the input images are grayscale), and that there are 10 filters. The network also learns a bias or constant offset to associate with the output of each filter. This is what zeros(1,10) initializes.

The next layer is a max pooling layer. It will take a max over a 7x7 sliding window and then subsample the resulting image / map with a stride of 7. Thus the max pooling layer will decrease the spatial resolution by a factor of 7 according to the stride parameter. The filter depth will remain the same (10). There are other pooling possibilities (e.g. average pooling) but we will only use max pooling in this project.

The next layer is the non-linearity. Any values in the feature map from the max pooling layer which are negative will be set to 0. There are other non-linearity possibilities (e.g. sigmoid) but we will use only rectified linear in this project.

Note that the pool layer and relu layer have no learned parameters associated with them. We are hand-specifying their behavior entirely, so there are no weights to initialize as in the convolutional layers.

Finally, we have the last layer which is convolutional (but might be called "fully connected" because it happens to reduce the spatial resolution to 1x1). The filters learned at this layer operate on the rectified, subsampled, max-pooled filter responses from the first layer. The output of this layer must be 1x1 spatial resolution (or "data size") and it must have a filter depth of 15 (corresponding to the 15 categories of the 15 scene database). This is achieved by initializing the weights with randn(8,8,10,15). 8x8 is the spatial resolution of the filters. 10 is the number of filter dimensions that each of these filters takes as input and 15 is the number of dimensions out.
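The four layers described above, plus the loss layer used only for training, might be written roughly as follows in MatConvNet's simplenn format. This is a sketch reconstructed from the description, not the actual contents of hw7_cnn_init.m; the weight scale f, the stride and pad values of the convolutional layers, and the layer names are assumptions.

```matlab
% Sketch of the baseline network described above (not the actual starter code).
f = 1/100 ;            % assumed scale for the random initial weights
net.layers = {} ;

% conv1: ten 9x9 filters over a 1-channel (grayscale) input, plus 10 biases
net.layers{end+1} = struct('type', 'conv', ...
  'weights', {{f*randn(9,9,1,10, 'single'), zeros(1, 10, 'single')}}, ...
  'stride', 1, 'pad', 0, 'name', 'conv1') ;

% max pooling over 7x7 windows, subsampled with a stride of 7
net.layers{end+1} = struct('type', 'pool', 'method', 'max', ...
  'pool', [7 7], 'stride', 7, 'pad', 0) ;

% rectified linear non-linearity (no learned parameters)
net.layers{end+1} = struct('type', 'relu') ;

% final "fully connected" layer: fifteen 8x8 filters over the 10 input channels
net.layers{end+1} = struct('type', 'conv', ...
  'weights', {{f*randn(8,8,10,15, 'single'), zeros(1, 15, 'single')}}, ...
  'stride', 1, 'pad', 0, 'name', 'fc1') ;

% loss layer, used only during training
net.layers{end+1} = struct('type', 'softmaxloss') ;
```

The number 10 appears three times in this sketch: as the filter count and the bias count of conv1, and as the input depth of the final layer.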
10 is highlighted in green to emphasize that it must be the same in those 3 places -- if the first convolutional layer has weights for 10 filters, it must also have offsets for 10 filters, and the next convolutional layer must take as input 10 filter dimensions.

At the top of our network we add one more layer which is only used for training. This is the "loss" layer. There are many possible loss functions but we will use the "softmax" loss for this project. This loss function will measure how badly the network is doing for any input (i.e. how different its final layer activations are from the ground truth, where ground truth in our case is category membership). The network weights will update, through backpropagation, based on the derivative of the loss function. With each training batch the network weights will take a tiny gradient descent step in the direction that should decrease the loss function (but isn't actually guaranteed to, because the steps are of some finite length, or because dropout regularization will turn off part of the network).

How did we know to make the final layer filters have a spatial resolution of 8x8? It's not obvious because we don't directly specify output resolution. Instead it is derived from the input image resolution and the filter widths, padding, and strides of the previous layers. Luckily MatConvNet provides a visualization function vl_simplenn_display to help us figure this out. Here is what it looks like if we specify the net as shown above and then call vl_simplenn_display(net, 'inputSize', [64 64 1 50]).

If the last convolutional layer had a filter size of 6x6 that would lead to a "data size" in the network visualization of 3x3 and we would know we need to change things (subsample more in previous layers or create wider filters in the final layer). In general it is not at all obvious what the right network architecture is.
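The 8x8 figure can also be checked by hand. The sketch below assumes the usual output-size rule for convolution and pooling layers, floor((in + 2*pad - filter) / stride) + 1, and assumes no padding and a stride of 1 in both convolutional layers (the text does not state these). The helper outSize is a hypothetical name used only for illustration; it is not part of MatConvNet.

```matlab
% Output spatial size of a conv or pool layer (hypothetical helper).
outSize = @(in, filt, stride, pad) floor((in + 2*pad - filt) / stride) + 1 ;

s0 = 64 ;                        % input images are 64x64
s1 = outSize(s0, 9, 1, 0) ;      % 9x9 conv, stride 1, no padding
s2 = outSize(s1, 7, 7, 0) ;      % 7x7 max pool, stride 7 (relu keeps the size)
s3 = outSize(s2, 8, 1, 0) ;      % 8x8 final conv
s3_alt = outSize(s2, 6, 1, 0) ;  % what a 6x6 final conv would leave instead

% prints: 64 -> 56 -> 8 -> 1 (6x6 final filter: 3)
fprintf('%d -> %d -> %d -> %d (6x6 final filter: %d)\n', s0, s1, s2, s3, s3_alt) ;
```

Under these assumptions the final layer sees an 8x8 map, so an 8x8 filter reduces it to 1x1, while a 6x6 filter would leave the 3x3 data size mentioned above.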
It takes a lot of artistry (read as: black magic) to design the right network and training strategy for optimal performance.

We just said the network has 4 real layers but this visualization shows 6. That's because it includes a layer 0 which is the input image and a layer 5 which is the loss layer. For each layer this visualization shows several useful attributes. "data size" is the spatial resolution of the feature maps at each level. In this network and most deep networks this will decrease as you move up the network. "data depth" is the number of channels or filters in each layer. This will tend to increase as you move up a network. "rf size" is the receptive field size. That is how large an area in the original image a particular network unit is sensitive to. This will increase as you move up the network. Finally this visualization shows us that the network has about 10,000 free parameters, the vast majority of them associated with the last convolutional layer.

OK, now we understand a bit about the network. Let's analyze its performance. After 30 training epochs (30 passes through the training data) Matlab's Figure 1 should look like this:

(Note: your plots might look slightly different due to updates in MATLAB.)

We'll be studying these figures quite a bit during this project so it's important to understand what they show.

The left pane shows the training loss (blue) and validation loss (dashed orange) across training epochs. Each training epoch is a pass over the entire training set of 1500 images broken up into batches of 50 training instances. The code shuffles the order of the training instances randomly each epoch. When the network makes mistakes, it incurs a loss and backpropagation updates the weights of the network in a direction that should decrease the loss. Therefore the blue line should more or less decrease monotonically.
On the other hand, the orange line is the loss incurred on the held out test set. The figure refers to it as "val" or "validation". In a realistic recognition scenario we might have three sets of data: train, validation, and test. We would use validation to assess how well our training is working and to know when to stop training and then we would test on a completely held out test set. For this project the validation set is our test set. We're trying to maximize performance on the validation set and that's it. The pass through the validation set does not change the network weights in any way. The pass through the validation set is also about 3 times faster than the training pass because it does not have the "backwards" pass to update network weights.

The middle pane shows the top 1 error on the train and test (val) data sets across the same training epochs -- how often the highest scoring guess is wrong. We're interested in top 1 error, specifically the top 1 error on the held out validation / test set.

The right pane shows top 5 error -- how often all of the 5 highest scoring guesses are wrong. We're not as worried about this metric.

In this experiment, the training and test top 1 error started out around 93% which is exactly what we would expect. If you have 15 categories and you make a random guess on each test case, you will be wrong 14 times out of 15, or about 93% of the time. As the training progressed and the network weights moved away from their random initialization, accuracy increased.

Note the areas circled in green corresponding to the first 8 training epochs. During these epochs, the training and validation error were decreasing which is exactly what we want to see. Beyond that point the error on the training dataset kept decreasing, but the validation error did not.
Our lowest error on the validation/test set is around 75% (or 25% accuracy). We are overfitting to our training data. This is hard to avoid with a small training set. In fact, if we let this experiment run for 200 epochs we see that it is possible for the training accuracy to become perfect with no appreciable increase in test accuracy:

Now we are going to take several steps to improve the performance of our convolutional network. The modifications we make will familiarize you with the building blocks of deep learning that can lead to impressive accuracy with enough training data. With the relatively small amount of training data in the 15 scene database, it is very hard to outperform hand-crafted features.

Learning rate. Before we start making changes, there is a very important learning parameter that you might need to tune any time you change the network or the data being input to the network. The learning rate (set by default as opts.LearningRate = 0.0001 in hw7.m) determines the size of the gradient descent steps taken by the network weights. If things aren't working, try making it much smaller or larger (e.g. by factors of 10). If the objective remains exactly constant over the first dozen epochs, the learning rate might have been too high and "broken" some aspect of the network. If the objective spikes or even becomes NaN then the learning rate may also be too large. However, a very small learning rate requires many training epochs.

Problem 1: We don't have enough training data. Let's "jitter".

If you left-right flip (mirror) an image of a scene, it never changes categories. A kitchen doesn't become a forest when mirrored. This isn't true in all domains -- a "d" becomes a "b" when mirrored, so you can't "jitter" digit recognition training data in the same way.
But we can synthetically increase our amount of training data by left-right mirroring training images during the learning process.

The learning process calls getBatch() in hw7.m each time it wants training or testing images. Modify getBatch() to randomly flip some of the images (or entire batches). Useful functions: rand and fliplr.

You can try more elaborate forms of jittering -- zooming in a random amount, rotating a random amount, taking a random crop, etc. Mirroring helps quite a bit on its own, though, and is easy to implement. You should see a 5% to 10% increase in accuracy (or drop in top 1 validation error) by adding mirroring.

After you implement mirroring, you should notice that your training error doesn't drop as quickly. That's actually a good thing, because it means the network isn't overfitting to the 1,500 original training images as much (because it sees 3,000 training images now, although they're not as good as 3,000 truly independent samples). Because the training and test errors fall more slowly, you may need more training epochs or you may try modifying the learning rate.

Problem 2: The images aren't zero-centered.

One simple trick which can help a lot is to subtract the mean from every image. Modify hw7_setup_data.m so that it computes the mean image and then subtracts the mean from all images before returning imdb. It would arguably be more proper to only compute the mean from the training images (since the test/validation images should be strictly held out) but it won't make much of a difference. After doing this you should see another 15% or so increase in accuracy. Most of this increase will show up in the first few iterations.

Problem 3: Our network isn't regularized.

If you train your network (especially for more than the default number of epochs) you'll see that the training error can decrease to zero while the val top1 error hovers at 40% to 50%.
The network has learned weights which can perfectly recognize the training data, but those weights don't generalize to held out test data. The best regularization would be more training data but we don't have that. Instead we will use dropout regularization. We add a dropout layer to our convolutional net as follows:

What does dropout regularization do? It randomly turns off network connections at training time to fight overfitting. This prevents a unit in one layer from relying too strongly on a single unit in the previous layer. Dropout regularization can be interpreted as simultaneously training many "thinned" versions of your network. At test time, all connections are restored which is analogous to taking an average prediction over all of the "thinned" networks. You can see a more complete discussion of dropout regularization in this paper (https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf).

The dropout layer has only one free parameter -- the dropout rate -- the proportion of connections that are randomly deleted. The default of 0.5 should be fine. Insert a dropout layer between your convolutional layers. In particular, insert it directly before your last convolutional layer. Your test accuracy should increase by another 10%. Your train accuracy should decrease much more slowly. That's to be expected -- you're making life much harder for the training algorithm by cutting out connections randomly.

If you increase the number of training epochs (and maybe decrease the learning rate) you should be able to achieve 55% test accuracy (45% top1 val error) or slightly better at this point. Notice how much more structured the learned filters are at this point compared to the initial network before we made improvements:

Problem 4: Our network isn't "deep".

Let's take a moment to reflect on what our convolutional network is actually doing.
We learn filters which seem to be looking for horizontal edges, vertical edges, and parallel edges. Some of the filters have diagonal orientations and some seem to be looking for high frequencies or center-surround. This learned filter bank is applied to each input image, the maximum response from each 7x7 block is taken by the max pooling, and then the rectified linear layer zeros out negative values. The fully connected layer sees a 10 channel image with 8x8 spatial resolution. It learns 15 linear classifiers (a linear filter with a learned threshold is basically a linear classifier) on this 8x8 filter response map. This architecture is reminiscent of hand-crafted features like the gist scene descriptor (http://people.csail.mit.edu/torralba/code/spatialenvelope/) developed precisely for scene recognition (on 8 scene categories which would later be incorporated into the 15 scene database). The gist descriptor actually works better than our learned feature. The gist descriptor with a non-linear classifier can achieve 74.7% accuracy on the 15 scene database.

Our convolutional network to this point isn't "deep". It has two layers with learned weights. Contrast this with the example networks for MNIST and CIFAR in MatConvNet which contain 4 and 5 layers, respectively. AlexNet and VGG-F contain 8 layers. The VGG "very deep" networks (http://www.robots.ox.ac.uk/~vgg/research/very_deep/) contain 16 and 19 layers. ResNet (https://arxiv.org/abs/1512.03385) contains up to 152 layers.

One quite unsatisfying aspect of our current network architecture is that the max-pooling operation covers a window of 7x7 and then is subsampled with a stride of 7. That seems overly lossy and deep networks usually do not subsample by more than a factor of 2 or 3 each layer.

Let's make our network deeper by adding an additional convolutional layer in hw7_cnn_init.m.
In fact, we probably don't want to add just a convolutional layer, but another max-pool layer and relu layer, as well. For example, you might insert a convolutional layer after the existing relu layer with a 5x5 spatial support followed by a max-pool over a 3x3 window with a stride of 2. You can reduce the max-pool window in the previous layer, adjust padding, and reduce the spatial resolution of the final layer until vl_simplenn_display(net, 'inputSize', [64 64 1 50]), which is called at the end of hw7_cnn_init(), shows that your network's final layer (not counting the softmax) has a data size of 1 and a data depth of 15. You also need to make sure that the data depth output by any layer matches the data depth input to the following layer. For instance, maybe your new convolutional layer takes in the 10 channels of the first layer but outputs 15 channels. The final layer would then need to have its weights initialized accordingly to account for the fact that it operates on a 15 channel image instead of a 10 channel image.

Training deeper networks is tricky. The networks are slower to train and more sensitive to initialization and learning rate. For now, try to add one or two more blocks of "conv / pool / relu" and see if you can simply match your performance of the previous section.

Problem 5: Our "deep" network is slow to train and brittle.

You might have noticed that your deeper network doesn't seem to learn very reasonable filters in the first layer. It is harder for the gradients to pass from the last layer all the way to the first in a deeper architecture. Normalization can help. In particular, let's add a batch normalization (https://arxiv.org/abs/1502.03167) layer after each convolutional layer except for the last. So if you have 4 total convolutional layers we will add 3 batch normalization layers.

Add the following code to hw7_cnn_init(). You can call this function to add batch normalization to your conv layers after you've built your network.
Be careful with the layer indexing, because calling net = insertBnorm(net, layer_index) will add a new layer and thus shift the index of later convolutional layers.

    % --------------------------------------------------------------------
    function net = insertBnorm(net, l)
    % --------------------------------------------------------------------
    assert(isfield(net.layers{l}, 'weights'));
    ndim = size(net.layers{l}.weights{1}, 4);
    layer = struct('type', 'bnorm', ...
                   'weights', {{ones(ndim, 1, 'single'), zeros(ndim, 1, 'single')}}, ...
                   'learningRate', [1 1 0.05], ...
                   'weightDecay', [0 0]) ;
    net.layers{l}.weights{2} = [] ; % eliminate bias in previous conv layer
    net.layers = horzcat(net.layers(1:l), layer, net.layers(l+1:end)) ;

Batch normalization by itself won't necessarily increase accuracy, but it will allow you to use much higher learning rates. Try increasing your learning rate by a factor of 10 or even 100 and now you can rapidly explore and train different network architectures. Notice how the first layer filters start to show structure quickly with batch normalization.

We leave it up to you to determine the specifics of your deeper network: number of layers, number of filters in each layer, size of filters in each layer, padding, max-pooling, stride, dropout factor, etc. It is not required that your deeper network increases accuracy over the shallow network (but it can with the right hyperparameters). As long as you can achieve 50% test accuracy for some epoch with a deeper network which uses mirroring to jitter, zero-centers the images as they are loaded, and regularizes the network with a dropout layer you will receive full credit. Try to keep the total training time under 10 minutes. You can achieve high accuracy in 100 epochs and in less than 10 minutes.

Additional optional improvements

Enjoy chasing higher accuracy?
Here are some optional directions to investigate which might help improve your accuracy.

If you look at MatConvNet's ImageNet examples you can see that the learning rate isn't constant during training. You can specify learning rate as opts.learningRate = logspace(-3, -5, 120) to have it change from .001 to .00001 over 120 training epochs, for instance. This form of learning rate schedule can improve performance slightly.

You can try increasing the filter depth of the network. The example networks for MNIST, CIFAR, and ImageNet have 20, 32, and 64 filters in the first layer and it tends to increase as you go up the network.

The MNIST, CIFAR, and ImageNet examples in MatConvNet show numerous advanced strategies: use of normalization layers, variable learning rate per layer (the two elements of the per-layer learning rate in cnn_cifar_init.m are the relative learning rates for the filters and the bias terms), use of average pooling instead of max pooling for some layers, initializing weights with distributions other than randn, more dramatic jittering, etc.

The more free parameters your network has the more prone to overfitting it is. Multiple dropout layers can help fight back against this, but will slow down training considerably.

One obvious limitation of our network is that it operates on 64x64 images when the scene images are generally closer to 256x256. We're definitely losing valuable texture information by working at low resolution. Luckily, it's not necessarily slow to work with the higher resolution images if you put a greater-than-one stride in your first convolutional layer. The VGG-F network adopts this strategy. You can see in cnn_imagenet_init.m that its first layer uses 11x11 filters with a stride of 4. This is 1/16th as many evaluations as a stride of 1.

The images can be normalized more strongly (e.g. making them have unit standard deviation) but this did not help in my experiments.

You can try alternate loss layers at the top of your network. E.g.
net.layers{end+1} = struct('name', 'hinge loss', 'type', 'loss', 'loss', 'mhinge') for hinge loss.

You can train the VGG-F network from scratch on the 15 scene database. You can call cnn_imagenet_init.m to get a randomly initialized VGG-F and train it just like your other networks. It works better than I would expect considering how little training data we have.

The best accuracy you can expect to achieve is 67.6% in 100 epochs, without any data augmentation beyond mirroring and without changing the input resolution.

Write up

For this project, submit a PDF project report. In the report you will describe your algorithm and any decisions you made to write your algorithm a particular way. Then you will show and discuss the results of your algorithm. We suggest showing results plots (Matlab figure 1) and filter visualization (Matlab figure 2) as needed. In addition, submit an archive containing all your code for this assignment. Do not hand in any networks, training data, or MatConvNet itself! You only need to hand in 3 source files: hw7.m, hw7_cnn_init.m, and hw7_setup_data.m. You can of course hand in any helper functions you created.

All turn-in files should be compiled into a single archive (e.g., .zip).

Credits: The design of this homework assignment comes from Georgia Tech's CV course.