Karen Simonyan & Andrew Zisserman
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting.
A number of attempts have been made to improve the original architecture of Krizhevsky et al. in a bid to achieve better accuracy:
1. ZFNet (Zeiler & Fergus) and OverFeat (Sermanet et al.) utilised a smaller receptive window size and a smaller stride in the first convolutional layer.
2. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales.
3. This paper: depth, i.e. steadily increasing the number of layers while using very small (3×3) convolution filters.
We note that none of our networks contain Local Response Normalisation (LRN); as will be shown in Sect. 4, such normalisation does not improve performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.
batch size: 256
momentum: 0.9
weight decay (L2): 5e-4
dropout on the first two FC layers: 0.5
learning rate: 0.01, decreased by a factor of 10 when the validation set accuracy stopped improving; training was stopped after 370K iterations
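A rough PyTorch-style reconstruction of these settings (a sketch only; the paper used a modified C++ Caffe, and torchvision's `vgg16` stands in for the actual configuration here):

```python
# Hypothetical reconstruction of the training hyper-parameters listed above
# (illustrative only; not the paper's Caffe configuration).
import torch
from torchvision.models import vgg16

model = vgg16()  # torchvision's VGG-16; its classifier already uses dropout 0.5

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # initial learning rate
    momentum=0.9,       # momentum
    weight_decay=5e-4,  # L2 weight decay
)

# Decrease the learning rate by a factor of 10 when validation accuracy plateaus;
# inside the validation loop one would call scheduler.step(val_accuracy).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1
)
```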
pre-initialisation: configuration A is trained from random initialisation; the deeper configurations are then initialised with A's first four convolutional layers and last three FC layers (the intermediate layers are initialised randomly). It is, however, possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio:
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, volume 9, pp. 249–256, 2010.
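For reference, a minimal sketch of the Glorot (Xavier) uniform initialisation for a fully connected weight matrix (my own illustration; PyTorch exposes the same scheme as `torch.nn.init.xavier_uniform_`):

```python
import math
import torch

def glorot_uniform_(weight: torch.Tensor) -> torch.Tensor:
    """Fill a 2-D weight matrix with U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    fan_out, fan_in = weight.shape[0], weight.shape[1]
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return weight.data.uniform_(-a, a)

# Example: initialise a 4096 -> 1000 fully connected layer's weights.
w = torch.empty(1000, 4096)
glorot_uniform_(w)
```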
Training Image size:
Let S be the smallest side of the isotropically rescaled training image.
We consider two approaches for setting the training scales S:
1. fixed S: two scales are evaluated, S = 256 and S = 384.
2. multi-scale training: each training image is individually rescaled by randomly sampling S from a range [Smin, Smax] (Smin = 256, Smax = 512).
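A small sketch of this multi-scale rescale-and-crop step (my own illustration using torchvision; the 224×224 crop size and the 256–512 range follow the paper):

```python
import random
from torchvision import transforms

S_MIN, S_MAX = 256, 512   # range for the rescaled smallest image side S

def jittered_crop(img):
    """Rescale so the smallest side is a randomly sampled S, then crop 224x224."""
    S = random.randint(S_MIN, S_MAX)          # sample S per training image
    img = transforms.Resize(S)(img)           # smallest side becomes S
    img = transforms.RandomCrop(224)(img)     # random fixed-size training crop
    img = transforms.RandomHorizontalFlip()(img)
    return transforms.ToTensor()(img)
```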
Testing:
1. dense evaluation: the FC layers are first converted to convolutional layers (the first FC layer to a 7×7 conv layer, the last two FC layers to 1×1 conv layers); the resulting fully convolutional net is applied over the whole (uncropped) image, producing a class score map that is spatially averaged, as shown in the sketch after this list.
2. multi-crop evaluation: using a large set of crops, as done by Szegedy et al., can lead to improved accuracy; however, we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy.
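A minimal sketch of the FC-to-convolution conversion mentioned in item 1 (my own reconstruction, assuming a VGG-style conv stack that yields a 512×7×7 feature map for a 224×224 input; layer names are mine):

```python
import torch.nn as nn

# Original classifier head, trained on fixed 224x224 crops.
fc6 = nn.Linear(512 * 7 * 7, 4096)
fc7 = nn.Linear(4096, 4096)
fc8 = nn.Linear(4096, 1000)

# Equivalent convolutional head: fc6 becomes a 7x7 conv, fc7/fc8 become 1x1 convs.
conv6 = nn.Conv2d(512, 4096, kernel_size=7)
conv7 = nn.Conv2d(4096, 4096, kernel_size=1)
conv8 = nn.Conv2d(4096, 1000, kernel_size=1)

# Copy the trained weights by reshaping. On 224x224 inputs the two heads compute
# the same function, but the conv head also accepts larger images, yielding a
# spatial class-score map that can then be spatially averaged.
conv6.weight.data = fc6.weight.data.view(4096, 512, 7, 7)
conv7.weight.data = fc7.weight.data.view(4096, 4096, 1, 1)
conv8.weight.data = fc8.weight.data.view(1000, 4096, 1, 1)
for conv, fc in ((conv6, fc6), (conv7, fc7), (conv8, fc8)):
    conv.bias.data = fc.bias.data
```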
Implementation: C++ Caffe toolbox (with significant modifications).
multi-GPU training exploits data parallelism: each batch of training images is split into several GPU batches that are processed in parallel, one per GPU; the resulting per-GPU gradients are then averaged to obtain the gradient of the full batch.
More sophisticated methods of speeding up ConvNet training have been proposed, which employ model and data parallelism for different layers of the net, but the conceptually simpler scheme above already provides a substantial speedup on an off-the-shelf 4-GPU system.
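A minimal sketch of the gradient-averaging idea (my own illustration, processing the "GPU batches" sequentially on one device; this is not the paper's multi-GPU Caffe code):

```python
# Split a batch into sub-batches, compute gradients on each, and combine them
# so that the accumulated gradient equals the full-batch gradient.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)            # stand-in for a ConvNet
loss_fn = nn.CrossEntropyLoss()     # default reduction: mean over the sub-batch
x = torch.randn(256, 10)            # full batch of 256 samples, as in the paper
y = torch.randint(0, 2, (256,))

model.zero_grad()
for xc, yc in zip(x.chunk(4), y.chunk(4)):   # 4 "GPU batches"
    # Weighting each sub-batch loss by its share of the batch makes the summed
    # (accumulated) gradients equal to the average over the full batch.
    loss = loss_fn(model(xc), yc) * (xc.size(0) / x.size(0))
    loss.backward()                  # gradients accumulate in .grad
# model parameters' .grad now hold the full-batch gradient
```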
We now assess the effect of scale jittering at test time.
The results indicate that scale jittering at test time leads to better performance.
dense & multi-crop evaluation (see Sect. 3.2 for details): using multiple crops performs slightly better than dense evaluation, and the two approaches are complementary, since their combination outperforms each of them. We hypothesise that this is due to a different treatment of convolution boundary conditions.
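A toy sketch of how these test-time predictions can be combined, i.e. by averaging soft-max class posteriors over several test scales and evaluation modes (my own illustration; `image_at_scale` is an assumed helper that returns a preprocessed batch at scale Q):

```python
import torch

def multiscale_posterior(model, image_at_scale, scales=(256, 384, 512)):
    """Average soft-max class posteriors over several test scales Q."""
    probs = [torch.softmax(model(image_at_scale(Q)), dim=1) for Q in scales]
    return torch.stack(probs).mean(dim=0)

# Dense and multi-crop posteriors can be fused the same way:
# combined = 0.5 * (dense_probs + multicrop_probs)
```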
depth is important in visual representations
Localisation: we adopt the approach of OverFeat.
The last fully connected layer predicts the bounding box location instead of the class scores.
There is a choice of whether the bounding box prediction is shared across all classes (SCR, single-class regression) or is class-specific (PCR, per-class regression). In the former case the last layer is 4-D, while in the latter it is 4000-D (for 1000 classes).
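A minimal sketch of the two regression heads' output dimensionalities (my own illustration, attached to a 4096-D penultimate feature as in the paper):

```python
import torch.nn as nn

num_classes = 1000
scr_head = nn.Linear(4096, 4)                # single-class regression: one shared box
pcr_head = nn.Linear(4096, 4 * num_classes)  # per-class regression: 4000-D output
```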
we did not use the multiple pooling offsets technique of OverFeat
PCR outperforms SCR
fine-tuning all layers is noticeably better than fine-tuning only the FC layers
Testing at several scales and combining the predictions of multiple networks further improves the performance.