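I've recently been working on my graduation project: pedestrian detection based on deep learning. I decided to use Caffe, the most popular deep learning framework at the moment, and to train on the lab's Ubuntu server.
I had never used Linux or a command-line system like Ubuntu before, so I ran into quite a few problems along the way. I'm recording them here for my own reference and for anyone else who might find them useful.
I started with the simplest example on the Caffe website, training LeNet on the MNIST handwritten digit dataset: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
The Caffe installation guides I found online looked like a lot of work. Luckily a senior labmate had already installed Caffe on the server, so I just copied his caffe folder into my home directory and it worked right away (that saved me a day or two of work; thanks, sj!).
Then I followed the tutorial at the link above step by step.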
~$ cd caffe-master
~/caffe-master$ ./data/mnist/get_mnist.sh
Then the first problem appeared: -bash: ./data/mnist/get_mnist.sh: Permission denied. The script does not have execute permission yet, so add it first:
~/caffe-master$ chmod +x ./data/mnist/get_mnist.sh
Running ~/caffe-master$ ./data/mnist/get_mnist.sh again, the script now starts and prints:
Downloading...
--2016-03-13 15:07:17-- http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Resolving yann.lecun.com (yann.lecun.com)... 128.122.47.89
Connecting to yann.lecun.com (yann.lecun.com)|128.122.47.89|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9912422 (9.5M) [application/x-gzip]
Saving to: “train-images-idx3-ubyte.gz”
100%[=========================================================================================================================================>] 9,912,422 10.9KB/s in 9m 54s
2016-03-13 15:17:14 (16.3 KB/s) - “train-images-idx3-ubyte.gz” saved [9912422/9912422]
--2016-03-13 15:17:14-- http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Resolving yann.lecun.com (yann.lecun.com)... 128.122.47.89
Connecting to yann.lecun.com (yann.lecun.com)|128.122.47.89|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28881 (28K) [application/x-gzip]
Saving to: “train-labels-idx1-ubyte.gz”
100%[=========================================================================================================================================>] 28,881 37.3KB/s in 0.8s
2016-03-13 15:17:25 (37.3 KB/s) - “train-labels-idx1-ubyte.gz” saved [28881/28881]
--2016-03-13 15:17:25-- http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Resolving yann.lecun.com (yann.lecun.com)... 128.122.47.89
Connecting to yann.lecun.com (yann.lecun.com)|128.122.47.89|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1648877 (1.6M) [application/x-gzip]
Saving to: “t10k-images-idx3-ubyte.gz”
100%[=========================================================================================================================================>] 1,648,877 35.7KB/s in 97s
2016-03-13 15:19:13 (16.6 KB/s) - “t10k-images-idx3-ubyte.gz” saved [1648877/1648877]
--2016-03-13 15:19:13-- http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Resolving yann.lecun.com (yann.lecun.com)... 128.122.47.89
Connecting to yann.lecun.com (yann.lecun.com)|128.122.47.89|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4542 (4.4K) [application/x-gzip]
Saving to: “t10k-labels-idx1-ubyte.gz”
100%[=========================================================================================================================================>] 4,542 --.-K/s in 0s
2016-03-13 15:19:13 (12.9 MB/s) - “t10k-labels-idx1-ubyte.gz” saved [4542/4542]
Unzipping...
gzip: train-images-idx3-ubyte already exists; do you wish to overwrite (y or n)? y
gzip: train-labels-idx1-ubyte already exists; do you wish to overwrite (y or n)? y
gzip: t10k-images-idx3-ubyte already exists; do you wish to overwrite (y or n)? y
gzip: t10k-labels-idx1-ubyte already exists; do you wish to overwrite (y or n)? y
Done.
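To sanity-check the files the script just unpacked, one could peek at the header of each IDX file. The sketch below is not part of the Caffe example. It assumes the IDX layout described on the MNIST page (big-endian 32-bit fields: a magic number, 2051 for images and 2049 for labels, followed by the item count and, for images, the rows and columns). The function name read_idx_header is made up for illustration, and the header bytes are synthetic rather than read from disk.

```python
import struct

def read_idx_header(raw):
    """Parse the header of an MNIST IDX file given its first bytes.

    Assumed layout: big-endian uint32 magic, then one uint32 per dimension.
    """
    magic, = struct.unpack(">I", raw[:4])
    ndim = magic & 0xFF  # the low byte of the magic number is the dimension count
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    return magic, dims

# Synthetic header shaped like train-images-idx3-ubyte (no pixel data attached).
fake_header = struct.pack(">IIII", 2051, 60000, 28, 28)
magic, dims = read_idx_header(fake_header)
print(magic, dims)
```

On the real files, passing the first 16 bytes of data/mnist/train-images-idx3-ubyte (opened in binary mode) should work the same way, assuming the download was not corrupted.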
Then run the next command:
./examples/mnist/create_mnist.sh
It prints the following:
Creating lmdb...
./examples/mnist/create_mnist.sh: 16: ./examples/mnist/create_mnist.sh: build/examples/mnist/convert_mnist_data.bin: Permission denied
./examples/mnist/create_mnist.sh: 18: ./examples/mnist/create_mnist.sh: build/examples/mnist/convert_mnist_data.bin: Permission denied
Done.
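Note the message build/examples/mnist/convert_mnist_data.bin: Permission denied. The convert_mnist_data.bin file has no execute permission either, and the script prints Done. even though the conversion failed. Add the execute permission and run the script again: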
~/caffe-master$ chmod +x build/examples/mnist/convert_mnist_data.bin
~/caffe-master$ ./examples/mnist/create_mnist.sh
Creating lmdb...
Done.
At this point MNIST has been downloaded from the website and converted into the LMDB format, so the preparation is complete.
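Both Permission denied errors so far came from a missing execute bit on files copied from another user. As a rough illustration of what chmod +x does, here is a small Python sketch. The helper name ensure_executable is hypothetical, it assumes a POSIX system, and it runs on a throwaway temporary file rather than on the real Caffe tree.

```python
import os
import stat
import tempfile

def ensure_executable(path):
    """Add the execute bits (roughly `chmod +x`) if the owner cannot run the file."""
    mode = os.stat(path).st_mode
    if not mode & stat.S_IXUSR:
        os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return bool(os.stat(path).st_mode & stat.S_IXUSR)

# Demonstrate on a temporary script instead of touching caffe-master.
with tempfile.NamedTemporaryFile(suffix=".sh", delete=False) as tmp:
    tmp.write(b"#!/bin/sh\necho hello\n")
demo_path = tmp.name
os.chmod(demo_path, 0o644)  # start without execute permission
was_executable = bool(os.stat(demo_path).st_mode & stat.S_IXUSR)
now_executable = ensure_executable(demo_path)
os.remove(demo_path)
print(was_executable, now_executable)
```

In practice the plain chmod +x commands shown above are all that is needed; the sketch only makes the permission check explicit.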
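Before training and testing the model, here is a rough overview of LeNet. As shown in the figure below, it consists of a convolutional layer followed by a downsampling (pooling) layer, then another convolutional layer and another pooling layer, and finally two fully connected layers. The version in the Caffe example differs from the original LeNet in that it uses ReLU (Rectified Linear Unit) activations in place of sigmoid.
The layers of LeNet are defined in $CAFFE_ROOT/examples/mnist/lenet_train_test.prototxt.
To view the definitions, run ~/caffe-master$ vi ./examples/mnist/lenet_train_test.prototxt.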
name: "LeNet"   # the network is called LeNet
layer {
  name: "mnist"   # the data layer is called mnist
  type: "Data"    # the layer type is Data
  top: "data"     # outputs two blobs, data and label
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625   # multiply by 1/256 so the output values lie in [0,1)
  }
  data_param {
    source: "examples/mnist/mnist_train_lmdb"   # read the data from here
    batch_size: 64   # 64 images per batch
    backend: LMDB
  }
}
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "examples/mnist/mnist_test_lmdb"
    batch_size: 100   # 100 images per batch
    backend: LMDB
  }
}
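The scale value in the data layers is easy to check by hand. The following sketch only redoes that arithmetic on raw 8-bit pixel values (0 to 255); scale_pixel is a made-up helper name, not a Caffe function.

```python
SCALE = 0.00390625  # value of transform_param.scale in both data layers

def scale_pixel(value):
    """Multiply a raw 8-bit pixel (0..255) by the scale factor."""
    return value * SCALE

print(SCALE == 1 / 256)
print(scale_pixel(0), scale_pixel(255))
```

The largest pixel value, 255, maps to 0.99609375, which is why the range is half-open: [0,1).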
Next come the first convolutional layer and the first pooling layer:
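layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"   # takes the data blob passed up from the layer below as input
  top: "conv1"     # writes its output to the blob conv1
  param {
    lr_mult: 1   # lr = learning rate; multiplier for the weights
  }
  param {
    lr_mult: 2   # the bias learning rate is twice that of the weights
  }
  convolution_param {
    num_output: 20   # 20 output channels
    kernel_size: 5   # 5x5 convolution kernel
    stride: 1        # convolution stride of 1
    weight_filler {
      type: "xavier"   # the xavier algorithm picks the initialization range automatically from the number of input and output neurons
    }
    bias_filler {
      type: "constant"   # initialize the bias to a constant, 0 by default
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"   # the layer type is Pooling
  bottom: "conv1"   # input is the conv1 blob
  top: "pool1"      # output is the pool1 blob
  pooling_param {
    pool: MAX        # downsample by taking the maximum
    kernel_size: 2   # take the maximum over each 2x2 region
    stride: 2        # stride of 2, so neighboring regions do not overlap
  }
}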
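To see what these two layers do to the blob shape, the sizes can be worked out with the usual output-size formulas. This is a sketch under assumptions: the input is taken to be a single-channel 28x28 MNIST image, the convolution size is rounded down, and the pooling size is rounded up (which is how I understand Caffe to behave; for these numbers both roundings agree). The function names are my own.

```python
import math

def conv_output_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution (rounded down)."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_output_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a pooling layer (rounded up, assumed Caffe behavior)."""
    return math.ceil((size + 2 * pad - kernel) / stride) + 1

input_size = 28       # assumption: MNIST digits are 28x28 with one channel
input_channels = 1
conv1 = conv_output_size(input_size, kernel=5, stride=1)
pool1 = pool_output_size(conv1, kernel=2, stride=2)
# 20 filters of 5x5 over one input channel, plus one bias per filter
conv1_params = 20 * input_channels * 5 * 5 + 20
print((20, conv1, conv1), (20, pool1, pool1), conv1_params)
```

So conv1 turns each image into 20 feature maps of 24x24, and pool1 halves them to 12x12.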
The second convolutional layer and pooling layer are similar, so I won't repeat them. Next come the two fully connected layers:
layer {
  name: "ip1"
  type: "InnerProduct"   # a fully connected layer is called InnerProduct in Caffe
  bottom: "pool2"        # input is the pool2 blob
  top: "ip1"             # output is the ip1 blob
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 500   # 500 output neurons
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"   # giving input and output the same blob name lets the element-wise ReLU run in place, which saves memory
  top: "ip1"
}
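The in-place trick works because ReLU looks at one element at a time, so each output value can overwrite its own input. A tiny plain-Python sketch of the idea (not Caffe's implementation; the activation values are made up):

```python
def relu_inplace(blob):
    """Apply max(0, x) to every element, overwriting the input list."""
    for i, x in enumerate(blob):
        blob[i] = x if x > 0 else 0.0
    return blob

ip1 = [-1.5, 0.0, 2.0, -0.25, 3.5]  # made-up values standing in for the ip1 blob
result = relu_inplace(ip1)
print(result is ip1, ip1)
```

No second buffer is allocated: result and ip1 are the same list.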
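Then comes another fully connected layer, this time with only 10 outputs, one per digit. After that come the Loss layer and the Accuracy layer (the latter is only used in the TEST phase):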
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"     # takes the prediction from the fully connected layer and the label from the data layer as inputs
  bottom: "label"
  top: "loss"
}
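For intuition, here is what a softmax loss computes for a single sample: turn the 10 scores into probabilities, then take the negative log of the probability assigned to the true label. This is a hand-written sketch, not Caffe's code, and the scores are made up. As far as I understand, Caffe also averages this value over the batch.

```python
import math

def softmax_with_loss(scores, label):
    """Return (probabilities, loss) for one sample: softmax, then -log p[label]."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, -math.log(probs[label])

scores = [0.0] * 10  # made-up ip2 output: no preference among the 10 digits
probs, loss = softmax_with_loss(scores, label=3)
print(round(loss, 4))  # a uniform prediction over 10 classes gives ln(10)
```

A network that has learned nothing therefore starts near a loss of 2.3, and the loss approaches 0 as the score of the correct digit dominates.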
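This layer feeds no further layer. It only computes the value of the loss function and reports it when backpropagation starts. That completes the network definition.
One more thing to note: when a layer definition contains a rule like the following,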
layer {
  # ...layer definition...
  include: { phase: TRAIN }
}
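it means the layer is only part of the network during the TRAIN phase and is left out during the TEST phase. Layers without such a rule are always part of the network. That is why the Data layer appears twice in the definition above, with different batch sizes: once for TRAIN and once for TEST. There is also an Accuracy layer that exists only in the TEST phase. It reports the accuracy each time the test net runs, which the solver configuration sets to every 500 training iterations (test_interval: 500).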
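The include rule can be pictured as a simple filter over the list of layers. The sketch below uses a hypothetical, simplified representation (a list of dicts with an optional include_phase key), not Caffe's real data model, and it renames the two data layers so they can be told apart (in the actual file both are called mnist).

```python
def layers_for_phase(layers, phase):
    """Keep layers without an include rule plus layers whose rule matches the phase."""
    return [layer["name"] for layer in layers
            if layer.get("include_phase") in (None, phase)]

# Hypothetical stand-in for the parsed prototxt; only a few layers are listed.
layers = [
    {"name": "mnist_train", "include_phase": "TRAIN"},
    {"name": "mnist_test", "include_phase": "TEST"},
    {"name": "conv1"},
    {"name": "accuracy", "include_phase": "TEST"},
    {"name": "loss"},
]
train_layers = layers_for_phase(layers, "TRAIN")
test_layers = layers_for_phase(layers, "TEST")
print(train_layers)
print(test_layers)
```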
Next, run ~/caffe-master$ vi ./examples/mnist/lenet_solver.prototxt to see how the MNIST solver is configured:
# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU
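The "inv" learning rate policy in this file decays the rate smoothly as training goes on. According to the comments in caffe.proto, the rate at a given iteration is base_lr * (1 + gamma * iter) ^ (-power). The sketch below just plugs in the values from the solver file; the function name is my own.

```python
BASE_LR = 0.01   # base_lr
GAMMA = 0.0001   # gamma
POWER = 0.75     # power

def inv_learning_rate(iteration):
    """Learning rate under the "inv" policy: base_lr * (1 + gamma * iter) ** (-power)."""
    return BASE_LR * (1 + GAMMA * iteration) ** (-POWER)

for it in (0, 5000, 9900):
    print(it, round(inv_learning_rate(it), 8))
```

The value for iteration 9900 agrees with the lr = 0.00596843 line that Caffe prints near the end of the training log.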
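This is the training configuration. Each test pass runs 100 batches of 100 images, which covers all 10,000 test images, and a test pass happens every 500 training iterations. The base learning rate is 0.01, training runs for 10,000 iterations (with the training batch size of 64 set in the data layer), and the computation uses the GPU. Because the workload here is small, the GPU's speed advantage is not very noticeable; with larger networks and training sets it becomes much more pronounced. After all this explanation we still haven't run the actual training. Enter the following on the command line: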
~/caffe-master$ chmod +x ./examples/mnist/train_lenet.sh
~/caffe-master$ ./examples/mnist/train_lenet.sh
./examples/mnist/train_lenet.sh: 3: ./examples/mnist/train_lenet.sh: ./build/tools/caffe: Permission denied (the caffe binary has no execute permission either, so the next line adds it first)
~/caffe-master$ chmod +x ./build/tools/caffe
~/caffe-master$ ./examples/mnist/train_lenet.sh
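Now training and testing really start. For a dataset the size of MNIST, training should normally finish within a few minutes. Here are the last few lines of the output: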
I0313 19:42:17.450871 4595 solver.cpp:243] Iteration 9900, loss = 0.00442233
I0313 19:42:17.450918 4595 solver.cpp:259] Train net output #0: loss = 0.00442246 (* 1 = 0.00442246 loss)
I0313 19:42:17.450938 4595 solver.cpp:590] Iteration 9900, lr = 0.00596843
I0313 19:42:18.987057 4595 solver.cpp:468] Snapshotting to binary proto file examples/mnist/lenet_iter_10000.caffemodel
I0313 19:42:19.001791 4595 solver.cpp:753] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_10000.solverstate
I0313 19:42:19.011090 4595 solver.cpp:327] Iteration 10000, loss = 0.00261991
I0313 19:42:19.011143 4595 solver.cpp:347] Iteration 10000, Testing net (#0)
I0313 19:42:19.923885 4595 solver.cpp:415] Test net output #0: accuracy = 0.9913
I0313 19:42:19.923949 4595 solver.cpp:415] Test net output #1: loss = 0.0260926 (* 1 = 0.0260926 loss)
I0313 19:42:19.923969 4595 solver.cpp:332] Optimization Done.
I0313 19:42:19.923982 4595 caffe.cpp:215] Optimization Done.
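That completes the run, with a final test accuracy of 0.9913. On the lab server it took roughly 174 seconds. The trained weights are saved in examples/mnist/lenet_iter_10000.caffemodel, and the solver state (which can be used to resume training) is saved in examples/mnist/lenet_iter_10000.solverstate.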
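If you want the numbers rather than the raw log, the relevant lines are easy to pull out with regular expressions. The sketch below works on three lines copied from the output above; the patterns are my own guess at a convenient match for this log format, not an official parser.

```python
import re

LOG_TAIL = """\
I0313 19:42:17.450871 4595 solver.cpp:243] Iteration 9900, loss = 0.00442233
I0313 19:42:19.011090 4595 solver.cpp:327] Iteration 10000, loss = 0.00261991
I0313 19:42:19.923885 4595 solver.cpp:415] Test net output #0: accuracy = 0.9913
"""

# Map iteration number to the training loss reported at that iteration.
train_loss = {int(m.group(1)): float(m.group(2))
              for m in re.finditer(r"Iteration (\d+), loss = ([\d.eE+-]+)", LOG_TAIL)}
accuracy = float(re.search(r"accuracy = ([\d.]+)", LOG_TAIL).group(1))
print(train_loss, accuracy)
```

An accuracy of 0.9913 means that about 87 of the 10,000 test digits were misclassified.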