For installing and configuring Ubuntu 16.04, CUDA 9.0, cuDNN v7.1, OpenCV 3.4.0, Anaconda3, and MATLAB 2017a, see my earlier blog posts.
This post goes straight to installing and configuring Caffe.
General dependencies
sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler
sudo apt-get install --no-install-recommends libboost-all-dev
Then install:
sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev # Ubuntu-specific packages
sudo apt-get install libatlas-dev
sudo apt-get install liblapack-dev
sudo apt-get install libatlas-base-dev
git clone https://github.com/BVLC/caffe.git
cd caffe
cp Makefile.config.example Makefile.config # copy the example build configuration
Next, edit Makefile.config. Open it from the caffe directory:
sudo gedit Makefile.config
Make the following changes in Makefile.config:
1. Enable cuDNN
Change
#USE_CUDNN := 1
to:
USE_CUDNN := 1
2. Set the OpenCV version
Change
#OPENCV_VERSION := 3
to:
OPENCV_VERSION := 3
3. Set the CUDA path
Change
CUDA_DIR := /usr/local/cuda
to
CUDA_DIR := /usr/local/cuda-9.0
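4. Edit CUDA_ARCH
Comment out or delete the following two lines, because CUDA 9.0 no longer supports the compute_20 architecture:
-gencode arch=compute_20,code=sm_20 \
-gencode arch=compute_20,code=sm_21 \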
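5. Set the BLAS library
Change the BLAS setting to the line below. This assumes Intel MKL is installed; otherwise keep the default atlas, which the libatlas packages installed above provide.
BLAS := mkl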
6. Set the MATLAB path
Change it to
MATLAB_DIR := /usr/local/MATLAB/R2017a
7. Configure Python
Comment out the Python 2 settings:
#PYTHON_INCLUDE := /usr/include/python2.7 \
# /usr/lib/python2.7/dist-packages/numpy/core/include
Then configure the following:
# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
ANACONDA_HOME := $(HOME)/anaconda3
PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
$(ANACONDA_HOME)/include/python3.6m \
$(ANACONDA_HOME)/lib/python3.6/site-packages/numpy/core/include
# Uncomment to use Python 3 (default is Python 2)
PYTHON_LIBRARIES := boost_python3 python3.6m
PYTHON_INCLUDE := /usr/include/python3.6m \
/usr/lib/python3.6/dist-packages/numpy/core/include \
/home/zya/anaconda3/include/python3.6m
# We need to be able to find libpythonX.X.so or .dylib.
#PYTHON_LIB := /usr/lib
PYTHON_LIB := $(ANACONDA_HOME)/lib \
$(ANACONDA_HOME)/pkgs/python-3.6.5-hc3d631a_2/lib
8. Enable the Python layer interface
Change
#WITH_PYTHON_LAYER := 1
to
WITH_PYTHON_LAYER := 1
9. An important one
Below the line # Whatever else you find you need goes here., change
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib
to:
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu /usr/lib/x86_64-linux-gnu/hdf5/serial
This is needed because Ubuntu 16.04 moved some header and library locations, in particular those of hdf5, so these paths have to be updated.
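Several of the steps above (1, 2 and 8) amount to uncommenting a single line in Makefile.config. Purely as an illustration, that kind of edit could be scripted in Python. The helper name enable_setting and the sample text are assumptions made for this sketch; they are not part of Caffe.

```python
import re

def enable_setting(config_text, name):
    """Uncomment a line such as '#USE_CUDNN := 1' so it reads 'USE_CUDNN := 1'."""
    # Match '#', optional spaces/tabs, then 'NAME := ...' on a single line.
    pattern = re.compile(
        r"^#[ \t]*(" + re.escape(name) + r"[ \t]*:=.*)$", re.MULTILINE
    )
    return pattern.sub(r"\1", config_text)

# Hypothetical excerpt of Makefile.config used only for this demonstration.
sample = "#USE_CUDNN := 1\n#OPENCV_VERSION := 3\nCUDA_DIR := /usr/local/cuda\n"
for key in ("USE_CUDNN", "OPENCV_VERSION"):
    sample = enable_setting(sample, key)
print(sample)
```

Lines that are already active, such as CUDA_DIR here, are left untouched, so running the helper twice is harmless.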
Next, edit the Makefile in the caffe directory.
Change:
NVCCFLAGS += -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
to:
NVCCFLAGS += -D_FORCE_INLINES -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
Change:
LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_hl hdf5
to:
LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_serial_hl hdf5_serial opencv_core opencv_imgproc opencv_imgcodecs opencv_highgui
Then edit /usr/local/cuda-9.0/include/crt/host_config.h.
Change
#error -- unsupported GNU version! gcc versions later than 6 are not supported!
to
//#error -- unsupported GNU version! gcc versions later than 6 are not supported!
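Errors you may hit during installation
error while loading shared libraries: libpython3.6m.so.1.0 not found
Find the library with locate libpython3.6m.so.1.0.
On my machine it turned out to be in anaconda3/lib.
Open the linker configuration:
sudo gedit /etc/ld.so.conf
and add the following line, then run sudo ldconfig so the change takes effect:
/home/zya/anaconda3/lib/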
After make runtest, two tests may fail:
[ FAILED ] 2 tests, listed below:
[ FAILED ] BatchReindexLayerTest/2.TestGradient, where TypeParam = N5caffe9GPUDeviceIfEE
[ FAILED ] BatchReindexLayerTest/3.TestGradient, where TypeParam = N5caffe9GPUDeviceIdEE
The fix described at https://github.com/BVLC/caffe/issues/6164 resolved this for me; I summarize it here:
vim Makefile
Then search for NVCCFLAGS with / until you reach this section:
...
# Debugging
ifeq ($(DEBUG), 1)
COMMON_FLAGS += -DDEBUG -g -O0
NVCCFLAGS += -G
else
COMMON_FLAGS += -DNDEBUG -O2
endif
...
Change it to:
...
# Debugging
ifeq ($(DEBUG), 1)
COMMON_FLAGS += -DDEBUG -g -O0
NVCCFLAGS += -G
else
COMMON_FLAGS += -DNDEBUG -O2
NVCCFLAGS += -G
endif
...
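In other words, add the line NVCCFLAGS += -G to the else branch.
After rebuilding, all the tests should pass.
You may also see: /usr/lib/x86_64-linux-gnu/libunwind.so.8: undefined reference to `lzma_index_size@XZ_5.0'. To fix this, add the library paths. From your home directory, run:
sudo gedit ~/.bashrc
Add these lines to the file: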
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
Then run:
sudo ldconfig # make the change take effect immediately
You may also see "ld cannot find -lboost_python3". In that case, create a symbolic link to libboost_python-py35.so.
The steps, following "cannot find -lboost_python3" when using Python3 Ubuntu16.04, are:
cd /usr/lib/x86_64-linux-gnu
sudo ln -s libboost_python-py35.so libboost_python3.so
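OK, now you can build. In the caffe directory, run:
sudo make all -j12
sudo make test -j12
If any earlier configuration or installation step went wrong, the build will fail in all sorts of ways, so work through the steps above carefully.
Once the build succeeds, run the tests:
sudo make runtest -j12
Finally, build the Python and MATLAB interfaces:
sudo make pycaffe -j12
sudo make matcaffe -j12
Last, add Caffe's Python interface to the Python path:
sudo gedit /etc/profile
and add: export PYTHONPATH=/home/zya/caffe/python${PYTHONPATH:+:${PYTHONPATH}}
Now start python in a terminal and run: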
import caffe
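If the import fails, the cause is usually protobuf. The protobuf installed in the General dependencies step (sudo apt-get install libprotobuf-dev protobuf-compiler) is the system-wide one, version 2.6.1, but the default Python environment here is Anaconda3, which has no protobuf installed, hence the error.
To fix it, run conda install -c https://conda.anaconda.org/anaconda protobuf to install protobuf into Anaconda3.
If the install fails with a permission error (the whole anaconda3 directory shows a small lock icon), I unlocked it with sudo chmod -R 777 /home/zya/anaconda3. (A less permissive alternative is to give your user ownership: sudo chown -R $USER /home/zya/anaconda3.)
The install then went through; the installed protobuf version is 3.6.0.
After that, import caffe succeeded.
One lingering risk: the next time you build Caffe, the two protobuf versions conflict and the build fails with protobuf errors. For now, the only fix I know is to uninstall protobuf from Anaconda3 again.
The uninstall commands are: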
conda uninstall libprotobuf
conda uninstall protobuf
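Note: be sure to uninstall both libprotobuf and protobuf.
After the build succeeds, Python may report that the protobuf module cannot be found when you run your code. In that case, run
conda install -c https://conda.anaconda.org/anaconda protobuf
again to put the protobuf module back.
If the conflict returns the next time you build Caffe, uninstall and reinstall once more. Tedious, I know.
In short, the fix for this build problem is to uninstall the conflicting protobuf and libprotobuf from the Python environment.
If you need to remove the system-wide protobuf, use the commands below. The build is usually more likely to succeed with the system version, so it is generally best left installed; if you do remove it, you will usually need to reinstall it with sudo apt-get install libprotobuf-dev protobuf-compiler.
sudo apt-get remove libprotobuf-dev
sudo apt-get remove protobuf-compiler
A few commands for checking protobuf versions:
List the paths where protoc is installed:
whereis protoc
Show which protoc is invoked by default:
which protoc
Show the version of the default protoc:
protoc --version
Show information about the protobuf package installed via pip (for me this reports the version in Anaconda):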
pip show protobuf
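Since the trouble above comes down to which modules the active Python can see, a quick check from inside the interpreter can help. The following is a minimal sketch using only the standard library; the helper name module_available is made up for this example. It only checks that a module can be located, not that it imports cleanly, so a broken shared library would still go unnoticed.

```python
import importlib.util

def module_available(name):
    """Return True if `name` can be located on the current Python path."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. `google`) raises instead of returning None.
        return False

for name in ("caffe", "google.protobuf"):
    print(name, "found" if module_available(name) else "missing")
```

If caffe is reported missing, check PYTHONPATH first; if google.protobuf is missing, the conda install step above applies.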
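Corrections and discussion are welcome!
Addendum: I have heard that if you hit a protobuf version problem when rebuilding, reinstalling the system protobuf may fix it.
First remove protobuf:
sudo apt-get remove libprotobuf-dev
sudo apt-get remove protobuf-compiler
Then reinstall: sudo apt-get install libprotobuf-dev protobuf-compiler
Feel free to try it if you are interested.
Next, Caffe ships with a tool for drawing network diagrams, ./python/draw_net.py, which renders a model as an image. For example: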
~/caffe# python/draw_net.py examples/mnist/lenet_train_test.prototxt examples/mnist/lenet_train_test.png
At first, though, you will usually hit ModuleNotFoundError: No module named 'pydotplus' or ModuleNotFoundError: No module named 'pydot'.
Fix: conda install pydotplus
Running it again then fails with:
pydotplus.graphviz.InvocationException: GraphViz's executables not found
Fix:
conda install graphviz
After that, the diagram is drawn successfully.
Finally, install pydot as well while you are at it:
conda install pydot
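Caffe provides a set of official examples for learning; see caffe/examples.
This MNIST LeNet walkthrough follows the official tutorial.
In the bundled examples, Caffe keeps raw data under ./data and the processed data and model files under ./examples. The MNIST dataset lives in ./data/mnist, and the matching model and configuration files are in ./examples/mnist.
First change to the Caffe root directory ($CAFFE_ROOT):
cd ~/caffe
Download the MNIST dataset:
# run the get_mnist.sh script
./data/mnist/get_mnist.sh
Here is what the script does (gedit get_mnist.sh):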
#!/usr/bin/env sh
# This scripts downloads the mnist data and unzips it.
DIR="$( cd "$(dirname "$0")" ; pwd -P )"
cd "$DIR"
echo "Downloading..."
for fname in train-images-idx3-ubyte train-labels-idx1-ubyte t10k-images-idx3-ubyte t10k-labels-idx1-ubyte
do
if [ ! -e $fname ]; then
wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
gunzip ${fname}.gz
fi
done
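The script downloads four files from http://yann.lecun.com/exdb/mnist/${fname}.gz in turn: train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte, and t10k-labels-idx1-ubyte.
The download takes a little while; each file is unzipped with gunzip once it has been fetched.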
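The downloaded files use the IDX format: a big-endian header made of two zero bytes, a data-type byte, a dimension-count byte, and then one 32-bit size per dimension. As a small sketch of how to sanity-check a download, the function below parses such a header. The name parse_idx_header is an assumption for this example, and the bytes are synthetic, shaped like the t10k files (10,000 test images of 28x28 pixels) rather than read from disk.

```python
import struct

def parse_idx_header(raw):
    """Parse the header of an uncompressed IDX file held in `raw` bytes.

    Returns (magic, dims), where dims is a tuple of dimension sizes.
    """
    # Bytes 0-1 are zero, byte 2 is the data type code, byte 3 the dimension count.
    zero, _dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX file")
    magic = struct.unpack(">I", raw[:4])[0]
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    return magic, dims

# Synthetic headers: magic 2051 marks an image file, 2049 a label file.
fake_images = struct.pack(">IIII", 2051, 10000, 28, 28)
fake_labels = struct.pack(">II", 2049, 10000)
print(parse_idx_header(fake_images))
print(parse_idx_header(fake_labels))
```

To check a real file, read its first 16 bytes in binary mode and pass them to the function.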
Caffe does not read these files directly; they must first be converted to LMDB.
Use the create_mnist.sh script to convert the data:
./examples/mnist/create_mnist.sh
Here is what this script does:
#!/usr/bin/env sh
# This script converts the mnist data into lmdb/leveldb format,
# depending on the value assigned to $BACKEND.
set -e
EXAMPLE=examples/mnist
DATA=data/mnist
BUILD=build/examples/mnist
BACKEND="lmdb"
echo "Creating ${BACKEND}..."
rm -rf $EXAMPLE/mnist_train_${BACKEND}
rm -rf $EXAMPLE/mnist_test_${BACKEND}
$BUILD/convert_mnist_data.bin $DATA/train-images-idx3-ubyte \
$DATA/train-labels-idx1-ubyte $EXAMPLE/mnist_train_${BACKEND} --backend=${BACKEND}
$BUILD/convert_mnist_data.bin $DATA/t10k-images-idx3-ubyte \
$DATA/t10k-labels-idx1-ubyte $EXAMPLE/mnist_test_${BACKEND} --backend=${BACKEND}
echo "Done."
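As you can see, the conversion is done by the ./build/examples/mnist/convert_mnist_data.bin tool; I won't dig into it here.
The dataset is now ready and stored under ./examples/mnist/ as mnist_train_lmdb and mnist_test_lmdb.
Caffe model definitions are files ending in .prototxt. The LeNet definition that ships with Caffe is ./examples/mnist/lenet_train_test.prototxt; let's open it and take a look.
Data input layers: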
name: "LeNet"
layer {
name: "mnist" # this layer is named mnist
type: "Data" # layer type
top: "data" # top is an output blob; this layer outputs two blobs
top: "label"
include {
phase: TRAIN # only active in the training phase
}
transform_param {
scale: 0.00390625 # pixel scaling factor (1/256 = 0.00390625)
}
data_param {
source: "examples/mnist/mnist_train_lmdb" # path to the data source
batch_size: 64 # batch size
backend: LMDB # database backend
}
}
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST # loaded during testing
}
transform_param {
scale: 0.00390625
}
data_param {
source: "examples/mnist/mnist_test_lmdb"
batch_size: 100
backend: LMDB
}
}
The data layers are straightforward: in both the TRAIN and TEST phases they read the data and output the data and label blobs.
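The scale value in transform_param is worth a quick check: 0.00390625 is exactly 1/256, so raw pixel values in 0..255 are mapped into the range [0, 1). A tiny sketch of that arithmetic follows; the function name scale_pixel is made up for illustration.

```python
SCALE = 0.00390625  # value of transform_param.scale in the data layers

def scale_pixel(value):
    """Apply the data layer's scaling to one raw pixel value (0..255)."""
    return value * SCALE

print(SCALE == 1 / 256)                  # the factor is an exact power of two
print(scale_pixel(0), scale_pixel(255))  # smallest and largest scaled values
```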
Next come the model's convolution and pooling layers:
layer {
name: "conv1"
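type: "Convolution" # convolution layer
bottom: "data"
top: "conv1"
param {
lr_mult: 1 # learning-rate multiplier for the weights
}
param {
lr_mult: 2 # learning-rate multiplier for the bias; 2 tends to converge more easily
}
convolution_param {
num_output: 20 # number of output feature maps, i.e. number of kernels
kernel_size: 5 # kernel size
stride: 1 # stride
weight_filler {
type: "xavier" # weight initialization method
}
bias_filler {
type: "constant" # bias initialization; constant fills with 0 by default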
}
}
}
layer {
name: "pool1"
type: "Pooling" # pooling layer
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX # max pooling
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
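To see how these four layers shrink the input, the spatial sizes can be worked out by hand. The sketch below uses the usual sliding-window formula with no padding. The helper names are assumptions for this example, and the 28x28 input size is taken from the "output data size: 64,1,28,28" line in the training log. Every division is exact for LeNet, so the rounding mode of the formula does not matter here.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Output width/height of a conv or pooling window sliding over `size` pixels."""
    return (size + 2 * pad - kernel) // stride + 1

def lenet_feature_shapes(input_size=28):
    """Return [(layer, channels, size), ...] for conv1..pool2 of the net above."""
    shapes = []
    size = input_size
    for name, channels, kernel, stride in [
        ("conv1", 20, 5, 1),
        ("pool1", 20, 2, 2),
        ("conv2", 50, 5, 1),
        ("pool2", 50, 2, 2),
    ]:
        size = out_size(size, kernel, stride)
        shapes.append((name, channels, size))
    return shapes

for name, channels, size in lenet_feature_shapes():
    print(f"{name}: {channels} x {size} x {size}")

# Number of values per image that leave pool2.
_, channels, size = lenet_feature_shapes()[-1]
print("pool2 outputs per image:", channels * size * size)
```

The final product, 50 x 4 x 4 = 800, is the number of inputs each image contributes to the first InnerProduct layer.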
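Those are the convolutional layers used for feature extraction. Next come the fully connected (FC) layers used for classification: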
layer {
name: "ip1"
type: "InnerProduct" # fully connected (FC) layer
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU" # activation function
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
The FC layers output the classification scores; what remains is computing accuracy and loss:
layer {
name: "accuracy"
type: "Accuracy" # reports accuracy
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss" # softmax and the multinomial logistic loss
bottom: "ip2"
bottom: "label"
top: "loss"
}
Caffe ships with a drawing tool, ./python/draw_net.py, that renders a model as a diagram. (To use it, first run make pycaffe in the caffe directory.)
Draw this model with the tool:
~/caffe# python/draw_net.py examples/mnist/lenet_train_test.prototxt examples/mnist/lenet_train_test.png
When defining a layer, you can specify rules for when it is included in the network. The template looks like this:
layer {
# ... layer definition ...
include {
phase: TRAIN
}
}
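This is the template for layer rules, which control whether a layer is part of the network in a given phase. See ./src/caffe/proto/caffe.proto for more details.
In the example above, most layers have no rule, so by default they are always part of the network. Note that the accuracy layer is used only in the TEST phase; it is evaluated every 500 training iterations (test_interval) over 100 test batches (test_iter), as set in lenet_solver.prototxt.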
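The effect of the include rule can be pictured as a filter over the layer list. The sketch below is a deliberately simplified model, not Caffe's implementation: it only looks at phase, whereas the rules defined in caffe.proto have further fields. The names LENET_LAYERS and layers_in_phase are assumptions for this example.

```python
# (name, phase) pairs for the net above; None means the layer has no include rule.
LENET_LAYERS = [
    ("mnist", "TRAIN"), ("mnist", "TEST"),
    ("conv1", None), ("pool1", None), ("conv2", None), ("pool2", None),
    ("ip1", None), ("relu1", None), ("ip2", None),
    ("accuracy", "TEST"), ("loss", None),
]

def layers_in_phase(layers, phase):
    """Keep layers without a rule, plus layers whose rule names `phase`."""
    return [name for name, rule in layers if rule is None or rule == phase]

print(layers_in_phase(LENET_LAYERS, "TRAIN"))
print(layers_in_phase(LENET_LAYERS, "TEST"))
```

Each phase keeps exactly one of the two mnist data layers, and accuracy appears only in the TEST list.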
With the model structure defined, the next step is to set the training parameters.
See ./examples/mnist/lenet_solver.prototxt:
# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing (and report accuracy) every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU
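Two of these settings are easier to grasp with numbers. With lr_policy "inv", Caffe computes the learning rate as base_lr * (1 + gamma * iter) ^ (-power), and one test pass covers test_iter batches of the test batch size. The following is a small sketch of both calculations using the values from this solver file; the function name inv_lr is an assumption for the example.

```python
def inv_lr(iteration, base_lr=0.01, gamma=0.0001, power=0.75):
    """Learning rate under the "inv" policy: base_lr * (1 + gamma * iter) ** (-power)."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)

# The rate starts at base_lr and decays smoothly until max_iter.
for it in (0, 2500, 5000, 7500, 10000):
    print(f"iter {it:5d}: lr = {inv_lr(it):.8f}")

# One test pass: test_iter forward passes over TEST batches of 100 images each.
test_iter, test_batch_size = 100, 100
print("images per test pass:", test_iter * test_batch_size)
```

The second figure matches the comment in the solver file about covering the full 10,000 test images.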
Caffe provides a training script, ./examples/mnist/train_lenet.sh. Here is its content:
#!/usr/bin/env sh
set -e
./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt $@
As you can see, it calls ./build/tools/caffe train and passes the solver file via --solver=examples/mnist/lenet_solver.prototxt.
Running it prints the training log:
I1213 17:37:21.999351 30925 layer_factory.hpp:77] Creating layer mnist
I1213 17:37:21.999413 30925 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb
I1213 17:37:21.999428 30925 net.cpp:84] Creating Layer mnist
I1213 17:37:21.999433 30925 net.cpp:380] mnist -> data
I1213 17:37:21.999445 30925 net.cpp:380] mnist -> label
I1213 17:37:22.000012 30925 data_layer.cpp:45] output data size: 64,1,28,28
I1213 17:37:22.000969 30925 net.cpp:122] Setting up mnist
I1213 17:37:22.000979 30925 net.cpp:129] Top shape: 64 1 28 28 (50176)
I1213 17:37:22.000982 30925 net.cpp:129] Top shape: 64 (64)
...
I1213 17:37:29.454346 30925 solver.cpp:447] Snapshotting to binary proto file examples/mnist/lenet_iter_5000.caffemodel
I1213 17:37:29.459178 30925 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_5000.solverstate
I1213 17:37:29.460712 30925 solver.cpp:330] Iteration 5000, Testing net (#0)
I1213 17:37:29.512395 30934 data_layer.cpp:73] Restarting data prefetching from start.
I1213 17:37:29.513818 30925 solver.cpp:397] Test net output #0: accuracy = 0.9882
...
I1213 17:37:36.706809 30925 solver.cpp:447] Snapshotting to binary proto file examples/mnist/lenet_iter_10000.caffemodel
I1213 17:37:36.710286 30925 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_10000.solverstate
I1213 17:37:36.712179 30925 solver.cpp:310] Iteration 10000, loss = 0.00240246
I1213 17:37:36.712193 30925 solver.cpp:330] Iteration 10000, Testing net (#0)
I1213 17:37:36.765053 30934 data_layer.cpp:73] Restarting data prefetching from start.
I1213 17:37:36.766742 30925 solver.cpp:397] Test net output #0: accuracy = 0.9913
I1213 17:37:36.766758 30925 solver.cpp:397] Test net output #1: loss = 0.0275297 (* 1 = 0.0275297 loss)
I1213 17:37:36.766762 30925 solver.cpp:315] Optimization Done.
I1213 17:37:36.766764 30925 caffe.cpp:259] Optimization Done.
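Training status, including the learning rate, loss, and accuracy, is printed at regular intervals. Because snapshot is set to 5000, a snapshot is also saved every 5000 iterations.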
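When the log gets long, it can be handy to pull out just the test accuracy per iteration. The following is a minimal sketch that does this with regular expressions. The function name extract_test_accuracy is made up for this example, and the sample text is copied from the log lines above.

```python
import re

LOG = """\
I1213 17:37:29.460712 30925 solver.cpp:330] Iteration 5000, Testing net (#0)
I1213 17:37:29.513818 30925 solver.cpp:397] Test net output #0: accuracy = 0.9882
I1213 17:37:36.712193 30925 solver.cpp:330] Iteration 10000, Testing net (#0)
I1213 17:37:36.766742 30925 solver.cpp:397] Test net output #0: accuracy = 0.9913
"""

def extract_test_accuracy(log_text):
    """Pair each 'Iteration N, Testing net' line with the accuracy line that follows it."""
    results, current_iter = [], None
    for line in log_text.splitlines():
        match = re.search(r"Iteration (\d+), Testing net", line)
        if match:
            current_iter = int(match.group(1))
            continue
        match = re.search(r"accuracy = ([0-9.]+)", line)
        if match and current_iter is not None:
            results.append((current_iter, float(match.group(1))))
    return results

print(extract_test_accuracy(LOG))
```

In practice you would redirect the training output to a file and pass its contents to the function instead of the LOG string.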