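TensorFlow/Keras version (project: https://github.com/matterport).
The server had CUDA 8.0. I tried upgrading to 9.1, failed, and rolled back to 8.0. Python was py36 at first, which gave the error "libcublas.so.8.0: cannot open shared object file: No such file or directory". After a lot of fiddling (downgrading tensorflow/keras and so on) nothing helped; in the end, downgrading py36 to py35 made it work. In my tests, every version below py36 was compatible with cuda-8. What follows is a record of the pitfalls I hit while training Mask-RCNN on my own data. Ha, I am used to writing Scala for big data, and the sudden switch to deep learning is a bit hard to digest.
Some steps:
Install Anaconda (Python 3.4+)
1. Download the installer from the Tsinghua mirror: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/
2. Go to the folder containing the downloaded file and run: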
bash Anaconda2-5.0.1-Linux-x86_64.sh
3. List the existing environments
conda env list
Create a new environment:
conda create --name py35 python=3.5
source activate py35
Required packages:
numpy
scipy
Pillow
cython
matplotlib
scikit-image
tensorflow>=1.3.0
keras>=2.0.8
opencv-python
h5py
imgaug
IPython
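To check an environment against the two version floors in the list above, a small helper like the following can be used. This is only a sketch: the function names are hypothetical, and the parsing assumes plain dotted version strings such as 1.4.1.

```python
def version_tuple(version):
    """Turn '1.4.1' into (1, 4, 1); trailing non-digits such as 'rc0' are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (shorter versions are zero-padded)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimum versions taken from the package list above.
MINIMUMS = {"tensorflow": "1.3.0", "keras": "2.0.8"}


def report(installed_versions, minimums=MINIMUMS):
    """Map each known package name to True/False depending on its version."""
    return {
        name: meets_minimum(installed_versions[name], minimum)
        for name, minimum in minimums.items()
        if name in installed_versions
    }
```

The installed versions can be read from tensorflow.__version__ and keras.__version__ and passed in as a dict.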
Installing tensorflow with plain pip is too slow; you can download a wheel and install it instead.
Pick a version here: https://mirrors.tuna.tsinghua.edu.cn/help/tensorflow/
pip install \
-i https://pypi.tuna.tsinghua.edu.cn/simple/ \
https://mirrors.tuna.tsinghua.edu.cn/tensorflow/linux/gpu/tensorflow_gpu-1.4.1-cp35-cp35m-linux_x86_64.whl
How to install CUDA:
sudo dpkg -i cuda-repo-ubuntu1604-9-1-local_9.1.85-1_amd64.deb
sudo apt-key add /var/cuda-repo-9-1-local/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda   # or, to upgrade: sudo apt-get upgrade cuda
Add to .bashrc:
export CUDA_HOME=/usr/local/cuda   # /usr/local/cuda is a symlink to cuda-8.0
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Print the CUDA version:
nvcc -V
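The libcublas.so.8.0 error mentioned at the top means the dynamic loader could not open that library. A quick way to test this from inside the active Python environment is sketched below; the helper name is hypothetical, and the library names in the usage note are examples for CUDA 8.0 that should be adjusted to the installed version.

```python
import ctypes


def missing_libraries(names, loader=ctypes.CDLL):
    """Return the entries of `names` that the dynamic loader cannot open.

    `loader` is injectable so the function can be exercised without CUDA;
    by default ctypes.CDLL is used, which raises OSError on failure.
    """
    missing = []
    for name in names:
        try:
            loader(name)
        except OSError:
            missing.append(name)
    return missing
```

For example, missing_libraries(["libcublas.so.8.0", "libcudart.so.8.0"]) returns an empty list when both libraries can be found through LD_LIBRARY_PATH.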
(0)
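###################################--(Project workflow)--##########################################
(1) Resize the images to one uniform size
(2) Annotate the images with labelme to get the JSON annotation files
(3) Generate the dataset files: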
(py35) byz@ubuntu:~/Mask/Mask_RCNN/net_img$ python json_to_dataset.py rgb_5.json
# Move the generated files to the target directories
(py35) byz@ubuntu:~/Mask/Mask_RCNN/net_img$ mv rgb_*_json json/
(py35) byz@ubuntu:~/Mask/Mask_RCNN/net_img$ mv rgb*.jpg rgb/
(4) Convert the mask label label.png from 16-bit to 8-bit and save it under ./mask/
(py35) byz@ubuntu:~/Mask/Mask_RCNN/net_img$ python 16_to_8.py
(5) Run the program
(py35) byz@ubuntu:~/Mask/Mask_RCNN$ python train_rcnn_test.py
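Step (3) above converts one JSON file per invocation. With many annotation files it may be easier to loop over them; the following is a minimal sketch that assumes the files are named rgb_*.json and that json_to_dataset.py sits in the working directory. The function names are hypothetical.

```python
import glob
import os
import subprocess
import sys


def build_commands(json_dir, script="json_to_dataset.py"):
    """Build one command per rgb_*.json file in json_dir, sorted by file name."""
    pattern = os.path.join(json_dir, "rgb_*.json")
    return [[sys.executable, script, path] for path in sorted(glob.glob(pattern))]


def run_all(json_dir, script="json_to_dataset.py"):
    """Run the conversion script once per JSON file; stops at the first failure."""
    for cmd in build_commands(json_dir, script):
        subprocess.check_call(cmd)
```

Calling run_all(".") from ~/Mask/Mask_RCNN/net_img would then produce one rgb_N_json folder per annotation file.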
Problems encountered:
(I)
###################--(Building the dataset needed for training: label.png, info.yaml)--#########################
byz@ubuntu:~/app/coco/tmp$ labelme_json_to_dataset rgb_1.jpg
Problem:
/home/byz/anaconda3/lib/python3.6/site-packages/labelme/cli/json_to_dataset.py:14: UserWarning: This script is aimed to demonstrate how to convert the
JSON file to a single image dataset, and not to handle
multiple JSON files to generate a real-use dataset.
warnings.warn("This script is aimed to demonstrate how to convert the\n"
Traceback (most recent call last):
File "/home/byz/anaconda3/bin/labelme_json_to_dataset", line 11, in <module>
sys.exit(main())
File "/home/byz/anaconda3/lib/python3.6/site-packages/labelme/cli/json_to_dataset.py", line 33, in main
data = json.load(open(json_file))
File "/home/byz/anaconda3/lib/python3.6/json/__init__.py", line 296, in load
return loads(fp.read(),
File "/home/byz/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Solution:
Write my own json_to_dataset.py and pass it the .json file. (Byte 0xff at position 0 is the first byte of a JPEG file: the command above was given rgb_1.jpg, which cannot be parsed as JSON.)
import argparse
import json
import os
import os.path as osp
import warnings

import PIL.Image
import yaml

from labelme import utils


def main():
    warnings.warn("This script is aimed to demonstrate how to convert the\n"
                  "JSON file to a single image dataset, and not to handle\n"
                  "multiple JSON files to generate a real-use dataset.")

    parser = argparse.ArgumentParser()
    parser.add_argument('json_file')
    parser.add_argument('-o', '--out', default=None)
    args = parser.parse_args()

    json_file = args.json_file

    # Default output directory: rgb_5.json -> rgb_5_json next to the input file
    if args.out is None:
        out_dir = osp.basename(json_file).replace('.', '_')
        out_dir = osp.join(osp.dirname(json_file), out_dir)
    else:
        out_dir = args.out
    if not osp.exists(out_dir):
        os.mkdir(out_dir)

    with open(json_file) as f:
        data = json.load(f)
    print("json---load")
    img = utils.img_b64_to_arr(data['imageData'])

    # Assign consecutive integer values to label names; 0 is the background
    label_name_to_value = {'_background_': 0}
    for shape in data['shapes']:
        label_name = shape['label']
        if label_name in label_name_to_value:
            label_value = label_name_to_value[label_name]
        else:
            label_value = len(label_name_to_value)
            label_name_to_value[label_name] = label_value

    # label_values must be dense
    label_values, label_names = [], []
    for ln, lv in sorted(label_name_to_value.items(), key=lambda x: x[1]):
        label_values.append(lv)
        label_names.append(ln)
    assert label_values == list(range(len(label_values)))

    lbl = utils.shapes_to_label(img.shape, data['shapes'], label_name_to_value)

    captions = ['{}: {}'.format(lv, ln)
                for ln, lv in label_name_to_value.items()]
    lbl_viz = utils.draw_label(lbl, img, captions)

    PIL.Image.fromarray(img).save(osp.join(out_dir, 'img.png'))
    PIL.Image.fromarray(lbl).save(osp.join(out_dir, 'label.png'))
    PIL.Image.fromarray(lbl_viz).save(osp.join(out_dir, 'label_viz.png'))

    with open(osp.join(out_dir, 'label_names.txt'), 'w') as f:
        for lbl_name in label_names:
            f.write(lbl_name + '\n')

    warnings.warn('info.yaml is being replaced by label_names.txt')
    info = dict(label_names=label_names)
    with open(osp.join(out_dir, 'info.yaml'), 'w') as f:
        yaml.safe_dump(info, f, default_flow_style=False)

    print('Saved to: %s' % out_dir)


if __name__ == '__main__':
    main()
Usage:
(py35) byz@ubuntu:~/Mask/Mask_RCNN/net_img$ python json_to_dataset.py rgb_5.json
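The UnicodeDecodeError in the traceback above complains about byte 0xff at position 0, which is how JPEG files begin (they start with the bytes FF D8). A small guard that looks at the first bytes before parsing can catch a mixed-up file name early. This is a sketch only; the function name and the returned labels are hypothetical.

```python
def sniff_file_type(path):
    """Guess whether a file is a JPEG image or JSON text from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(64)
    # JPEG files start with the two-byte marker FF D8.
    if head.startswith(b"\xff\xd8"):
        return "jpeg"
    # JSON documents start with '{' or '[' after optional whitespace.
    stripped = head.lstrip(b" \t\r\n")
    if stripped[:1] in (b"{", b"["):
        return "json"
    return "unknown"
```

Calling sniff_file_type on the argument before json.load makes it possible to print a clear message such as "expected a .json file" instead of a decoding traceback.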
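(II)
##################--(ubuntu_opencv: converting 16-bit grayscale images to 8-bit, and the compile problem I hit)--###################
When loading a grayscale image, always check the bit depth it was stored with.
OpenCV reads images as 8-bit by default; if the image is 16-bit, the small label values are lost and it is read as all zeros, which breaks the program.
The code below processes images in batch; only the paths need to be changed.
// Assumed headers: the program needs sprintf, cv::Mat, cv::imread and cv::imwrite
#include <cstdio>
#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
using namespace std;
using namespace cv;
int main(void){
    char buff1[100];
    char buff2[100];
    for(int i=1;i<901;i++){
        sprintf(buff1,"/home/byz/app/coco/tmp/json/rgb_%d_json/label.png",i);
        sprintf(buff2,"/home/byz/app/coco/tmp/mask/%d.png",i);
        Mat src;
        //Mat dst;
        // Read the file with its original bit depth (16-bit), not converted to 8-bit
        src=imread(buff1,CV_LOAD_IMAGE_UNCHANGED);
        Mat ff=Mat::zeros(src.rows,src.cols,CV_8UC1);
        // Copy every 16-bit label value into the 8-bit image (values assumed to fit in 0..255)
        for(int k=0;k<src.rows;k++){
            for(int kk=0;kk<src.cols;kk++){
                ff.at<uchar>(k,kk)=(uchar)src.at<ushort>(k,kk);
            }
        }
        //src.copyTo(dst);
        //imshow("haha",ff*100);
        //waitKey(0);
        imwrite(buff2,ff);
    }
    return 0;
}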
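The same 16-bit to 8-bit conversion can also be sketched in Python, which avoids the compile step. This assumes NumPy and Pillow are installed (Pillow is already in the package list) and that every label value fits in 0..255; the function names are hypothetical and this is not the 16_to_8 program used in these notes.

```python
import numpy as np


def to_uint8_labels(label16):
    """Convert a 16-bit label array to uint8 without rescaling the values."""
    label16 = np.asarray(label16)
    if label16.size and label16.max() > 255:
        raise ValueError("label values above 255 do not fit in 8 bits")
    return label16.astype(np.uint8)


def convert_file(src_path, dst_path):
    """Read a 16-bit label.png and write it back as an 8-bit PNG."""
    from PIL import Image  # imported here so to_uint8_labels needs only NumPy
    label16 = np.array(Image.open(src_path))
    Image.fromarray(to_uint8_labels(label16)).save(dst_path)
```

The explicit range check is a design choice: a silent cast would wrap values above 255 and corrupt the class ids.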
The compile problem hit when building the 16-bit to 8-bit converter with OpenCV on Ubuntu:
(py35) byz@ubuntu:~/Mask/Mask_RCNN/net_img$ sudo g++ 16_to_8.cpp -lpthread -o 16_to_8
/tmp/ccKQtPkQ.o: In function `main':
16_to_8.cpp:(.text+0xaf): undefined reference to `cv::imread(cv::String const&, int)'
16_to_8.cpp:(.text+0x106): undefined reference to `cv::Mat::zeros(int, int, int)'
16_to_8.cpp:(.text+0x222): undefined reference to `cv::imwrite(cv::String const&, cv::_InputArray const&, std::vector
/tmp/ccKQtPkQ.o: In function `cv::String::String(char const*)':
16_to_8.cpp:(.text._ZN2cv6StringC2EPKc[_ZN2cv6StringC5EPKc]+0x54): undefined reference to `cv::String::allocate(unsigned long)'
/tmp/ccKQtPkQ.o: In function `cv::String::~String()':
16_to_8.cpp:(.text._ZN2cv6StringD2Ev[_ZN2cv6StringD5Ev]+0x14): undefined reference to `cv::String::deallocate()'
/tmp/ccKQtPkQ.o: In function `cv::String::operator=(cv::String const&)':
16_to_8.cpp:(.text._ZN2cv6StringaSERKS0_[_ZN2cv6StringaSERKS0_]+0x28): undefined reference to `cv::String::deallocate()'
/tmp/ccKQtPkQ.o: In function `cv::Mat::~Mat()':
16_to_8.cpp:(.text._ZN2cv3MatD2Ev[_ZN2cv3MatD5Ev]+0x39): undefined reference to `cv::fastFree(void*)'
/tmp/ccKQtPkQ.o: In function `cv::Mat::operator=(cv::Mat const&)':
16_to_8.cpp:(.text._ZN2cv3MataSERKS0_[_ZN2cv3MataSERKS0_]+0x115): undefined reference to `cv::Mat::copySize(cv::Mat const&)'
/tmp/ccKQtPkQ.o: In function `cv::Mat::release()':
16_to_8.cpp:(.text._ZN2cv3Mat7releaseEv[_ZN2cv3Mat7releaseEv]+0x4b): undefined reference to `cv::Mat::deallocate()'
collect2: error: ld returned 1 exit status
Solution (add the OpenCV compile and link flags from pkg-config):
(py35) byz@ubuntu:~/Mask/Mask_RCNN/net_img$ sudo g++ 16_to_8.cpp -lpthread -o 16_to_8 `pkg-config --cflags --libs opencv`
(py35) byz@ubuntu:~/Mask/Mask_RCNN/net_img$ ll
total 164
drwxrwxr-x 5 byz byz 4096 Apr 27 11:58 ./
drwxrwxr-x 8 byz byz 4096 Apr 27 11:38 ../
-rwxr-xr-x 1 root root 59424 Apr 27 11:58 16_to_8*
-rw-rw-r-- 1 byz byz 833 Apr 27 11:48 16_to_8.cpp
Addendum (if the problem is pthread):
/tmp/ccM2tvqF.o: In function `main':
thread_c.c:(.text+0x1f): undefined reference to `pthread_create'
thread_c.c:(.text+0x52): undefined reference to `pthread_create'
thread_c.c:(.text+0x7d): undefined reference to `pthread_join'
thread_c.c:(.text+0xa9): undefined reference to `pthread_join'
collect2: ld returned 1 exit status
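Solution:
The pthread library is not linked by default on Linux, and linking needs libpthread.a. Programs that create threads with pthread_create must therefore be compiled with the -lpthread flag: gcc test_thread.c -lpthread -o test_thread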
(III)
###############--(Running my own code: ~/Mask/Mask_RCNN$ python train_rcnn_test.py fails)--###################
Problem:
(RuntimeError: Invalid DISPLAY variable)
Details:
Traceback (most recent call last):
File "train_rcnn_test.py", line 210, in <module>
visualize.display_top_masks(image, mask, class_ids, dataset_train.class_names)
File "/home/byz/Mask/Mask_RCNN/mrcnn/visualize.py", line 304, in display_top_masks
display_images(to_display, titles=titles, cols=limit + 1, cmap="Blues_r")
File "/home/byz/Mask/Mask_RCNN/mrcnn/visualize.py", line 48, in display_images
plt.figure(figsize=(14, 14 * rows // cols))
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/matplotlib/pyplot.py", line 548, in figure
**kwargs)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/matplotlib/backend_bases.py", line 161, in new_figure_manager
return cls.new_figure_manager_given_figure(num, fig)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/matplotlib/backend_bases.py", line 167, in new_figure_manager_given_figure
canvas = cls.FigureCanvas(figure)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/matplotlib/backends/backend_qt5agg.py", line 24, in __init__
super(FigureCanvasQTAgg, self).__init__(figure=figure)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/matplotlib/backends/backend_qt5.py", line 234, in __init__
_create_qApp()
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/matplotlib/backends/backend_qt5.py", line 125, in _create_qApp
raise RuntimeError('Invalid DISPLAY variable')
RuntimeError: Invalid DISPLAY variable
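1. Problem: plotting with matplotlib works locally, but plotting over ssh fails with RuntimeError: Invalid DISPLAY variable
2. Cause: matplotlib's default backend is an interactive one (often TkAgg; the traceback above shows Qt5Agg). FltkAgg, GTK, GTKAgg, GTKCairo, TkAgg, Wx and WxAgg all require a GUI, so they fail in an ssh session. To check the current backend: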
import matplotlib.pyplot as plt
plt.get_backend()
Solution:
Select a backend that does not need a GUI (Agg, Cairo, PS, PDF or SVG):
import matplotlib.pyplot as plt
plt.switch_backend('agg')
Patch the source:
mrcnn/visualize.py --
def display_images(images, titles=None, cols=4, cmap=None, norm=None, interpolation=None):
    titles = titles if titles is not None else [""] * len(images)
    plt.switch_backend('agg')  # <------- added
    rows = len(images) // cols + 1
    plt.figure(figsize=(14, 14 * rows // cols))
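Instead of hard-coding the backend inside the library source, the choice can be made once at the top of the training script, depending on whether a display is available. The sketch below assumes a Linux/X11 setup where an unset or empty DISPLAY variable means no GUI; the function names are hypothetical.

```python
import os


def choose_backend(environ=None):
    """Return 'agg' when no X display is configured, otherwise None (keep default)."""
    environ = os.environ if environ is None else environ
    return None if environ.get("DISPLAY") else "agg"


def configure_matplotlib(environ=None):
    """Apply the choice; call this before importing matplotlib.pyplot."""
    import matplotlib  # imported lazily so choose_backend has no dependencies
    backend = choose_backend(environ)
    if backend is not None:
        matplotlib.use(backend)
    return backend
```

With the Agg backend nothing is shown on screen, so figures have to be written out with plt.savefig.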
(IV)
#############################--(Training the MaskRCNN model)--###############################
Run:
model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)
model.load_weights(filepath=COCO_MODEL_PATH, by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
Problem:
Traceback (most recent call last):
File "train_rcnn_test.py", line 209, in <module>
model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)
File "/home/byz/pro/Mask_RCNN-master/mrcnn/model.py", line 1820, in __init__
self.keras_model = self.build(mode=mode, config=config)
File "/home/byz/pro/Mask_RCNN-master/mrcnn/model.py", line 1976, in build
train_bn=config.TRAIN_BN)
File "/home/byz/pro/Mask_RCNN-master/mrcnn/model.py", line 937, in fpn_classifier_graph
name="mrcnn_class")(mrcnn_class_logits)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/topology.py", line 619, in __call__
output = self.call(inputs, **kwargs)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/layers/wrappers.py", line 213, in call
y = self.layer.call(inputs, **kwargs)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/layers/core.py", line 304, in call
return self.activation(inputs)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/activations.py", line 29, in softmax
return K.softmax(x)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2963, in softmax
return tf.nn.softmax(x, axis=axis)
TypeError: softmax() got an unexpected keyword argument 'axis'
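Solution:
The Keras and TensorFlow versions do not match.
Mine were tensorflow-gpu (1.4.1) + Keras (2.1.6); downgrading Keras to 2.1.0 fixed it.
A new problem then appeared during model training:
Traceback (most recent call last):
File "train_rcnn_test.py", line 232, in <module>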
layers='heads')
File "/home/byz/Mask/Mask_RCNN/mrcnn/model.py", line 2308, in train
self.compile(learning_rate, self.config.LEARNING_MOMENTUM)
File "/home/byz/Mask/Mask_RCNN/mrcnn/model.py", line 2143, in compile
tf.reduce_mean(layer.output, keepdims=True)
TypeError: reduce_mean() got an unexpected keyword argument 'keepdims'
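Cause:
The TensorFlow version is incompatible with this keyword. In newer TensorFlow, keep_dims is a deprecated alias for keepdims; version 1.7 already deprecates keep_dims and prints this warning:
Instructions for updating:
keep_dims is deprecated, use keepdims instead
TensorFlow 1.4.1 only accepts the old name, so either upgrade TensorFlow or change the keyword from keepdims to keep_dims.
Solution:
Change the keyword in the Mask-RCNN code that uses it (/Mask_RCNN/mrcnn/model.py, the tf.reduce_mean call shown in the traceback).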
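If the same model.py has to run on both older and newer TensorFlow, the keyword difference can be hidden behind a small wrapper instead of editing the keyword by hand. This is only a sketch: reduce_mean_compat is a hypothetical helper, and it treats any TypeError from the first call as "keyword not supported", which is an assumption.

```python
def reduce_mean_compat(reduce_mean, tensor, keepdims=True):
    """Call reduce_mean with `keepdims`, falling back to the older `keep_dims`.

    `reduce_mean` is passed in (for example tf.reduce_mean) so the helper
    itself does not depend on any particular TensorFlow version.
    """
    try:
        return reduce_mean(tensor, keepdims=keepdims)
    except TypeError:
        # Older releases only know the keyword under its previous name.
        return reduce_mean(tensor, keep_dims=keepdims)
```

In compile() the failing line would then read reduce_mean_compat(tf.reduce_mean, layer.output, keepdims=True).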
(V)
###################################--()--##########################################
Problem:
Exception ignored in:
Traceback (most recent call last):
Details:
Traceback (most recent call last):
File "train_rcnn_test.py", line 232, in <module>
layers='heads')
File "/home/byz/Mask/Mask_RCNN/mrcnn/model.py", line 2308, in train
self.compile(learning_rate, self.config.LEARNING_MOMENTUM)
File "/home/byz/Mask/Mask_RCNN/mrcnn/model.py", line 2143, in compile
tf.reduce_mean(layer.output, keepdims=True)
TypeError: reduce_mean() got an unexpected keyword argument 'keepdims'
Exception ignored in:
Traceback (most recent call last):
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 696, in __del__
TypeError: 'NoneType' object is not callable
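Cause of this problem:
When the code was recompiled, only the .pyc files were deleted and the __pycache__ folder was left behind.
Running a Python file generally creates a __pycache__ folder that holds the .pyc files generated for the other Python files you import.
For example, running training.py, which imports model.py and input_data.py, generates .pyc files for those two files.
Solution:
Before running training.py (your own Python file) again, delete the whole __pycache__ folder.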
(py35) byz@ubuntu:~/Mask/Mask_RCNN$ ll mrcnn/
total 228
drwxrwxr-x 3 byz byz 4096 Apr 27 17:50 ./
drwxrwxr-x 10 byz byz 4096 Apr 27 19:21 ../
-rw-rw-r-- 1 byz byz 8374 Apr 27 11:30 config.py
-rw-rw-r-- 1 byz byz 1 Apr 27 11:30 __init__.py
-rw-rw-r-- 1 byz byz 123897 Apr 27 11:30 model.py
-rw-rw-r-- 1 byz byz 7022 Apr 27 11:30 parallel_model.py
drwxrwxr-x 2 byz byz 4096 Apr 27 15:36 __pycache__/
-rw-rw-r-- 1 byz byz 10037 Apr 27 17:37 train_rcnn_test.py
-rw-rw-r-- 1 byz byz 33390 Apr 27 11:30 utils.py
-rw-rw-r-- 1 byz byz 19097 Apr 27 15:36 visualize.py
(py35) byz@ubuntu:~/Mask/Mask_RCNN$ rm -rf ./mrcnn/__pycache__/
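When several packages are involved, the rm -rf above has to be repeated per folder. A small script can remove every __pycache__ directory under a project root in one go; this is a sketch with a hypothetical function name, using only the standard library.

```python
import os
import shutil


def remove_pycache(root):
    """Delete every __pycache__ directory below `root`; return the removed paths."""
    removed = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if "__pycache__" in dirnames:
            target = os.path.join(dirpath, "__pycache__")
            shutil.rmtree(target)
            # Remove it from the walk list so os.walk does not try to enter it.
            dirnames.remove("__pycache__")
            removed.append(target)
    return removed
```

For the layout shown above, remove_pycache("mrcnn") deletes the cache folder and leaves the .py files untouched.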
(VI)
###################################--(Python multiprocessing issue)--##########################################
Warning:
UserWarning('Using a generator with `use_multiprocessing=True`'
/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py:2022: UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the`keras.utils.Sequence class.
UserWarning('Using a generator with `use_multiprocessing=True`'
See:
https://github.com/keras-team/keras/pull/8662
(VII)
#################################--(train_rcnn)--######################################
Run:
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=1,
            layers='heads')
Error:
Traceback (most recent call last):
File "train_rcnn_test.py", line 232, in <module>
layers='heads')
File "/home/byz/Mask/Mask_RCNN/mrcnn/model.py", line 2328, in train
use_multiprocessing=True,
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 2077, in fit_generator
class_weight=class_weight)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 1791, in train_on_batch
check_batch_axis=True)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 1409, in _standardize_user_data
exception_prefix='input')
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 154, in _standardize_input_data
str(array.shape))
ValueError: Error when checking input: expected input_image_meta to have shape (None, 16) but got array with shape (1, 17)
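Solution:
Edit the config: (py35) byz@ubuntu:~/Mask/Mask_RCNN$ vim mrcnn/config.py
At first I thought IMAGE_META_SIZE was short by one class:
self.IMAGE_META_SIZE = 1 + 3 + 3 + 4 + 1 + self.NUM_CLASSES
so I changed it to:
self.IMAGE_META_SIZE = 1 + 1 + 3 + 3 + 4 + 1 + self.NUM_CLASSES
The program ran with that change, but the real cause was that my own code was one class short:
NUM_CLASSES = 1 + 3 (I have 4 classes, so it should be 1 + 4, where the 1 is the background; I got muddled)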
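The two shapes in the error message follow directly from the IMAGE_META_SIZE formula in config.py. The arithmetic can be checked with a one-line function; the function name is hypothetical, and the field names in the comment are my reading of the formula's terms rather than something stated in these notes.

```python
def image_meta_size(num_classes):
    """Length of the image_meta vector for a given NUM_CLASSES."""
    # Assumed meaning of the terms: image id (1) + original shape (3)
    # + resized shape (3) + window (4) + scale (1) + one flag per class.
    return 1 + 3 + 3 + 4 + 1 + num_classes


# NUM_CLASSES = 1 + 3 gives the 16 the model expected,
# NUM_CLASSES = 1 + 4 gives the 17 the data actually had.
sizes = (image_meta_size(1 + 3), image_meta_size(1 + 4))
```

So a mismatch of exactly one between the expected and the received length points at NUM_CLASSES rather than at the formula.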
(VIII)
################################--(model.detect)--################################
Run:
results = model.detect([original_image], verbose=1)
Error:
+++++++=========Model output:
original_image shape: (512, 512, 3) min: 0.00000 max: 255.00000 uint8
image_meta shape: (17,) min: 0.00000 max: 512.00000 int64 <_________________(here)
gt_class_id shape: (2, 4) min: 92.00000 max: 424.00000 int32
gt_bbox shape: (2, 4) min: 92.00000 max: 424.00000 int32
gt_mask shape: (512, 512, 2) min: 0.00000 max: 1.00000 uint8
Processing 1 images
image shape: (512, 512, 3) min: 0.00000 max: 255.00000 uint8
molded_images shape: (1, 512, 512, 3) min: -123.70000 max: 151.10000 float64
image_metas shape: (1, 16) min: 0.00000 max: 512.00000 int64 <_________________(here)
anchors shape: (1, 65472, 4) min: -0.53137 max: 1.40612 float32
Traceback (most recent call last):
File "train_rcnn_test.py", line 280, in <module>
results = model.detect([original_image], verbose=1)
File "/home/byz/Mask/Mask_RCNN/mrcnn/model.py", line 2478, in detect
self.keras_model.predict([molded_images, image_metas, anchors], verbose=0)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 1730, in predict
check_batch_axis=False)
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 154, in _standardize_input_data
str(array.shape))
ValueError: Error when checking : expected input_image_meta to have shape (None, 17) but got array with shape (1, 16)
Exception ignored in:
Traceback (most recent call last):
File "/home/byz/anaconda3/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 696, in __del__
TypeError: 'NoneType' object is not callable
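Solution:
Check that the number of classes is correct (the training config and the inference config must use the same NUM_CLASSES).
Author's WeChat: lovebyz99
(Please add a note when getting in touch; anyone with startup ideas is welcome to contact me.)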
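One way to check the class count is to derive it from the label_names.txt files that json_to_dataset.py writes into each rgb_N_json folder. The sketch below assumes those folders sit together under one directory (json/ in these notes) and that the background label is named _background_; the function names are hypothetical.

```python
import glob
import os


def collect_label_names(json_root):
    """Union of all label names found in <json_root>/*/label_names.txt."""
    names = set()
    for path in glob.glob(os.path.join(json_root, "*", "label_names.txt")):
        with open(path) as f:
            names.update(line.strip() for line in f if line.strip())
    return names


def num_classes(json_root, background="_background_"):
    """NUM_CLASSES = 1 (background) + number of distinct object labels."""
    names = collect_label_names(json_root)
    names.discard(background)
    return 1 + len(names)
```

The result of num_classes("json") can be compared with NUM_CLASSES in both the training and the inference config.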