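Object detection methods fall into two branches, one-stage and two-stage. As the names suggest, the difference is whether proposing candidate regions and boxing the objects happen in two separate steps or in a single pass. Two-stage methods follow a "region proposals + deep-learning classification" scheme: they first extract candidate regions and then classify each region, mainly with deep-learning methods. One-stage methods are faster because they no longer generate proposal boxes in a separate step.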
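Two-stage
A two-stage algorithm first uses a network to generate proposals. The RPN (Region Proposal Network) is attached after the image feature-extraction backbone and is trained with an RPN loss (bbox regression loss + classification loss). The proposals produced by the RPN are then passed to the subsequent network for finer bbox regression and classification.
One-stage
One-stage methods pursue speed and drop the two-stage architecture: instead of a separate proposal network, they sample densely on the feature map directly and produce a large number of prior boxes, as in YOLO's grid approach. These prior boxes do not go through two processing steps, and their sizes are usually set by hand.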
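Applications of one-stage and two-stage algorithms:
Two-stage algorithms are mainly the R-CNN family: R-CNN, Fast R-CNN, and Faster R-CNN. The later Mask R-CNN combines the Faster R-CNN architecture, a ResNet and FPN (Feature Pyramid Networks) backbone, and the segmentation approach from FCN; it performs segmentation while also improving detection accuracy.
The most typical one-stage algorithm is YOLO, which is extremely fast.
YOLO (You Only Look Once): the name itself reflects how fast the detector is.
YOLO uses a single CNN model to perform end-to-end object detection.
Non-maximum suppression (NMS) was introduced in the author's earlier post 《CV学习笔记-边缘提取》; readers unfamiliar with it can look it up or refer to that post.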
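In the example image above, the picture is divided into a $3\times 3$ grid of cells. This mimics how YOLO divides the input image into an $s\times s$ grid, here with $s=3$. When the center of an object falls inside a cell, that cell is responsible for detecting the object. In the example, the objects are the person in blue, the dog, and the pony.
The spatial size of YOLO's output follows the grid rather than the pixel size of the image: for an image divided into an $S\times S$ grid, the final output is $S\times S\times n$, where $n$ is the number of channels. The key point is how $n$ is determined: it depends on the number of bounding boxes per cell, the confidence of each box, and the number of object classes. Later sections cover this in detail.
The number of output channels is $2\times (4+1)+[class\_num]$. Here 2 is the number of bounding boxes per cell (as specified in the paper), 4 is the coordinate information of a bounding box, 1 is the confidence of a bounding box, and $[class\_num]$ is the number of object classes.
The channel count from the previous section also shows what each bounding box carries. The 4 stands for the coordinates of the box, shown as the green part in the figure above: $t_x, t_y, t_w, t_h$ are the x coordinate, y coordinate, width, and height of the box. The 1 is the confidence of the box, which can be thought of informally as a score for how good the detection is. The pink part holds the probability of each class, so it has as many dimensions as there are classes to detect.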
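YOLO does not predict the exact center coordinates of a bounding box directly. Instead, it predicts the center as an offset inside the grid cell that is responsible for the object.
Taking the figure above as an example, if the predicted center is $(0.4, 0.7)$, the center's coordinates on the $13\times 13$ feature map are $(6.4, 6.7)$ (the top-left corner of the red cell is $(6, 6)$).
However, if the predicted x, y values exceed 1, for example $(1.2, 0.7)$, the predicted center becomes $(7.2, 6.7)$. This center lies in the cell to the right of the red cell, which breaks the premise behind YOLO: if we assume the red cell is responsible for predicting the dog, the dog's center must lie inside the red cell, not in a neighboring cell.
To solve this, we apply the sigmoid function to the output, squashing it into the range 0 to 1, which keeps the center inside the cell that makes the prediction.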
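The center decoding described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the helper names `sigmoid` and `decode_center` and the sample raw outputs are assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_center(raw_xy, cell_xy):
    """Turn raw network outputs into a center on the feature map.

    raw_xy:  unbounded outputs (t_x, t_y) for one box (hypothetical values)
    cell_xy: top-left corner (column, row) of the responsible cell
    """
    tx, ty = raw_xy
    cx, cy = cell_xy
    # The sigmoid keeps each offset in (0, 1), so the center cannot leave the cell.
    return cx + sigmoid(tx), cy + sigmoid(ty)


# Even a large raw x output stays inside the red cell whose corner is (6, 6).
center = decode_center((2.0, -1.0), (6, 6))
```

Whatever the raw values are, both coordinates of `center` stay strictly between 6 and 7, which is exactly the guarantee the sigmoid is meant to provide.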
The class-specific confidence scores (class scores) of each bounding box are computed with the formula below. They express both how likely it is that the object in the box belongs to each class and how well the box fits the object.
$$Pr(Class_i|Object)\cdot Pr(Object)\cdot IOU_{pred}^{truth}=Pr(Class_i)\times IOU_{pred}^{truth}$$
In $Pr(Object)\cdot IOU_{pred}^{truth}$, the left factor indicates whether the cell containing this bounding box contains an object: 1 if it does, 0 if it does not. The right factor measures how accurate the box is. It is the IOU of two boxes (the ground truth and the predicted box), that is, the area of their intersection divided by the area of their union. The larger the value, the more the two boxes overlap and the more accurate the prediction.
IOU (intersection over union) was introduced in the author's previous post, 《CV学习笔记-Faster-RCNN》; readers unfamiliar with it can refer to that post.
Overall, the formula multiplies the class information predicted by each cell with the confidence predicted for each bounding box, which gives the class-specific confidence score of every bounding box.
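As a small worked example of the formula, the sketch below computes an IOU and then the class-specific scores. The function names, the `(x1, y1, x2, y2)` box layout, and the sample numbers are assumptions made for illustration only.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_scores(class_probs, objectness, iou_value):
    """Pr(Class_i|Object) * Pr(Object) * IOU for every class i."""
    confidence = objectness * iou_value
    return [p * confidence for p in class_probs]


# Two 2x2 boxes shifted by one unit: intersection 1, union 7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
scores = class_scores([0.7, 0.2, 0.1], objectness=1.0, iou_value=overlap)
```

At inference time there is no ground truth, so the network's predicted confidence plays the role of `objectness * iou_value`.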
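In the example figure above, the input size is $448\times 448$, the grid size S is 7, there are 2 bounding boxes per cell, and there are 20 classes, so the output is a $7\times 7\times 30$ tensor.
The channel computation was covered in the sections above. To summarize it abstractly, let $S$ be the grid size, $B$ the number of bounding boxes per cell, and $C$ the number of classes.
The network output is then a tensor of size $S\times S\times (5\times B +C)$.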
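The formula is easy to check numerically. The helper below is only an illustrative sketch; the name `yolo_v1_output_shape` is made up for this example.

```python
def yolo_v1_output_shape(grid_size: int, boxes_per_cell: int, num_classes: int):
    """Shape S x S x (5 * B + C) of the output tensor described above."""
    channels = 5 * boxes_per_cell + num_classes
    return (grid_size, grid_size, channels)


# S = 7, B = 2, C = 20, as in the 448 x 448 example.
shape = yolo_v1_output_shape(7, 2, 20)
```

With the values from the example, `shape` is `(7, 7, 30)`.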
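Breaking the process in the figure above into finer steps:
The network applies more than 20 convolutions and 4 max-pooling operations. The $3\times 3$ convolutions extract features, and the $1\times 1$ convolutions compress features and change the channel count. The image is finally reduced to $7\times 7\times filter$, which is equivalent to dividing the whole image into a $7\times 7$ grid in which each cell is responsible for detecting objects in its own region.
At the end, fully connected layers produce an output of size $7\times 7\times 30$, where $7\times 7$ is the grid and the 30 channels are composed as follows:
Each cell that contains an object center yields bounding boxes (two in this example). The grid is $7\times 7$, so the first two dimensions of the output stay $7\times 7$, and the channel count is determined as follows:
A bounding box has $(t_x,t_y,t_w,t_h)$ plus one confidence, 5 parameters in total. Concatenating the 5 parameters of the two boxes gives 10 values.
Since there are 20 classes to detect, the 20 class probabilities are appended, which brings the channel count to 10+20=30. The final network output is therefore a $7\times 7\times 30$ tensor.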
According to the formula for the class-specific confidence score,
$$Pr(Class_i|Object)\cdot Pr(Object)\cdot IOU_{pred}^{truth}=Pr(Class_i)\times IOU_{pred}^{truth}$$
the confidence of a bounding box is multiplied by each class probability. The 1 in 4+1 is the box confidence (the $Pr(Object)$ part of the formula), and the per-class values that follow are the conditional probabilities $Pr(Class_i|Object)$. The product is a $20\times 1$ vector for each bbox, denoted bbx (x being the index), which holds the score of every class.
The same operation is applied to every bbox of every cell: 7x7x2 = 98 bboxes (each bbox has both class information and coordinates).
After every cell has been processed, there are 98 bboxes.
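The per-cell multiplication can be written as one broadcasted product. The sketch below uses NumPy with random numbers standing in for real network outputs; the array names and the random data are assumptions for illustration.

```python
import numpy as np

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes
rng = np.random.default_rng(0)

# Stand-ins for the network output (random values, NOT real predictions).
box_conf = rng.random((S, S, B))        # one confidence per box
class_probs = rng.random((S, S, C))     # one class distribution per cell

# Broadcast to (S, S, B, C): every box of a cell is paired with every class.
scores = box_conf[..., :, None] * class_probs[..., None, :]

# Flatten the grid: one row of 20 class scores per bbox, 98 rows in total.
scores = scores.reshape(-1, C)
```

Each row of `scores` is the 20-dimensional vector of one bbox, so the array has 98 rows.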
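Once the class-specific confidence score of each bbox is available, a threshold is set to filter out low-scoring boxes. The 98 bboxes are then sorted and filtered class by class according to their class scores, and NMS is applied to the remaining boxes to obtain the final detections.
The sorting and filtering step in detail (taking the class dog as an example):
Take the highest score as bbox_max and compare it with each smaller non-zero score (bbox_cur). A non-zero score means that box also has some probability of containing this class, so the IOU is used as the criterion for the next filtering step.
If any scores are kept, the procedure repeats recursively: the next non-zero bbox_cur (0.2) becomes bbox_max and the IOU comparison continues:
In the end, only the boxes that are needed remain.
Back in the main flow, applying NMS to the kept boxes gives the final detection result.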
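The sort-and-suppress loop for a single class can be sketched as follows. This is a plain-Python illustration of greedy NMS under assumed conventions (boxes as `(x1, y1, x2, y2)`, an IOU threshold of 0.5 as in the project's `nms_threshold`); the project itself calls `tf.image.non_max_suppression` instead.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS for one class; returns the indices of the kept boxes."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)              # current bbox_max
        keep.append(best)
        # Drop every remaining box that overlaps bbox_max too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with an IOU of about 0.68, so it is suppressed, while the third box does not overlap and survives.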
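The result is interpreted as follows:
Illustration of the final result:
The whole YOLO pipeline as a diagram:
When using an algorithm, we should know its strengths and weaknesses and make trade-offs based on the requirements and the actual development setting (dataset, required accuracy, and so on). YOLO's strengths were covered above, the main one being speed, but the algorithm also has limitations: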
The network structure of Yolo2 is as follows:
Improvements:
The $13\times 13\times 425$ output is computed as follows: $13\times 13$ means the input image is divided into a $13\times 13$ grid, and $425 = 85\times 5$. In 85 (80+5), 80 is the number of classes in the COCO dataset and 5 is the $(t_x,t_y,t_w,t_h)$ of each box plus one confidence. The 5 in $85\times 5$ corresponds to the 5 prior boxes.
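The same arithmetic as a short sketch, assuming the per-anchor layout described above (each prior box carries its own 4 coordinates, 1 confidence, and class probabilities). The function name is hypothetical.

```python
def anchor_based_output_shape(grid_size: int, num_anchors: int, num_classes: int):
    """Output shape when every prior box predicts (4 + 1 + num_classes) values."""
    return (grid_size, grid_size, num_anchors * (num_classes + 5))


# Yolo2 on COCO: 13 x 13 grid, 5 prior boxes, 80 classes.
v2_shape = anchor_based_output_shape(13, 5, 80)

# The yolo3 listing further down uses 3 prior boxes per scale: 3 x (80 + 5).
v3_shape = anchor_based_output_shape(13, 3, 80)
```

`v2_shape` reproduces the 425 channels above, and `v3_shape` gives the 255 channels mentioned in the comments of the yolo3 code.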
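Dimension Clusters in Yolo2:
K-means clustering for prior boxes: YOLO2 tries to derive prior boxes that better match the object sizes in the samples, which makes it easier for the network to adjust a prior box to the actual position. To do so, YOLO2 clusters the labeled boxes in the training set to find box sizes that match the samples as closely as possible. The most important choice in a clustering algorithm is how to measure the "distance" between two boxes. With the usual Euclidean distance, large boxes produce larger errors, but what we care about is the IOU of the boxes. YOLO2 therefore uses the following formula as the "distance" between two boxes during clustering.
$$d(box,centroid)=1-IOU(box,centroid)$$
For each choice of k, the k centroid boxes are obtained and the Avg IOU between the labeled boxes in the samples and the centroids is computed. Naturally, the larger k is, the higher the Avg IOU.
YOLO2 chooses k=5 as a compromise between the number of boxes and the IOU. Compared with hand-picked prior boxes, 5 clustered boxes already reach an Avg IOU of 61, on par with the 60.9 Avg IOU of 9 hand-set prior boxes.
The authors finally take 5 cluster centers as the prior boxes. For the two datasets, the width and height of the 5 prior boxes are: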
COCO: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)
VOC: (1.3221, 1.73145), (3.19275, 4.00944), (5.05587, 8.09892), (9.47112, 4.84053), (11.2364, 10.0071)
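A minimal sketch of k-means with the $1-IOU$ distance is shown below. It is not the authors' implementation: the function names, the random initialization, the mean-based centroid update, and the toy box sizes are assumptions made for illustration. Boxes are represented by width and height only and compared as if they shared a corner.

```python
import numpy as np


def wh_iou(boxes, centroids):
    """IOU between (N, 2) boxes and (k, 2) centroids, both given as (w, h)."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_boxes = boxes[:, 0] * boxes[:, 1]
    area_centroids = centroids[:, 0] * centroids[:, 1]
    return inter / (area_boxes[:, None] + area_centroids[None, :] - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs using d(box, centroid) = 1 - IOU(box, centroid)."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every box to the centroid with the smallest 1 - IOU distance.
        assign = np.argmin(1.0 - wh_iou(boxes, centroids), axis=1)
        updated = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Avg IOU: mean over boxes of the IOU with their best centroid.
    avg_iou = wh_iou(boxes, centroids).max(axis=1).mean()
    return centroids, avg_iou


# Toy data: two small and two large boxes (made-up sizes).
toy_boxes = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8]])
anchors, avg_iou = kmeans_anchors(toy_boxes, k=2)
```

Running the function for several values of k and comparing the returned `avg_iou` reproduces the kind of trade-off curve the text describes.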
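Yolov3 improves the most over v2 and is the most widely used of these object detection networks. Its network structure:
Improvements:
The project is implemented in TensorFlow; the listings below use the TensorFlow 1.x API (tf.placeholder, tf.Session, tf.layers, tf.contrib).
yolo_predict.py implements the main YOLO flow. The key line that obtains the YOLO model is model = yolo(config.norm_epsilon, config.norm_decay, self.anchors_path, self.classes_path, pre_train = False)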
import os
import config
import random
import colorsys
import numpy as np
import tensorflow as tf
from model.yolo3_model import yolo
class yolo_predictor:
def __init__(self, obj_threshold, nms_threshold, classes_file, anchors_file):
"""
Introduction
------------
初始化函数
Parameters
----------
obj_threshold: 目标检测为物体的阈值
nms_threshold: nms阈值
"""
self.obj_threshold = obj_threshold
self.nms_threshold = nms_threshold
# 预读取
self.classes_path = classes_file
self.anchors_path = anchors_file
# 读取种类名称
self.class_names = self._get_class()
# 读取先验框
self.anchors = self._get_anchors()
# 画框框用
hsv_tuples = [(x / len(self.class_names), 1., 1.)for x in range(len(self.class_names))]
self.colors = list(map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples))
self.colors = list(map(lambda x: (int(x[0] * 255), int(x[1] * 255), int(x[2] * 255)), self.colors))
random.seed(10101)
random.shuffle(self.colors)
random.seed(None)
def _get_class(self):
"""
Introduction
------------
读取类别名称
"""
classes_path = os.path.expanduser(self.classes_path)
with open(classes_path) as f:
class_names = f.readlines()
class_names = [c.strip() for c in class_names]
return class_names
def _get_anchors(self):
"""
Introduction
------------
读取anchors数据
"""
anchors_path = os.path.expanduser(self.anchors_path)
with open(anchors_path) as f:
anchors = f.readline()
anchors = [float(x) for x in anchors.split(',')]
anchors = np.array(anchors).reshape(-1, 2)
return anchors
#---------------------------------------#
# 对三个特征层解码
# 进行排序并进行非极大抑制
#---------------------------------------#
def boxes_and_scores(self, feats, anchors, classes_num, input_shape, image_shape):
"""
Introduction
------------
将预测出的box坐标转换为对应原图的坐标,然后计算每个box的分数
Parameters
----------
feats: yolo输出的feature map
anchors: anchor的位置
class_num: 类别数目
input_shape: 输入大小
image_shape: 图片大小
Returns
-------
boxes: 物体框的位置
boxes_scores: 物体框的分数,为置信度和类别概率的乘积
"""
# 获得特征
box_xy, box_wh, box_confidence, box_class_probs = self._get_feats(feats, anchors, classes_num, input_shape)
# 寻找在原图上的位置
boxes = self.correct_boxes(box_xy, box_wh, input_shape, image_shape)
boxes = tf.reshape(boxes, [-1, 4])
# 获得置信度box_confidence * box_class_probs
box_scores = box_confidence * box_class_probs
box_scores = tf.reshape(box_scores, [-1, classes_num])
return boxes, box_scores
# 获得在原图上框的位置
def correct_boxes(self, box_xy, box_wh, input_shape, image_shape):
"""
Introduction
------------
计算物体框预测坐标在原图中的位置坐标
Parameters
----------
box_xy: center coordinates of the box (normalized)
box_wh: 物体框的宽高
input_shape: 输入的大小
image_shape: 图片的大小
Returns
-------
boxes: 物体框的位置
"""
box_yx = box_xy[..., ::-1]
box_hw = box_wh[..., ::-1]
# 416,416
input_shape = tf.cast(input_shape, dtype = tf.float32)
# 实际图片的大小
image_shape = tf.cast(image_shape, dtype = tf.float32)
new_shape = tf.round(image_shape * tf.reduce_min(input_shape / image_shape))
offset = (input_shape - new_shape) / 2. / input_shape
scale = input_shape / new_shape
box_yx = (box_yx - offset) * scale
box_hw *= scale
box_mins = box_yx - (box_hw / 2.)
box_maxes = box_yx + (box_hw / 2.)
boxes = tf.concat([
box_mins[..., 0:1],
box_mins[..., 1:2],
box_maxes[..., 0:1],
box_maxes[..., 1:2]
], axis = -1)
boxes *= tf.concat([image_shape, image_shape], axis = -1)
return boxes
# 其实是解码的过程
def _get_feats(self, feats, anchors, num_classes, input_shape):
"""
Introduction
------------
根据yolo最后一层的输出确定bounding box
Parameters
----------
feats: yolo模型最后一层输出
anchors: anchors的位置
num_classes: 类别数量
input_shape: 输入大小
Returns
-------
box_xy, box_wh, box_confidence, box_class_probs
"""
num_anchors = len(anchors)
anchors_tensor = tf.reshape(tf.constant(anchors, dtype=tf.float32), [1, 1, 1, num_anchors, 2])
grid_size = tf.shape(feats)[1:3]
predictions = tf.reshape(feats, [-1, grid_size[0], grid_size[1], num_anchors, num_classes + 5])
# 这里构建13*13*1*2的矩阵,对应每个格子加上对应的坐标
grid_y = tf.tile(tf.reshape(tf.range(grid_size[0]), [-1, 1, 1, 1]), [1, grid_size[1], 1, 1])
grid_x = tf.tile(tf.reshape(tf.range(grid_size[1]), [1, -1, 1, 1]), [grid_size[0], 1, 1, 1])
grid = tf.concat([grid_x, grid_y], axis = -1)
grid = tf.cast(grid, tf.float32)
# 将x,y坐标归一化,相对网格的位置
box_xy = (tf.sigmoid(predictions[..., :2]) + grid) / tf.cast(grid_size[::-1], tf.float32)
# 将w,h也归一化
box_wh = tf.exp(predictions[..., 2:4]) * anchors_tensor / tf.cast(input_shape[::-1], tf.float32)
box_confidence = tf.sigmoid(predictions[..., 4:5])
box_class_probs = tf.sigmoid(predictions[..., 5:])
return box_xy, box_wh, box_confidence, box_class_probs
def eval(self, yolo_outputs, image_shape, max_boxes = 20):
"""
Introduction
------------
根据Yolo模型的输出进行非极大值抑制,获取最后的物体检测框和物体检测类别
Parameters
----------
yolo_outputs: yolo模型输出
image_shape: 图片的大小
max_boxes: 最大box数量
Returns
-------
boxes_: 物体框的位置
scores_: 物体类别的概率
classes_: 物体类别
"""
# 每一个特征层对应三个先验框
anchor_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
boxes = []
box_scores = []
# inputshape是416x416
# image_shape是实际图片的大小
input_shape = tf.shape(yolo_outputs[0])[1 : 3] * 32
# 对三个特征层的输出获取每个预测box坐标和box的分数,score = 置信度x类别概率
#---------------------------------------#
# 对三个特征层解码
# 获得分数和框的位置
#---------------------------------------#
for i in range(len(yolo_outputs)):
_boxes, _box_scores = self.boxes_and_scores(yolo_outputs[i], self.anchors[anchor_mask[i]], len(self.class_names), input_shape, image_shape)
boxes.append(_boxes)
box_scores.append(_box_scores)
# 放在一行里面便于操作
boxes = tf.concat(boxes, axis = 0)
box_scores = tf.concat(box_scores, axis = 0)
mask = box_scores >= self.obj_threshold
max_boxes_tensor = tf.constant(max_boxes, dtype = tf.int32)
boxes_ = []
scores_ = []
classes_ = []
#---------------------------------------#
# 1、取出每一类得分大于self.obj_threshold
# 的框和得分
# 2、对得分进行非极大抑制
#---------------------------------------#
# 对每一个类进行判断
for c in range(len(self.class_names)):
# 取出所有类为c的box
class_boxes = tf.boolean_mask(boxes, mask[:, c])
# 取出所有类为c的分数
class_box_scores = tf.boolean_mask(box_scores[:, c], mask[:, c])
# 非极大抑制
nms_index = tf.image.non_max_suppression(class_boxes, class_box_scores, max_boxes_tensor, iou_threshold = self.nms_threshold)
# 获取非极大抑制的结果
class_boxes = tf.gather(class_boxes, nms_index)
class_box_scores = tf.gather(class_box_scores, nms_index)
classes = tf.ones_like(class_box_scores, 'int32') * c
boxes_.append(class_boxes)
scores_.append(class_box_scores)
classes_.append(classes)
boxes_ = tf.concat(boxes_, axis = 0)
scores_ = tf.concat(scores_, axis = 0)
classes_ = tf.concat(classes_, axis = 0)
return boxes_, scores_, classes_
#---------------------------------------#
# predict用于预测,分三步
# 1、建立yolo对象
# 2、获得预测结果
# 3、对预测结果进行处理
#---------------------------------------#
def predict(self, inputs, image_shape):
"""
Introduction
------------
构建预测模型
Parameters
----------
inputs: 处理之后的输入图片
image_shape: 图像原始大小
Returns
-------
boxes: 物体框坐标
scores: 物体概率值
classes: 物体类别
"""
model = yolo(config.norm_epsilon, config.norm_decay, self.anchors_path, self.classes_path, pre_train = False)
# yolo_inference用于获得网络的预测结果
output = model.yolo_inference(inputs, config.num_anchors // 3, config.num_classes, training = False)
boxes, scores, classes = self.eval(output, image_shape, max_boxes = 20)
return boxes, scores, classes
yolo3_model.py defines the yolo3 network structure
# -*- coding:utf-8 -*-
import numpy as np
import tensorflow as tf
import os
class yolo:
def __init__(self, norm_epsilon, norm_decay, anchors_path, classes_path, pre_train):
"""
Introduction
------------
初始化函数
Parameters
----------
norm_decay: 在预测时计算moving average时的衰减率
norm_epsilon: 方差加上极小的数,防止除以0的情况
anchors_path: yolo anchor 文件路径
classes_path: 数据集类别对应文件
pre_train: 是否使用预训练darknet53模型
"""
self.norm_epsilon = norm_epsilon
self.norm_decay = norm_decay
self.anchors_path = anchors_path
self.classes_path = classes_path
self.pre_train = pre_train
self.anchors = self._get_anchors()
self.classes = self._get_class()
#---------------------------------------#
# 获取种类和先验框
#---------------------------------------#
def _get_class(self):
"""
Introduction
------------
获取类别名字
Returns
-------
class_names: coco数据集类别对应的名字
"""
classes_path = os.path.expanduser(self.classes_path)
with open(classes_path) as f:
class_names = f.readlines()
class_names = [c.strip() for c in class_names]
return class_names
def _get_anchors(self):
"""
Introduction
------------
获取anchors
"""
anchors_path = os.path.expanduser(self.anchors_path)
with open(anchors_path) as f:
anchors = f.readline()
anchors = [float(x) for x in anchors.split(',')]
return np.array(anchors).reshape(-1, 2)
#---------------------------------------#
# 用于生成层
#---------------------------------------#
# batch normalization followed by leaky ReLU
def _batch_normalization_layer(self, input_layer, name = None, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
'''
Introduction
------------
对卷积层提取的feature map使用batch normalization
Parameters
----------
input_layer: 输入的四维tensor
name: batchnorm层的名字
trainging: 是否为训练过程
norm_decay: 在预测时计算moving average时的衰减率
norm_epsilon: 方差加上极小的数,防止除以0的情况
Returns
-------
bn_layer: batch normalization处理之后的feature map
'''
bn_layer = tf.layers.batch_normalization(inputs = input_layer,
momentum = norm_decay, epsilon = norm_epsilon, center = True,
scale = True, training = training, name = name)
return tf.nn.leaky_relu(bn_layer, alpha = 0.1)
# 这个就是用来进行卷积的
def _conv2d_layer(self, inputs, filters_num, kernel_size, name, use_bias = False, strides = 1):
"""
Introduction
------------
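Uses tf.layers.conv2d, which takes care of initializing the weight and bias matrices and of adding the bias after the convolution
The convolution is meant to be followed by batch norm and then a leaky ReLU activation
When the stride is 2, the convolution downsamples the image
For example, with a 416*416 input, kernel size 3, stride 2 and one row/column of padding, (416 + 1 - 3) / 2 + 1 = 208, which acts like a pooling layer
So when stride is greater than 1, the input is padded first (one row on top and one column on the left, see _Residual_block) and 'VALID' padding is used instead of 'same'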
Parameters
----------
inputs: 输入变量
filters_num: 卷积核数量
strides: 卷积步长
name: 卷积层名字
trainging: 是否为训练过程
use_bias: 是否使用偏置项
kernel_size: 卷积核大小
Returns
-------
conv: 卷积之后的feature map
"""
conv = tf.layers.conv2d(
inputs = inputs, filters = filters_num,
kernel_size = kernel_size, strides = [strides, strides], kernel_initializer = tf.glorot_uniform_initializer(),
padding = ('SAME' if strides == 1 else 'VALID'), kernel_regularizer = tf.contrib.layers.l2_regularizer(scale = 5e-4), use_bias = use_bias, name = name)
return conv
# 这个用来进行残差卷积的
# 残差卷积就是进行一次3X3的卷积,然后保存该卷积layer
# 再进行一次1X1的卷积和一次3X3的卷积,并把这个结果加上layer作为最后的结果
def _Residual_block(self, inputs, filters_num, blocks_num, conv_index, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
"""
Introduction
------------
Darknet的残差block,类似resnet的两层卷积结构,分别采用1x1和3x3的卷积核,使用1x1是为了减少channel的维度
Parameters
----------
inputs: 输入变量
filters_num: 卷积核数量
trainging: 是否为训练过程
blocks_num: block的数量
conv_index: 为了方便加载预训练权重,统一命名序号
weights_dict: 加载预训练模型的权重
norm_decay: 在预测时计算moving average时的衰减率
norm_epsilon: 方差加上极小的数,防止除以0的情况
Returns
-------
inputs: 经过残差网络处理后的结果
"""
# 在输入feature map的长宽维度进行padding
inputs = tf.pad(inputs, paddings=[[0, 0], [1, 0], [1, 0], [0, 0]], mode='CONSTANT')
layer = self._conv2d_layer(inputs, filters_num, kernel_size = 3, strides = 2, name = "conv2d_" + str(conv_index))
layer = self._batch_normalization_layer(layer, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
for _ in range(blocks_num):
shortcut = layer
layer = self._conv2d_layer(layer, filters_num // 2, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
layer = self._batch_normalization_layer(layer, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
layer = self._conv2d_layer(layer, filters_num, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
layer = self._batch_normalization_layer(layer, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
layer += shortcut
return layer, conv_index
#---------------------------------------#
# 生成_darknet53
#---------------------------------------#
def _darknet53(self, inputs, conv_index, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
"""
Introduction
------------
构建yolo3使用的darknet53网络结构
Parameters
----------
inputs: 模型输入变量
conv_index: 卷积层数序号,方便根据名字加载预训练权重
weights_dict: 预训练权重
training: 是否为训练
norm_decay: 在预测时计算moving average时的衰减率
norm_epsilon: 方差加上极小的数,防止除以0的情况
Returns
-------
conv: 经过52层卷积计算之后的结果, 输入图片为416x416x3,则此时输出的结果shape为13x13x1024
route1: 返回第26层卷积计算结果52x52x256, 供后续使用
route2: 返回第43层卷积计算结果26x26x512, 供后续使用
conv_index: 卷积层计数,方便在加载预训练模型时使用
"""
with tf.variable_scope('darknet53'):
# 416,416,3 -> 416,416,32
conv = self._conv2d_layer(inputs, filters_num = 32, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
# 416,416,32 -> 208,208,64
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 64, blocks_num = 1, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# 208,208,64 -> 104,104,128
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 128, blocks_num = 2, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# 104,104,128 -> 52,52,256
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 256, blocks_num = 8, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# route1 = 52,52,256
route1 = conv
# 52,52,256 -> 26,26,512
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 512, blocks_num = 8, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# route2 = 26,26,512
route2 = conv
# 26,26,512 -> 13,13,1024
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 1024, blocks_num = 4, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# route3 = 13,13,1024
return route1, route2, conv, conv_index
# 输出两个网络结果
# 第一个是进行5次卷积后,用于下一次逆卷积的,卷积过程是1X1,3X3,1X1,3X3,1X1
# 第二个是进行5+2次卷积,作为一个特征层的,卷积过程是1X1,3X3,1X1,3X3,1X1,3X3,1X1
def _yolo_block(self, inputs, filters_num, out_filters, conv_index, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
"""
Introduction
------------
yolo3在Darknet53提取的特征层基础上,又加了针对3种不同比例的feature map的block,这样来提高对小物体的检测率
Parameters
----------
inputs: 输入特征
filters_num: 卷积核数量
out_filters: 最后输出层的卷积核数量
conv_index: 卷积层数序号,方便根据名字加载预训练权重
training: 是否为训练
norm_decay: 在预测时计算moving average时的衰减率
norm_epsilon: 方差加上极小的数,防止除以0的情况
Returns
-------
route: 返回最后一层卷积的前一层结果
conv: 返回最后一层卷积的结果
conv_index: conv层计数
"""
conv = self._conv2d_layer(inputs, filters_num = filters_num, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = filters_num * 2, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = filters_num, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = filters_num * 2, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = filters_num, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
route = conv
conv = self._conv2d_layer(conv, filters_num = filters_num * 2, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = out_filters, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index), use_bias = True)
conv_index += 1
return route, conv, conv_index
# 返回三个特征层的内容
def yolo_inference(self, inputs, num_anchors, num_classes, training = True):
"""
Introduction
------------
构建yolo模型结构
Parameters
----------
inputs: 模型的输入变量
num_anchors: 每个grid cell负责检测的anchor数量
num_classes: 类别数量
training: 是否为训练模式
"""
conv_index = 1
# route1 = 52,52,256、route2 = 26,26,512、route3 = 13,13,1024
conv2d_26, conv2d_43, conv, conv_index = self._darknet53(inputs, conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
with tf.variable_scope('yolo'):
#--------------------------------------#
# 获得第一个特征层
#--------------------------------------#
# conv2d_57 = 13,13,512,conv2d_59 = 13,13,255(3x(80+5))
conv2d_57, conv2d_59, conv_index = self._yolo_block(conv, 512, num_anchors * (num_classes + 5), conv_index = conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
#--------------------------------------#
# 获得第二个特征层
#--------------------------------------#
conv2d_60 = self._conv2d_layer(conv2d_57, filters_num = 256, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv2d_60 = self._batch_normalization_layer(conv2d_60, name = "batch_normalization_" + str(conv_index),training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
conv_index += 1
# unSample_0 = 26,26,256
# double the height (axis 1) and the width (axis 2)
unSample_0 = tf.image.resize_nearest_neighbor(conv2d_60, [2 * tf.shape(conv2d_60)[1], 2 * tf.shape(conv2d_60)[2]], name='upSample_0')
# route0 = 26,26,768
route0 = tf.concat([unSample_0, conv2d_43], axis = -1, name = 'route_0')
# conv2d_65 = 26,26,256, conv2d_67 = 26,26,255
conv2d_65, conv2d_67, conv_index = self._yolo_block(route0, 256, num_anchors * (num_classes + 5), conv_index = conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
#--------------------------------------#
# 获得第三个特征层
#--------------------------------------#
conv2d_68 = self._conv2d_layer(conv2d_65, filters_num = 128, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv2d_68 = self._batch_normalization_layer(conv2d_68, name = "batch_normalization_" + str(conv_index), training=training, norm_decay=self.norm_decay, norm_epsilon = self.norm_epsilon)
conv_index += 1
# unSample_1 = 52,52,128
# double the height (axis 1) and the width (axis 2)
unSample_1 = tf.image.resize_nearest_neighbor(conv2d_68, [2 * tf.shape(conv2d_68)[1], 2 * tf.shape(conv2d_68)[2]], name='upSample_1')
# route1= 52,52,384
route1 = tf.concat([unSample_1, conv2d_26], axis = -1, name = 'route_1')
# conv2d_75 = 52,52,255
_, conv2d_75, _ = self._yolo_block(route1, 128, num_anchors * (num_classes + 5), conv_index = conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
return [conv2d_59, conv2d_67, conv2d_75]
utils.py contains the helper utilities used by the code
import json
import numpy as np
import tensorflow as tf
from PIL import Image
from collections import defaultdict
def load_weights(var_list, weights_file):
    """
    Introduction
    ------------
        Load the pretrained darknet53 weight file
    Parameters
    ----------
        var_list: variables to assign
        weights_file: weight file
    Returns
    -------
        assign_ops: the assignment ops
    """
    with open(weights_file, "rb") as fp:
        # skip the 5-value int32 header
        _ = np.fromfile(fp, dtype=np.int32, count=5)
        weights = np.fromfile(fp, dtype=np.float32)

    ptr = 0
    i = 0
    assign_ops = []
    while i < len(var_list) - 1:
        var1 = var_list[i]
        var2 = var_list[i + 1]
        # do something only if we process conv layer
        if 'conv2d' in var1.name.split('/')[-2]:
            # check type of next layer
            if 'batch_normalization' in var2.name.split('/')[-2]:
                # load batch norm params
                gamma, beta, mean, var = var_list[i + 1:i + 5]
                batch_norm_vars = [beta, gamma, mean, var]
                for var in batch_norm_vars:
                    shape = var.shape.as_list()
                    num_params = np.prod(shape)
                    var_weights = weights[ptr:ptr + num_params].reshape(shape)
                    ptr += num_params
                    assign_ops.append(tf.assign(var, var_weights, validate_shape=True))
                # we move the pointer by 4, because we loaded 4 variables
                i += 4
            elif 'conv2d' in var2.name.split('/')[-2]:
                # load biases
                bias = var2
                bias_shape = bias.shape.as_list()
                bias_params = np.prod(bias_shape)
                bias_weights = weights[ptr:ptr + bias_params].reshape(bias_shape)
                ptr += bias_params
                assign_ops.append(tf.assign(bias, bias_weights, validate_shape=True))
                # we loaded 1 variable
                i += 1
            # we can load weights of conv layer
            shape = var1.shape.as_list()
            num_params = np.prod(shape)
            var_weights = weights[ptr:ptr + num_params].reshape((shape[3], shape[2], shape[0], shape[1]))
            # remember to transpose to column-major
            var_weights = np.transpose(var_weights, (2, 3, 1, 0))
            ptr += num_params
            assign_ops.append(tf.assign(var1, var_weights, validate_shape=True))
            i += 1

    return assign_ops
def letterbox_image(image, size):
    """
    Introduction
    ------------
        Resize the input image for prediction, keeping its aspect ratio and padding the remaining area
    Parameters
    ----------
        image: input image
        size: target size (w, h)
    Returns
    -------
        boxed_image: the resized and padded image
    """
    image_w, image_h = image.size
    w, h = size
    new_w = int(image_w * min(w*1.0/image_w, h*1.0/image_h))
    new_h = int(image_h * min(w*1.0/image_w, h*1.0/image_h))
    resized_image = image.resize((new_w,new_h), Image.BICUBIC)

    # gray canvas with the resized image pasted in the center
    boxed_image = Image.new('RGB', size, (128, 128, 128))
    boxed_image.paste(resized_image, ((w-new_w)//2,(h-new_h)//2))
    return boxed_image
def draw_box(image, bbox):
    """
    Introduction
    ------------
        Visualize the training data through tensorboard
    Parameters
    ----------
        image: training images
        bbox: coordinates of the labeled boxes in the training images
    """
    xmin, ymin, xmax, ymax, label = tf.split(value = bbox, num_or_size_splits = 5, axis=2)
    height = tf.cast(tf.shape(image)[1], tf.float32)
    width = tf.cast(tf.shape(image)[2], tf.float32)
    # draw_bounding_boxes expects normalized [ymin, xmin, ymax, xmax]
    new_bbox = tf.concat([tf.cast(ymin, tf.float32) / height, tf.cast(xmin, tf.float32) / width, tf.cast(ymax, tf.float32) / height, tf.cast(xmax, tf.float32) / width], 2)
    new_image = tf.image.draw_bounding_boxes(image, new_bbox)
    tf.summary.image('input', new_image)
def voc_ap(rec, prec):
    """
    --- Official matlab code VOC2012---
    mrec=[0 ; rec ; 1];
    mpre=[0 ; prec ; 0];
    for i=numel(mpre)-1:-1:1
        mpre(i)=max(mpre(i),mpre(i+1));
    end
    i=find(mrec(2:end)~=mrec(1:end-1))+1;
    ap=sum((mrec(i)-mrec(i-1)).*mpre(i));
    """
    rec.insert(0, 0.0)  # insert 0.0 at beginning of list
    rec.append(1.0)  # insert 1.0 at end of list
    mrec = rec[:]
    prec.insert(0, 0.0)  # insert 0.0 at beginning of list
    prec.append(0.0)  # insert 0.0 at end of list
    mpre = prec[:]
    # make the precision monotonically decreasing, from the end to the beginning
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # indexes where the recall changes
    i_list = []
    for i in range(1, len(mrec)):
        if mrec[i] != mrec[i - 1]:
            i_list.append(i)
    # area under the curve
    ap = 0.0
    for i in i_list:
        ap += ((mrec[i] - mrec[i - 1]) * mpre[i])
    return ap, mrec, mpre
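To see what voc_ap computes, here is a compact restatement with a hand-checkable input. It is a sketch for illustration: the name `average_precision` and the sample recall/precision values are made up, and unlike voc_ap it does not modify the lists passed in and returns only the AP.

```python
def average_precision(recall, precision):
    """Area under the precision envelope, following the VOC2012 recipe."""
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [0.0] + list(precision) + [0.0]
    # Replace each precision by the maximum precision to its right.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever the recall changes.
    return sum(
        (mrec[i] - mrec[i - 1]) * mpre[i]
        for i in range(1, len(mrec))
        if mrec[i] != mrec[i - 1]
    )


# Recall goes 0.5 -> 1.0 while precision drops 1.0 -> 0.5.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
```

The two rectangles are 0.5 × 1.0 and 0.5 × 0.5, so `ap` is 0.75.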
config.py holds the parameter configuration
num_parallel_calls = 4
input_shape = 416
max_boxes = 20
jitter = 0.3
hue = 0.1
sat = 1.0
cont = 0.8
bri = 0.1
norm_decay = 0.99
norm_epsilon = 1e-3
pre_train = True
num_anchors = 9
num_classes = 80
training = True
ignore_thresh = .5
learning_rate = 0.001
train_batch_size = 10
val_batch_size = 10
train_num = 2800
val_num = 5000
Epoch = 50
obj_threshold = 0.5
nms_threshold = 0.5
gpu_index = "0"
log_dir = './logs'
data_dir = './model_data'
model_dir = './test_model/model.ckpt-192192'
pre_train_yolo3 = True
yolo3_weights_path = './model_data/yolov3.weights'
darknet53_weights_path = './model_data/darknet53.weights'
anchors_path = './model_data/yolo_anchors.txt'
classes_path = './model_data/coco_classes.txt'
image_file = "./img/img.jpg"
detect.py is the project's entry point and contains the main preprocessing and detection flow
import os
import config
import argparse
import numpy as np
import tensorflow as tf
from yolo_predict import yolo_predictor
from PIL import Image, ImageFont, ImageDraw
from utils import letterbox_image, load_weights
# 指定使用GPU的Index
os.environ["CUDA_VISIBLE_DEVICES"] = config.gpu_index
def detect(image_path, model_path, yolo_weights = None):
"""
Introduction
------------
加载模型,进行预测
Parameters
----------
model_path: 模型路径,当使用yolo_weights无用
image_path: 图片路径
"""
#---------------------------------------#
# 图片预处理
#---------------------------------------#
image = Image.open(image_path)
# 对预测输入图像进行缩放,按照长宽比进行缩放,不足的地方进行填充
resize_image = letterbox_image(image, (416, 416))
image_data = np.array(resize_image, dtype = np.float32)
# 归一化
image_data /= 255.
# 转格式,第一维度填充
image_data = np.expand_dims(image_data, axis = 0)
#---------------------------------------#
# 图片输入
#---------------------------------------#
# input_image_shape原图的size
input_image_shape = tf.placeholder(dtype = tf.int32, shape = (2,))
# 图像
input_image = tf.placeholder(shape = [None, 416, 416, 3], dtype = tf.float32)
# 进入yolo_predictor进行预测,yolo_predictor是用于预测的一个对象
predictor = yolo_predictor(config.obj_threshold, config.nms_threshold, config.classes_path, config.anchors_path)
with tf.Session() as sess:
#---------------------------------------#
# 图片预测
#---------------------------------------#
if yolo_weights is not None:
with tf.variable_scope('predict'):
boxes, scores, classes = predictor.predict(input_image, input_image_shape)
# 载入模型
load_op = load_weights(tf.global_variables(scope = 'predict'), weights_file = yolo_weights)
sess.run(load_op)
# 进行预测
out_boxes, out_scores, out_classes = sess.run(
[boxes, scores, classes],
feed_dict={
# image_data这个resize过
input_image: image_data,
# 以y、x的方式传入
input_image_shape: [image.size[1], image.size[0]]
})
else:
boxes, scores, classes = predictor.predict(input_image, input_image_shape)
saver = tf.train.Saver()
saver.restore(sess, model_path)
out_boxes, out_scores, out_classes = sess.run(
[boxes, scores, classes],
feed_dict={
input_image: image_data,
input_image_shape: [image.size[1], image.size[0]]
})
#---------------------------------------#
# 画框
#---------------------------------------#
# 找到几个box,打印
print('Found {} boxes for {}'.format(len(out_boxes), 'img'))
font = ImageFont.truetype(font = 'font/FiraMono-Medium.otf', size = np.floor(3e-2 * image.size[1] + 0.5).astype('int32'))
# 厚度
thickness = (image.size[0] + image.size[1]) // 300
for i, c in reversed(list(enumerate(out_classes))):
# 获得预测名字,box和分数
predicted_class = predictor.class_names[c]
box = out_boxes[i]
score = out_scores[i]
# 打印
label = '{} {:.2f}'.format(predicted_class, score)
# 用于画框框和文字
draw = ImageDraw.Draw(image)
# textsize用于获得写字的时候,按照这个字体,要多大的框
label_size = draw.textsize(label, font)
# 获得四个边
top, left, bottom, right = box
top = max(0, np.floor(top + 0.5).astype('int32'))
left = max(0, np.floor(left + 0.5).astype('int32'))
bottom = min(image.size[1]-1, np.floor(bottom + 0.5).astype('int32'))
right = min(image.size[0]-1, np.floor(right + 0.5).astype('int32'))
print(label, (left, top), (right, bottom))
print(label_size)
if top - label_size[1] >= 0:
text_origin = np.array([left, top - label_size[1]])
else:
text_origin = np.array([left, top + 1])
# My kingdom for a good redistributable image drawing library.
for i in range(thickness):
draw.rectangle(
[left + i, top + i, right - i, bottom - i],
outline = predictor.colors[c])
draw.rectangle(
[tuple(text_origin), tuple(text_origin + label_size)],
fill = predictor.colors[c])
draw.text(text_origin, label, fill=(0, 0, 0), font=font)
del draw
image.show()
image.save('./img/result1.jpg')
if __name__ == '__main__':
    # use the weights shipped with yolo3
    if config.pre_train_yolo3:
        detect(config.image_file, config.model_dir, config.yolo3_weights_path)
    # use a trained model checkpoint
    else:
        detect(config.image_file, config.model_dir)
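detect.py letterboxes the image to 416x416 before prediction, and the predictor later maps the boxes back to the original image. The geometry behind that round trip can be sketched without any image library. The helper names `letterbox_geometry` and `to_original` and the 832x624 example size are assumptions made for this illustration; the project does the reverse mapping in `correct_boxes` with normalized coordinates.

```python
def letterbox_geometry(image_size, target_size):
    """Return the resized (w, h) and the top-left paste offset of a letterbox."""
    image_w, image_h = image_size
    w, h = target_size
    # One scale for both axes keeps the aspect ratio.
    scale = min(w / image_w, h / image_h)
    new_w, new_h = int(image_w * scale), int(image_h * scale)
    return (new_w, new_h), ((w - new_w) // 2, (h - new_h) // 2)


def to_original(point, image_size, target_size):
    """Map a pixel (x, y) of the letterboxed image back to the original image."""
    (new_w, new_h), (dx, dy) = letterbox_geometry(image_size, target_size)
    x, y = point
    # Remove the padding, then undo the scaling.
    return (x - dx) * image_size[0] / new_w, (y - dy) * image_size[1] / new_h


# An 832 x 624 image placed into the 416 x 416 network input.
resized, offset = letterbox_geometry((832, 624), (416, 416))
center = to_original((208, 208), (832, 624), (416, 416))
```

Here the image is scaled by 0.5 to 416x312 and pasted 52 pixels from the top, and the center of the network input maps back to the center of the original image.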
Note that the pretrained weight files, the COCO dataset, and the other resources used by the project can easily be found online.
python detect.py --image_file ./img.jpg
The test image:
Result: