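Having discussed the paper and the approach in the previous post (YOLO v3 in depth), we now turn to a code implementation. The original YOLO author's implementation is a C program; here I use a Keras + TensorFlow version taken from experiencor's git project keras-yolo3. I have added some comments, and the annotated project is at keras-yolo3 + comments. Corrections are welcome.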
Figure 1: Detecting a raccoon
Figure 2: Input -> output
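As Figure 2 above shows, for an input image of, say, 416*416*3, the network outputs 13*13*3 + 26*26*3 + 52*52*3 = 10647 predicted boxes. We want these predicted boxes to report as accurately as possible where objects are, which class each object belongs to, and where its bounding box lies.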
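The box count above can be checked with a few lines of Python. This is only a sketch: the strides 32, 16 and 8 are an assumption chosen so that a 416 input yields the 13, 26 and 52 grids mentioned in the text, and `count_predicted_boxes` is a hypothetical helper, not part of keras-yolo3.

```python
def count_predicted_boxes(net_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Total number of predicted boxes over all output scales."""
    total = 0
    for stride in strides:
        grid = net_size // stride  # 13, 26, 52 for a 416*416 input
        total += grid * grid * anchors_per_cell
    return total


print(count_predicted_boxes())  # 10647
```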
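When building the label y (a tensor of 10647 predicted boxes * (4+1+number of classes)), YOLO's design is as follows: for each object in the input image, the grid cell that contains the center of the object's actual bounding box (ground truth) is responsible for predicting that object. Because there are 3 scales and each grid cell has 3 prior boxes (anchors), one object center corresponds to 9 prior boxes. In the end, only the prior box with the largest IOU against the actual bounding box is made responsible for the object (its confidence = 1); none of the other prior boxes is responsible for it (confidence = 0). In the output vector of the chosen prior box, the box position is set to the object's actual bounding box and the entry for the object's class is set to 1.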
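The anchor selection described above can be sketched as follows. This assumes the comparison uses only width and height (both boxes aligned at a common origin), and the anchor list is an assumed example; the real values come from the project's configuration. `shape_iou` and `best_anchor` are hypothetical names.

```python
def shape_iou(w1, h1, w2, h2):
    """IOU of two boxes aligned at the same origin, so only width and height matter."""
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union


# Assumed anchors (width, height) in input-image pixels, 3 per scale.
ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]


def best_anchor(box_w, box_h, anchors=ANCHORS):
    """Index of the anchor with the largest IOU against the ground truth box."""
    ious = [shape_iou(box_w, box_h, aw, ah) for aw, ah in anchors]
    return max(range(len(anchors)), key=lambda i: ious[i])


# A 100*100 object is matched to the (116, 90) anchor.
print(best_anchor(100, 100))  # 6
```

Only the selected anchor gets confidence 1 in the label; the other 8 candidates keep confidence 0.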
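Figure 3: Bounding box prediction
For how the training samples are set up, see class BatchGenerator in generator.py.
For the loss computation, see call(self, x) in yolo.py.
The network structure is create_yolov3_model() in yolo.py.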
x = input_image, y_pred, y_true, true_boxes
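These are, respectively: the input image, the tensor output by YOLO, the label y (the tensor we expect YOLO to output), and all ground truth boxes in the input image.
loss = box center xy loss + box size wh loss + box confidence loss + object class loss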
def call(self, x):
# true_boxes corresponds to t_batch in BatchGenerator, shape=(batch, 1, 1, 1, max number of objects per image, 4 coordinates)
# y_true corresponds to yolo_1/yolo_2/yolo_3 in BatchGenerator, i.e. the tensor for one feature map
input_image, y_pred, y_true, true_boxes = x
# adjust the shape of the y_predict [batch, grid_h, grid_w, 3, 4+1+nb_class]
# shape=(batch, feature map height, feature map width, 3 anchors, 4 box coordinates + 1 confidence + number of object classes)
y_pred = tf.reshape(y_pred, tf.concat([tf.shape(y_pred)[:3], tf.constant([3, -1])], axis=0))
# initialize the masks
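# object_mask holds the confidence (objectness) of every predicted box on one feature map. It comes from the label y_true, so it is 0 everywhere except at the anchors responsible for detecting an object.
# shape = (batch, feature map height, feature map width, 3 anchors, 1 confidence)
# y_true[..., 4] extracts the box confidence (in the last dimension the first 4 entries are the box coordinates and the 5th is the confidence); expand_dims restores the trailing dimension that the indexing dropped.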
object_mask = tf.expand_dims(y_true[..., 4], 4)
# the variable to keep track of number of batches processed
batch_seen = tf.Variable(0.)
# compute grid factor and net factor
# height and width of the feature map
grid_h = tf.shape(y_true)[1]
grid_w = tf.shape(y_true)[2]
grid_factor = tf.reshape(tf.cast([grid_w, grid_h], tf.float32), [1,1,1,1,2])
# height and width of the input image
net_h = tf.shape(input_image)[1]
net_w = tf.shape(input_image)[2]
net_factor = tf.reshape(tf.cast([net_w, net_h], tf.float32), [1,1,1,1,2])
Adjust prediction
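# pred_box_xy is the center of the predicted box in feature map coordinates, where each grid cell is normalized to 1*1: = sigma(t_xy) + c_xy
pred_box_xy = (self.cell_grid[:,:grid_h,:grid_w,:,:] + tf.sigmoid(y_pred[..., :2])) # shape=(batch, feature map height, feature map width, 3 predicted boxes, 2 coordinates)
# pred_box_wh is the predicted t_w, t_h. Note: truth_wh = anchor_wh * exp(t_wh)
pred_box_wh = y_pred[..., 2:4] # shape=(batch, feature map height, feature map width, 3 predicted boxes, 2 values)
pred_box_conf = tf.expand_dims(tf.sigmoid(y_pred[..., 4]), 4) # shape=(batch, feature map height, feature map width, 3 predicted boxes, 1 confidence)
pred_box_class = y_pred[..., 5:] # shape=(batch, feature map height, feature map width, 3 predicted boxes, c object classes)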
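The decoding used in this step, sigma(t_xy) + c_xy for the center and anchor_wh * exp(t_wh) for the size, can be sketched for a single box in plain Python. `decode_box` and its parameter names are hypothetical; the normalization by the grid size and the input size mirrors what grid_factor and net_factor do later in the function.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_xy, t_wh, cell_xy, anchor_wh, grid_wh, net_wh):
    """Return (cx, cy, w, h), each normalized to [0, 1] relative to the input image."""
    # Center: offset inside the cell plus the cell index, divided by the grid size.
    cx = (sigmoid(t_xy[0]) + cell_xy[0]) / grid_wh[0]
    cy = (sigmoid(t_xy[1]) + cell_xy[1]) / grid_wh[1]
    # Size: anchor scaled by exp(t_wh), divided by the input image size.
    w = anchor_wh[0] * math.exp(t_wh[0]) / net_wh[0]
    h = anchor_wh[1] * math.exp(t_wh[1]) / net_wh[1]
    return cx, cy, w, h


# All-zero raw outputs in cell (6, 6) of a 13*13 grid give the image center
# and exactly the anchor size.
print(decode_box((0.0, 0.0), (0.0, 0.0), (6, 6), (116, 90), (13, 13), (416, 416)))
```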
Adjust ground truth
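# true_box_xy is the center of the actual box in feature map coordinates, = sigma(t_xy) + c_xy; see y_true
true_box_xy = y_true[..., 0:2] # shape=(batch, feature map height, feature map width, 3 predicted boxes, 2 coordinates)
# true_box_wh is the object's t_w, t_h. Note: truth_wh = anchor_wh * exp(t_wh)
true_box_wh = y_true[..., 2:4] # shape=(batch, feature map height, feature map width, 3 predicted boxes, 2 values)
true_box_conf = tf.expand_dims(y_true[..., 4], 4) # shape=(batch, feature map height, feature map width, 3 predicted boxes, 1 confidence)
true_box_class = tf.argmax(y_true[..., 5:], -1) # shape=(batch, feature map height, feature map width, 3 predicted boxes)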
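For the label side, the same relations run in reverse: the center is stored in feature map coordinates and the size as t_wh = log(truth_wh / anchor_wh). The sketch below assumes the ground truth box is given in input-image pixels; `encode_box` is a hypothetical helper, not the BatchGenerator code itself.

```python
import math


def encode_box(center_xy, box_wh, anchor_wh, grid_wh, net_wh):
    """Encode a ground truth box (in pixels) as (gx, gy, t_w, t_h)."""
    # Center in feature map coordinates, i.e. sigma(t_xy) + c_xy.
    gx = center_xy[0] / net_wh[0] * grid_wh[0]
    gy = center_xy[1] / net_wh[1] * grid_wh[1]
    # Size relative to the anchor, in log space.
    t_w = math.log(box_wh[0] / anchor_wh[0])
    t_h = math.log(box_wh[1] / anchor_wh[1])
    return gx, gy, t_w, t_h


gx, gy, t_w, t_h = encode_box((208, 208), (232, 45), (116, 90), (13, 13), (416, 416))
# exp(t_wh) * anchor_wh recovers the original width and height.
print(gx, gy, 116 * math.exp(t_w), 90 * math.exp(t_h))
```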
Compare each predicted box to all true boxes
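One feature map has width*height*3 anchors predicted boxes. YOLO's strategy is that, among the 3 anchors of the grid cell containing an object's center, the anchor with the largest IOU is responsible for predicting that object (its confidence = 1).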
# initially, drag all objectness of all boxes to 0
conf_delta = pred_box_conf - 0
# then, ignore the boxes which have good overlap with some true box
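# true_xy and true_wh are expressed with the original image's width and height normalized to 1*1
true_xy = true_boxes[..., 0:2] / grid_factor # shape=(batch, 1, 1, 1, max number (3) of objects per image, 2 xy coordinates); the xy in true_boxes are feature map coordinates, the same as the xy in y_true
true_wh = true_boxes[..., 2:4] / net_factor # shape=(batch, 1, 1, 1, max number (3) of objects per image, 2 wh values); the wh in true_boxes are the object's width and height in the original image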
true_wh_half = true_wh / 2.
true_mins = true_xy - true_wh_half
true_maxes = true_xy + true_wh_half
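pred_xy = tf.expand_dims(pred_box_xy / grid_factor, 4) # shape=(batch, feature map height, feature map width, 3 predicted boxes, 1, 2 coordinates)
pred_wh = tf.expand_dims(tf.exp(pred_box_wh) * self.anchors / net_factor, 4) # shape=(batch, feature map height, feature map width, 3 predicted boxes, 1, 2 values)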
pred_wh_half = pred_wh / 2.
pred_mins = pred_xy - pred_wh_half
pred_maxes = pred_xy + pred_wh_half
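intersect_mins = tf.maximum(pred_mins, true_mins) # shape=(batch, feature map height, feature map width, 3 predicted boxes, max number (3) of objects per image, 2 coordinates)
intersect_maxes = tf.minimum(pred_maxes, true_maxes) # shape=(batch, feature map height, feature map width, 3 predicted boxes, max number (3) of objects per image, 2 coordinates)
intersect_wh = tf.maximum(intersect_maxes - intersect_mins, 0.) # shape=(batch, feature map height, feature map width, 3 predicted boxes, max number (3) of objects per image, 2 values)
intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1] # shape=(batch, feature map height, feature map width, 3 predicted boxes, max number (3) of objects per image)
true_areas = true_wh[..., 0] * true_wh[..., 1] # shape=(batch, 1, 1, 1, max number (3) of objects per image)
pred_areas = pred_wh[..., 0] * pred_wh[..., 1] # shape=(batch, feature map height, feature map width, 3 predicted boxes, 1)
union_areas = pred_areas + true_areas - intersect_areas # shape=(batch, feature map height, feature map width, 3 predicted boxes, max number (3) of objects per image)
iou_scores = tf.truediv(intersect_areas, union_areas) # shape=(batch, feature map height, feature map width, 3 predicted boxes, max number (3) of objects per image)
# IOU between each predicted box and the actual object closest to it
best_ious = tf.reduce_max(iou_scores, axis=4) # shape=(batch, feature map height, feature map width, 3 predicted boxes)
# only predicted boxes whose IOU is below the threshold contribute their (background) confidence loss
conf_delta *= tf.expand_dims(tf.to_float(best_ious < self.ignore_thresh), 4) # shape=(batch, feature map height, feature map width, 3 predicted boxes, 1 confidence)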
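The min/max arithmetic in this step is an IOU between boxes in center format, followed by a threshold test. Here is a minimal single-box sketch in plain Python; `iou_center` and `noobj_conf_delta` are hypothetical names, and the default threshold of 0.5 is an assumed value standing in for self.ignore_thresh.

```python
def iou_center(box_a, box_b):
    """IOU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Clamp at 0 so that disjoint boxes get an empty intersection.
    inter_w = max(min(ax2, bx2) - max(ax1, bx1), 0.0)
    inter_h = max(min(ay2, by2) - max(ay1, by1), 0.0)
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union


def noobj_conf_delta(pred_conf, pred_box, true_boxes, ignore_thresh=0.5):
    """Background confidence error, zeroed when the box overlaps some true box well."""
    best_iou = max((iou_center(pred_box, t) for t in true_boxes), default=0.0)
    return pred_conf if best_iou < ignore_thresh else 0.0


print(iou_center((2, 2, 2, 2), (3, 2, 2, 2)))                   # one third
print(noobj_conf_delta(0.8, (2, 2, 2, 2), [(10, 10, 1, 1)]))    # 0.8, counted as background
print(noobj_conf_delta(0.8, (2, 2, 2, 2), [(2, 2, 2, 2)]))      # 0.0, ignored
```

The point of the threshold is that a box which overlaps a real object well, but is not the anchor assigned to it, is not pushed toward confidence 0.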
Compute some online statistics
true_xy = true_box_xy / grid_factor
true_wh = tf.exp(true_box_wh) * self.anchors / net_factor
true_wh_half = true_wh / 2.
true_mins = true_xy - true_wh_half
true_maxes = true_xy + true_wh_half
pred_xy = pred_box_xy / grid_factor
pred_wh = tf.exp(pred_box_wh) * self.anchors / net_factor
pred_wh_half = pred_wh / 2.
pred_mins = pred_xy - pred_wh_half
pred_maxes = pred_xy + pred_wh_half
intersect_mins = tf.maximum(pred_mins, true_mins)
intersect_maxes = tf.minimum(pred_maxes, true_maxes)
intersect_wh = tf.maximum(intersect_maxes - intersect_mins, 0.)
intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]
true_areas = true_wh[..., 0] * true_wh[..., 1]
pred_areas = pred_wh[..., 0] * pred_wh[..., 1]
union_areas = pred_areas + true_areas - intersect_areas
iou_scores = tf.truediv(intersect_areas, union_areas)
iou_scores = object_mask * tf.expand_dims(iou_scores, 4)
count = tf.reduce_sum(object_mask)
count_noobj = tf.reduce_sum(1 - object_mask)
detect_mask = tf.to_float((pred_box_conf*object_mask) >= 0.5)
class_mask = tf.expand_dims(tf.to_float(tf.equal(tf.argmax(pred_box_class, -1), true_box_class)), 4)
recall50 = tf.reduce_sum(tf.to_float(iou_scores >= 0.5 ) * detect_mask * class_mask) / (count + 1e-3)
recall75 = tf.reduce_sum(tf.to_float(iou_scores >= 0.75) * detect_mask * class_mask) / (count + 1e-3)
avg_iou = tf.reduce_sum(iou_scores) / (count + 1e-3)
avg_obj = tf.reduce_sum(pred_box_conf * object_mask) / (count + 1e-3)
avg_noobj = tf.reduce_sum(pred_box_conf * (1-object_mask)) / (count_noobj + 1e-3)
avg_cat = tf.reduce_sum(object_mask * class_mask) / (count + 1e-3)
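As a rough illustration of these statistics, the sketch below computes recall50 and avg_iou from a plain list instead of tensors. It assumes each entry describes one anchor whose object_mask is 1; `online_stats` is a hypothetical helper, and eps plays the role of the 1e-3 in the code above.

```python
def online_stats(object_boxes, eps=1e-3):
    """object_boxes: list of (iou, confidence, class_correct) for anchors with object_mask = 1."""
    count = len(object_boxes)
    # A hit needs IOU >= 0.5, confidence >= 0.5 and the correct class.
    hits50 = sum(1 for iou, conf, ok in object_boxes
                 if iou >= 0.5 and conf >= 0.5 and ok)
    recall50 = hits50 / (count + eps)
    avg_iou = sum(iou for iou, _, _ in object_boxes) / (count + eps)
    return recall50, avg_iou


boxes = [(0.8, 0.9, True), (0.6, 0.4, True), (0.3, 0.9, True), (0.9, 0.7, False)]
print(online_stats(boxes))  # only the first box counts as a hit
```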
Warm-up training
batch_seen = tf.assign_add(batch_seen, 1.)
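true_box_xy, true_box_wh, xywh_mask = tf.cond(tf.less(batch_seen, self.warmup_batches+1),
# Following the design introduced in YOLOv2, the first self.warmup_batches batches compute the error between the predicted boxes and the prior boxes, not the error against the actual object boxes.
# However, the code here appears to have a problem.
lambda: [true_box_xy + (0.5 + self.cell_grid[:,:grid_h,:grid_w,:,:]) * (1-object_mask),
true_box_wh + tf.zeros_like(true_box_wh) * (1-object_mask), # zeros_like makes the second term 0, so this is still just true_box_wh; it needs to be changed
tf.ones_like(object_mask)], # the position of every predicted box counts toward the loss
# later batches get no special treatment
lambda: [true_box_xy,
true_box_wh,
object_mask])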
Compare each true box to all anchor boxes
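# Note: exp(true_box_wh) = exp(t_wh) = truth_wh / anchor_wh
# exp(true_box_wh) * self.anchors / net_factor = truth_wh / anchor_wh * self.anchors / net_factor = truth_wh / net_factor
# wh_scale is the size of the actual object relative to the input image.
wh_scale = tf.exp(true_box_wh) * self.anchors / net_factor # shape=(batch, feature map height, feature map width, 3 anchors, 2 values)
# wh_scale is negatively correlated with the area of the actual box, which makes the loss more sensitive to box errors on small objects: the smaller the box, the bigger the scale
wh_scale = tf.expand_dims(2 - wh_scale[..., 0] * wh_scale[..., 1], axis=4)
# Normally (after warmup_batches) xywh_mask = object_mask, i.e. only predicted boxes that contain an object (whose position, confidence and class are meaningful) contribute position loss.
# For predicted boxes without an object, the confidence is meaningful (although conf_delta has already filtered out boxes whose IOU exceeds the threshold) and counts toward the loss, while the position and class are meaningless and do not.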
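A quick numeric check makes the effect of 2 - w*h concrete. The sketch assumes a 416*416 input and box sizes in pixels; the function name `wh_scale` is reused here only for illustration.

```python
def wh_scale(box_w, box_h, net_w=416, net_h=416):
    """Weight applied to the box error: 2 minus the normalized box area."""
    return 2.0 - (box_w / net_w) * (box_h / net_h)


print(wh_scale(41.6, 41.6))  # close to 1.99 for a small box
print(wh_scale(416, 416))    # 1.0 for a box that fills the image
```

The weight therefore ranges from 1 (box covering the whole image) to just under 2 (very small box).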
xy_delta = xywh_mask * (pred_box_xy-true_box_xy) * wh_scale * self.xywh_scale # shape=(batch, feature map height, feature map width, 3 predicted boxes, 2 positions)
wh_delta = xywh_mask * (pred_box_wh-true_box_wh) * wh_scale * self.xywh_scale # shape=(batch, feature map height, feature map width, 3 predicted boxes, 2 sizes)
# shape=(batch, feature map height, feature map width, 3 predicted boxes, 1 confidence); the first term is the confidence error for detected objects, the second term the confidence error for background
conf_delta = object_mask * (pred_box_conf-true_box_conf) * self.obj_scale + (1-object_mask) * conf_delta * self.noobj_scale
# shape=(batch, feature map height, feature map width, 3 predicted boxes, 1 cross-entropy)
class_delta = object_mask * \
tf.expand_dims(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=true_box_class, logits=pred_box_class), 4) * \
self.class_scale
# shape=(batch_size,)
loss_xy = tf.reduce_sum(tf.square(xy_delta), list(range(1,5)))
loss_wh = tf.reduce_sum(tf.square(wh_delta), list(range(1,5)))
loss_conf = tf.reduce_sum(tf.square(conf_delta), list(range(1,5)))
loss_class = tf.reduce_sum(class_delta, list(range(1,5)))
loss = loss_xy + loss_wh + loss_conf + loss_class
loss = tf.Print(loss, [grid_h, avg_obj], message='avg_obj \t\t', summarize=1000)
loss = tf.Print(loss, [grid_h, avg_noobj], message='avg_noobj \t\t', summarize=1000)
loss = tf.Print(loss, [grid_h, avg_iou], message='avg_iou \t\t', summarize=1000)
loss = tf.Print(loss, [grid_h, avg_cat], message='avg_cat \t\t', summarize=1000)
loss = tf.Print(loss, [grid_h, recall50], message='recall50 \t\t', summarize=1000)
loss = tf.Print(loss, [grid_h, recall75], message='recall75 \t\t', summarize=1000)
loss = tf.Print(loss, [grid_h, count], message='count \t\t\t', summarize=1000)
loss = tf.Print(loss, [grid_h, tf.reduce_sum(loss_xy),
tf.reduce_sum(loss_wh),
tf.reduce_sum(loss_conf),
tf.reduce_sum(loss_class)], message='loss xy, wh, conf, class: \t', summarize=1000)
# the shape of loss is (batch_size,)
return loss*self.grid_scale
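To summarize how the four terms combine, here is a sketch of the loss for one predicted box in plain Python: squared error for xy, wh and confidence, plus softmax cross-entropy for the class. The helper names and the `class_scale` parameter are assumptions for illustration; the deltas are taken as already masked and scaled, as in the code above.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def box_loss(xy_delta, wh_delta, conf_delta, class_logits, class_label, class_scale=1.0):
    """Loss contribution of one box responsible for an object."""
    loss_xy = sum(d * d for d in xy_delta)
    loss_wh = sum(d * d for d in wh_delta)
    loss_conf = conf_delta * conf_delta
    loss_class = class_scale * softmax_cross_entropy(class_logits, class_label)
    return loss_xy + loss_wh + loss_conf + loss_class


# 0.01 + 0.04 + 0.25 + ln(2) for two equally likely classes
print(box_loss((0.1, 0.2), (0.0, 0.0), 0.5, [0.0, 0.0], 1))
```

In the real function these per-box values are summed over the feature map and the 3 anchors, giving one loss value per sample in the batch.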