Paper Reading: (AAAI 2020) Scene Text Detection with DBNet, with the Corresponding PaddleOCR Source Code

Contents

        • Introduction
        • DBNet Overview
        • Corresponding source code in PaddleOCR (mv3_det as an example)
          • Backbone
          • Neck
          • Head
          • Loss
          • Post-processing
        • Summary

Introduction

  • DBNet is a bottom-up detection algorithm. Bottom-up algorithms largely follow the traditional text detection pipeline: a CNN first detects basic text components, and some post-processing then groups those components into complete text instances.
  • Bottom-up detection algorithms can be further divided into segmentation-based methods and text-component-level methods; DBNet belongs to the segmentation-based category.
  • DBNet remains one of the stronger algorithms today: PaddleOCR, for instance, recommends it as the text detection algorithm for industrial use, even though more accurate models exist.

DBNet Overview

[Figure 1: overall DBNet architecture]

  • Because DBNet approaches text detection through segmentation, it can detect text of arbitrary shapes fairly easily, at the cost of more complex post-processing.
  • Following the source code, the network splits into three parts: backbone, neck and head. This article uses MobileNetV3-Large as the backbone, adds an FPN structure (similar to U-Net) as the neck, and finishes with a few convolution layers as the head, as sketched below.
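
Before going file by file, here is a minimal sketch of how the three modules are chained; the class names match the source files quoted below, but the actual wiring lives in PaddleOCR's architecture builder, so treat this as illustrative rather than verbatim:

import paddle

# minimal sketch: chain backbone -> neck -> head with the mv3_det settings
backbone = MobileNetV3(in_channels=3, model_name='large', scale=0.5)
neck = DBFPN(in_channels=backbone.out_channels, out_channels=96)
head = DBHead(in_channels=neck.out_channels, k=50)

images = paddle.rand([1, 3, 960, 960])
feats = backbone(images)   # list of feature maps [c2, c3, c4, c5]
fuse = neck(feats)         # [1, 96, 240, 240]
out = head(fuse)           # {'maps': ...}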

Corresponding source code in PaddleOCR (mv3_det as an example)

  • Based on the PaddleOCR release/2.0 source code
Backbone
  • This part builds the main MobileNetV3 structure. Note that since an FPN follows, the backbone here, unlike the original network, exposes the outputs of four intermediate branches for the FPN to consume.
# ppocr/modeling/backbones/det_mobilenet_v3.py
# (excerpt from __init__: i, inplanes, block_list and scale are initialized earlier in the file)
for (k, exp, c, se, nl, s) in cfg:
  se = se and not self.disable_se
  start_idx = 2 if model_name == 'large' else 0
  if s == 2 and i > start_idx:
      self.out_channels.append(inplanes)
      self.stages.append(nn.Sequential(*block_list))
      block_list = []
  # append the blocks that make up the current stage
  block_list.append(
      ResidualUnit(
          in_channels=inplanes,
          mid_channels=make_divisible(scale * exp),
          out_channels=make_divisible(scale * c),
          kernel_size=k,
          stride=s,
          use_se=se,
          act=nl,
          name="conv" + str(i + 2)))
  inplanes = make_divisible(scale * c)
  i += 1
block_list.append(
  ConvBNLayer(
      in_channels=inplanes,
      out_channels=make_divisible(scale * cls_ch_squeeze),
      kernel_size=1,
      stride=1,
      padding=0,
      groups=1,
      if_act=True,
      act='hardswish',
      name='conv_last'))
self.stages.append(nn.Sequential(*block_list))
# ====================
# The code above collects, stage by stage, the outputs of the layers marked below
 if model_name == "large":
     cfg = [
          # k, exp, c,  se,     nl,  s,
          [3, 16, 16, False, 'relu', 1],
          [3, 64, 24, False, 'relu', 2],
          [3, 72, 24, False, 'relu', 1],  # output taken from this layer: c2
          [5, 72, 40, True, 'relu', 2],
          [5, 120, 40, True, 'relu', 1],
          [5, 120, 40, True, 'relu', 1],  # output taken from this layer: c3
          [3, 240, 80, False, 'hardswish', 2],
          [3, 200, 80, False, 'hardswish', 1],
          [3, 184, 80, False, 'hardswish', 1],
          [3, 184, 80, False, 'hardswish', 1],
          [3, 480, 112, True, 'hardswish', 1],
          [3, 672, 112, True, 'hardswish', 1],  # output taken from this layer: c4
          [5, 672, 160, True, 'hardswish', 2],
          [5, 960, 160, True, 'hardswish', 1],
          [5, 960, 160, True, 'hardswish', 1],  # output taken from this layer: c5
      ]
  • The outputs c2, c3, c4, c5 collected above are returned as a list and fed into the FPN that follows (see the abridged forward pass below).
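
  • Abridged from the same file, the forward pass runs the stem convolution and then each stage in turn, collecting every stage output:

def forward(self, x):
    x = self.conv(x)             # stem convolution
    out_list = []
    for stage in self.stages:
        x = stage(x)
        out_list.append(x)       # collects c2, c3, c4, c5 in order
    return out_list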
Neck
  • This part implements the FPN structure.
# ppocr/modeling/necks/db_fpn.py
# (paddle and paddle.nn.functional as F are imported at the top of the file)
# input image size: 960x960
def forward(self, x):
   c2, c3, c4, c5 = x
   # c2 shape: [1, 16, 240, 240]
   # c3 shape: [1, 24, 120, 120]
   # c4 shape: [1, 56, 60, 60]
   # c5 shape: [1, 480, 30, 30]

   in5 = self.in5_conv(c5)
   in4 = self.in4_conv(c4)
   in3 = self.in3_conv(c3)
   in2 = self.in2_conv(c2)
   # in5 shape: [1, 96, 30, 30]
   # in4 shape: [1, 96, 60, 60]
   # in3 shape: [1, 96, 120, 120]
   # in2 shape: [1, 96, 240, 240]

   out4 = in4 + F.upsample(in5, scale_factor=2, mode="nearest", align_mode=1)  # 1/16
   out3 = in3 + F.upsample(out4, scale_factor=2, mode="nearest", align_mode=1)  # 1/8
   out2 = in2 + F.upsample(out3, scale_factor=2, mode="nearest", align_mode=1)  # 1/4
   # out4 shape: [1, 96, 60, 60]
   # out3 shape: [1, 96, 120, 120]
   # out2 shape: [1, 96, 240, 240]

   p5 = self.p5_conv(in5)
   p4 = self.p4_conv(out4)
   p3 = self.p3_conv(out3)
   p2 = self.p2_conv(out2)
   # p5 shape: [1, 24, 30, 30]
   # p4 shape: [1, 24, 60, 60]
   # p3 shape: [1, 24, 120, 120]
   # p2 shape: [1, 24, 240, 240]
   
   p5 = F.upsample(p5, scale_factor=8, mode="nearest", align_mode=1)
   p4 = F.upsample(p4, scale_factor=4, mode="nearest", align_mode=1)
   p3 = F.upsample(p3, scale_factor=2, mode="nearest", align_mode=1)
   # p5 shape: [1, 24, 240, 240]
   # p4 shape: [1, 24, 240, 240]
   # p3 shape: [1, 24, 240, 240]

   fuse = paddle.concat([p5, p4, p3, p2], axis=1)
   # fuse shape:  [1, 96, 240, 240]
   return fuse
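
  • For reference, the inX_conv lateral layers above are 1×1 convolutions that project each backbone output to 96 channels, and the pX_conv layers are 3×3 convolutions that reduce each level to 96 // 4 = 24 channels before the concat. A rough sketch of their declarations in __init__ (weight initialization details omitted, so indicative rather than verbatim):

# sketch of the DBFPN convolutions; in_channels = [16, 24, 56, 480], out_channels = 96
self.in5_conv = nn.Conv2D(in_channels[3], out_channels, kernel_size=1, bias_attr=False)
self.in4_conv = nn.Conv2D(in_channels[2], out_channels, kernel_size=1, bias_attr=False)
self.in3_conv = nn.Conv2D(in_channels[1], out_channels, kernel_size=1, bias_attr=False)
self.in2_conv = nn.Conv2D(in_channels[0], out_channels, kernel_size=1, bias_attr=False)
self.p5_conv = nn.Conv2D(out_channels, out_channels // 4, kernel_size=3, padding=1, bias_attr=False)
# p4_conv / p3_conv / p2_conv are declared the same way as p5_conv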
Head
  • This part corresponds to the paper: the probability map and the threshold map are combined to produce the approximate binary map.
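  • Concretely, step_function in the code below implements the paper's differentiable binarization, a steep sigmoid with amplification factor $k = 50$:
    $\hat{B}_{i,j} = \frac{1}{1 + e^{-k(P_{i,j} - T_{i,j})}}$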
# ppocr/modeling/heads/det_db_head.py
import paddle
from paddle import nn


class DBHead(nn.Layer):
   """
   Differentiable Binarization (DB) for text detection:
       see https://arxiv.org/abs/1911.08947
   args:
       params(dict): super parameters for build DB network
   """

   def __init__(self, in_channels, k=50, **kwargs):
       super(DBHead, self).__init__()
       self.k = k 
        # Where these layer names come from is unclear for now; they are not referenced again later
       binarize_name_list = [
           'conv2d_56', 'batch_norm_47', 'conv2d_transpose_0', 'batch_norm_48',
           'conv2d_transpose_1', 'binarize'
       ]
       thresh_name_list = [
           'conv2d_57', 'batch_norm_49', 'conv2d_transpose_2', 'batch_norm_50',
           'conv2d_transpose_3', 'thresh'
       ]
       self.binarize = Head(in_channels, binarize_name_list)
       self.thresh = Head(in_channels, thresh_name_list)

   def step_function(self, x, y):
       return paddle.reciprocal(1 + paddle.exp(-self.k * (x - y)))

   def forward(self, x):
        # probability map
       shrink_maps = self.binarize(x)
       if not self.training:
           return {'maps': shrink_maps}
       
        # threshold map
       threshold_maps = self.thresh(x)
       
        # approximate binary map
        # computed exactly as the paper describes
       binary_maps = self.step_function(shrink_maps, threshold_maps)
       
       y = paddle.concat([shrink_maps, threshold_maps, binary_maps], axis=1)
       # x shape: [1, 96, 240, 240]
       # shrink_maps shape: [1, 1, 960, 960]
       # threshold_maps shape: [1, 1, 960, 960]
       # binary_maps shape: [1, 1, 960, 960]
       # y shape: [1, 3, 960, 960]
       return {'maps': y}
Loss
  • The loss is defined as:
    $L = L_{s} + \alpha \times L_{b} + \beta \times L_{t}$
  • In the paper, the probability map loss $L_{s}$ and the approximate binary map loss $L_{b}$ both use BCE loss with hard negative mining at a 1:3 positive-to-negative ratio, while the threshold map loss $L_{t}$ uses L1 loss.
  • In the PaddleOCR source, however, only the probability map loss applies the 1:3 sampling strategy (and note that despite the attribute name bce_loss, the default main_loss_type passed to BalanceLoss below is 'DiceLoss').
  • The paper's eq. (7) uses $\alpha = 1.0$ and $\beta = 10$, while the source uses $\alpha = 5$ and $\beta = 10$; as the code shows, the weights are applied so that the total is effectively $\alpha \times L_{s} + L_{b} + \beta \times L_{t}$.
# ppocr/losses/det_db_loss.py
from paddle import nn

from .det_basic_loss import BalanceLoss, MaskL1Loss, DiceLoss


class DBLoss(nn.Layer):
   """
   Differentiable Binarization (DB) Loss Function
   args:
        param (dict): the super parameter for DB Loss
   """

   def __init__(self,
                balance_loss=True,
                main_loss_type='DiceLoss',
                alpha=5,
                beta=10,
                ohem_ratio=3,
                eps=1e-6,
                **kwargs):
       super(DBLoss, self).__init__()
       self.alpha = alpha
       self.beta = beta
       self.dice_loss = DiceLoss(eps=eps)
       self.l1_loss = MaskL1Loss(eps=eps)
       self.bce_loss = BalanceLoss(
           balance_loss=balance_loss,
           main_loss_type=main_loss_type,
           negative_ratio=ohem_ratio)

   def forward(self, predicts, labels):
       predict_maps = predicts['maps']
       label_threshold_map, label_threshold_mask, label_shrink_map, label_shrink_mask = labels[
           1:]
       shrink_maps = predict_maps[:, 0, :, :]
       threshold_maps = predict_maps[:, 1, :, :]
       binary_maps = predict_maps[:, 2, :, :]

       loss_shrink_maps = self.bce_loss(shrink_maps, label_shrink_map,
                                        label_shrink_mask)
       loss_threshold_maps = self.l1_loss(threshold_maps, label_threshold_map,
                                          label_threshold_mask)
       loss_binary_maps = self.dice_loss(binary_maps, label_shrink_map,
                                         label_shrink_mask)
       loss_shrink_maps = self.alpha * loss_shrink_maps
       loss_threshold_maps = self.beta * loss_threshold_maps

       loss_all = loss_shrink_maps + loss_threshold_maps \
                  + loss_binary_maps
       losses = {'loss': loss_all, \
                 "loss_shrink_maps": loss_shrink_maps, \
                 "loss_threshold_maps": loss_threshold_maps, \
                 "loss_binary_maps": loss_binary_maps}
       return losses
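
  • A hypothetical smoke test, mainly to make the expected labels layout explicit (index 0 is the image and is skipped by the loss; indices 1-4 are the threshold map/mask and shrink map/mask):

import paddle

# hypothetical smoke test for DBLoss; shapes follow the head output above
loss_fn = DBLoss(alpha=5, beta=10, ohem_ratio=3)
preds = {'maps': paddle.rand([1, 3, 960, 960])}
labels = [
    None,                                                   # image, unused by the loss
    paddle.rand([1, 960, 960]),                             # label_threshold_map
    paddle.ones([1, 960, 960]),                             # label_threshold_mask
    paddle.randint(0, 2, [1, 960, 960]).astype('float32'),  # label_shrink_map
    paddle.ones([1, 960, 960]),                             # label_shrink_mask
]
print(loss_fn(preds, labels)['loss'])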
Post-processing
  • This part uses classical image processing with OpenCV to turn the segmentation output into the final boxes.
  • The code takes some time to work through; publishing this first, updates to follow.
# ppocr/postprocess/db_postprocess.py
import cv2
import numpy as np
import paddle
import pyclipper
from shapely.geometry import Polygon


class DBPostProcess(object):
    """
    The post process for Differentiable Binarization (DB).
    """

    def __init__(self,
                 thresh=0.3,
                 box_thresh=0.7,
                 max_candidates=1000,
                 unclip_ratio=2.0,
                 use_dilation=False,
                 **kwargs):
        self.thresh = thresh
        self.box_thresh = box_thresh
        self.max_candidates = max_candidates
        self.unclip_ratio = unclip_ratio
        self.min_size = 3
        self.dilation_kernel = None if not use_dilation else np.array(
            [[1, 1], [1, 1]])

    def boxes_from_bitmap(self, pred, _bitmap, dest_width, dest_height):
        '''
        _bitmap: single map with shape (H, W),
                whose values are binarized as {0, 1}
        '''

        bitmap = _bitmap
        height, width = bitmap.shape

        # cv2.findContours returns (img, contours, hierarchy) in OpenCV 3.x
        # and (contours, hierarchy) in OpenCV 4.x; handle both
        outs = cv2.findContours((bitmap * 255).astype(np.uint8),
                                cv2.RETR_LIST,
                                cv2.CHAIN_APPROX_SIMPLE)
        if len(outs) == 3:
            img, contours, _ = outs[0], outs[1], outs[2]
        elif len(outs) == 2:
            contours, _ = outs[0], outs[1]

        num_contours = min(len(contours), self.max_candidates)

        boxes = []
        scores = []
        for index in range(num_contours):
            contour = contours[index]
            points, sside = self.get_mini_boxes(contour)
            if sside < self.min_size:
                continue
            points = np.array(points)
            score = self.box_score_fast(pred, points.reshape(-1, 2))
            if self.box_thresh > score:
                continue

            box = self.unclip(points).reshape(-1, 1, 2)
            box, sside = self.get_mini_boxes(box)
            if sside < self.min_size + 2:
                continue
            box = np.array(box)

            box[:, 0] = np.clip(
                np.round(box[:, 0] / width * dest_width), 0, dest_width)
            box[:, 1] = np.clip(
                np.round(box[:, 1] / height * dest_height), 0, dest_height)
            boxes.append(box.astype(np.int16))
            scores.append(score)
        return np.array(boxes, dtype=np.int16), scores

    def unclip(self, box):
        unclip_ratio = self.unclip_ratio
        poly = Polygon(box)
        distance = poly.area * unclip_ratio / poly.length
        offset = pyclipper.PyclipperOffset()
        offset.AddPath(box, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        expanded = np.array(offset.Execute(distance))
        return expanded
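
    # Worked example for unclip(): the offset distance is D = A * r / L'
    # (A: polygon area, L': perimeter, r: unclip_ratio), mirroring the box
    # dilation described in the paper. For a 10x10 square, A = 100 and
    # L' = 40, so r = 2.0 gives D = 100 * 2.0 / 40 = 5.0: every edge is
    # pushed outward by 5 px before get_mini_boxes re-fits a rectangle.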

    def get_mini_boxes(self, contour):
        bounding_box = cv2.minAreaRect(contour)
        points = sorted(list(cv2.boxPoints(bounding_box)),
                        key=lambda x: x[0])

        # boxPoints returns the four corners in arbitrary rotational order;
        # the sort above puts the two left-most corners first, and the
        # y-comparisons below arrange everything as
        # [top-left, top-right, bottom-right, bottom-left] (clockwise)
        index_1, index_2, index_3, index_4 = 0, 1, 2, 3
        if points[1][1] > points[0][1]:
            index_1 = 0
            index_4 = 1
        else:
            index_1 = 1
            index_4 = 0

        if points[3][1] > points[2][1]:
            index_2 = 2
            index_3 = 3
        else:
            index_2 = 3
            index_3 = 2

        box = [
            points[index_1], points[index_2], points[index_3], points[index_4]
        ]
        return box, min(bounding_box[1])

    def box_score_fast(self, bitmap, _box):
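        # Mean value of the probability map inside the (filled) polygon,
        # computed on a crop of the polygon's bounding rectangle for speed.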
        h, w = bitmap.shape[:2]
        box = _box.copy()
        xmin = np.clip(np.floor(box[:, 0].min()).astype(np.int32), 0, w - 1)
        xmax = np.clip(np.ceil(box[:, 0].max()).astype(np.int32), 0, w - 1)
        ymin = np.clip(np.floor(box[:, 1].min()).astype(np.int32), 0, h - 1)
        ymax = np.clip(np.ceil(box[:, 1].max()).astype(np.int32), 0, h - 1)

        mask = np.zeros((ymax - ymin + 1, xmax - xmin + 1), dtype=np.uint8)
        box[:, 0] = box[:, 0] - xmin
        box[:, 1] = box[:, 1] - ymin
        cv2.fillPoly(mask, box.reshape(1, -1, 2).astype(np.int32), 1)
        return cv2.mean(bitmap[ymin:ymax + 1, xmin:xmax + 1], mask)[0]

    def __call__(self, outs_dict, shape_list):
        pred = outs_dict['maps']
        if isinstance(pred, paddle.Tensor):
            pred = pred.numpy()
        pred = pred[:, 0, :, :]
        segmentation = pred > self.thresh

        boxes_batch = []
        for batch_index in range(pred.shape[0]):
            src_h, src_w, ratio_h, ratio_w = shape_list[batch_index]
            if self.dilation_kernel is not None:
                mask = cv2.dilate(
                    np.array(segmentation[batch_index]).astype(np.uint8),
                    self.dilation_kernel)
            else:
                mask = segmentation[batch_index]
            boxes, scores = self.boxes_from_bitmap(pred[batch_index], mask,
                                                   src_w, src_h)

            boxes_batch.append({'points': boxes})
        return boxes_batch
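
  • Finally, a hypothetical end-to-end call; each row of shape_list is [src_h, src_w, ratio_h, ratio_w] as produced by the data pipeline, and here the network output is faked with random numbers:

import numpy as np

# hypothetical usage: turn a (fake) probability map back into boxes in the
# original image's coordinate system
post = DBPostProcess(thresh=0.3, box_thresh=0.7, unclip_ratio=2.0)
outs_dict = {'maps': np.random.rand(1, 1, 960, 960).astype('float32')}
shape_list = np.array([[720, 1280, 960 / 720, 960 / 1280]])
boxes_batch = post(outs_dict, shape_list)
print(boxes_batch[0]['points'])   # detected quadrilaterals, shape (num_boxes, 4, 2)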

Summary

  • PaddleOCR is genuinely well crafted; the code is on the whole clearly written and well worth studying.
