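This article summarizes and analyzes commonly used text detection models. I have actually run some of them; others are covered through their papers and related code. If you spot any mistakes, corrections are welcome.
The models are analyzed in detail below.
Code: https://github.com/xiaofengShi/CHINESE-OCR
CTPN is currently one of the most widely used detection models for printed text.
CTPN is derived from Faster R-CNN. The table below compares the two.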
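Component | Faster R-CNN | CTPN |
---|---|---|
Base net | VGG16, VGG19, ResNet | VGG16; other CNN backbones can also be used |
RPN prediction | Generated by convolution on the base net's prediction layer | Generated by a bidirectional RNN followed by an FC layer after the base net |
ROI | Built for object detection, a multi-class task: includes the ROI stage with class loss and box regression | Text extraction is a binary task: no ROI stage or class loss; the objectness loss and box regression are computed only at the RPN layer |
Anchor | 9 anchors in total: 3 aspect ratios × 3 scales | Fixed anchor width, 10 different heights |
Batch | Only one sample per training step | Only one sample per training step |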
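CTPN's network design shows that it generally uses a pretrained VGG net and detects only horizontal text, so it is typically used for printed text in standard layouts. Adding the angle of the box to the box regression makes it possible to detect rotated text, as the EAST model does.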
Network model
Let's look directly at CTPN's network code:
```python
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder)

# Import paths assume the package layout of the linked repository.
from .network import Network
from ..fast_rcnn.config import cfg


class VGGnet_train(Network):
    # Inherits from Network; for Network itself see
    # https://github.com/xiaofengShi/CHINESE-OCR/blob/master/ctpn/lib/networks/network.py
    def __init__(self, trainable=True):
        self.inputs = []
        self.data = tf.placeholder(tf.float32, shape=[None, None, None, 3], name='data')
        self.im_info = tf.placeholder(tf.float32, shape=[None, 3], name='im_info')
        self.gt_boxes = tf.placeholder(tf.float32, shape=[None, 5], name='gt_boxes')
        self.gt_ishard = tf.placeholder(tf.int32, shape=[None], name='gt_ishard')
        self.dontcare_areas = tf.placeholder(tf.float32, shape=[None, 4], name='dontcare_areas')
        self.keep_prob = tf.placeholder(tf.float32)
        self.layers = dict({'data': self.data, 'im_info': self.im_info,
                            'gt_boxes': self.gt_boxes, 'gt_ishard': self.gt_ishard,
                            'dontcare_areas': self.dontcare_areas})
        self.trainable = trainable
        self.setup()

    def setup(self):
        # Text proposals have 2 classes: text and background
        n_classes = cfg.NCLASSES
        # Initial anchor size; the paper uses 16
        anchor_scales = cfg.ANCHOR_SCALES
        _feat_stride = [16, ]

        # Base net is VGG16, built with the chained helper methods of Network
        (self.feed('data')
         .conv(3, 3, 64, 1, 1, name='conv1_1')
         .conv(3, 3, 64, 1, 1, name='conv1_2')
         .max_pool(2, 2, 2, 2, padding='VALID', name='pool1')
         .conv(3, 3, 128, 1, 1, name='conv2_1')
         .conv(3, 3, 128, 1, 1, name='conv2_2')
         .max_pool(2, 2, 2, 2, padding='VALID', name='pool2')
         .conv(3, 3, 256, 1, 1, name='conv3_1')
         .conv(3, 3, 256, 1, 1, name='conv3_2')
         .conv(3, 3, 256, 1, 1, name='conv3_3')
         .max_pool(2, 2, 2, 2, padding='VALID', name='pool3')
         .conv(3, 3, 512, 1, 1, name='conv4_1')
         .conv(3, 3, 512, 1, 1, name='conv4_2')
         .conv(3, 3, 512, 1, 1, name='conv4_3')
         .max_pool(2, 2, 2, 2, padding='VALID', name='pool4')
         .conv(3, 3, 512, 1, 1, name='conv5_1')
         .conv(3, 3, 512, 1, 1, name='conv5_2')
         .conv(3, 3, 512, 1, 1, name='conv5_3'))

        # RPN
        # Convolve the previous feature map into a 512-channel feature map
        (self.feed('conv5_3').conv(3, 3, 512, 1, 1, name='rpn_conv/3x3'))

        # The last conv feature map has shape batch*h*w*512
        # The original single-layer bidirectional LSTM
        (self.feed('rpn_conv/3x3').Bilstm(512, 128, 512, name='lstm_o'))
        # The BiLSTM output has shape (N, H, W, 512)

        # As in Faster R-CNN, the RPN predicts objectness and box regression.
        # CTPN uses a bidirectional LSTM plus a fully connected layer, whereas
        # Faster R-CNN generates them by convolution on the last base-net layer.
        # The LSTM output is used to compute position offsets and class scores
        # (object or not, without distinguishing object categories).
        # Input shape:  (N, H, W, 512)
        # Output shape: (N, H, W, int(d_o))
        # This layer plays the role of the last feature map in object detection.
        # rpn_bbox_pred: 4 position offsets per anchor at each h*w location
        # rpn_cls_score: 2 confidence scores per anchor at each h*w location
        (self.feed('lstm_o').lstm_fc(512, len(anchor_scales) * 10 * 4, name='rpn_bbox_pred'))
        (self.feed('lstm_o').lstm_fc(512, len(anchor_scales) * 10 * 2, name='rpn_cls_score'))

        # Generate training labels on the fly
        # output: rpn_labels(HxWxA, 2) rpn_bbox_targets(HxWxA, 4)
        #         rpn_bbox_inside_weights rpn_bbox_outside_weights
        # Label every anchor and compute its regression target (also as deltas),
        # together with the inside and outside weights
        (self.feed('rpn_cls_score', 'gt_boxes', 'gt_ishard', 'dontcare_areas', 'im_info')
         .anchor_target_layer(_feat_stride, anchor_scales, name='rpn-data'))

        # shape is (1, H, W, Ax2) -> (1, H, WxA, 2)
        # Apply softmax to the scores to get values between 0 and 1
        (self.feed('rpn_cls_score')
         .spatial_reshape_layer(2, name='rpn_cls_score_reshape')
         .spatial_softmax(name='rpn_cls_prob'))

        '''
        # The part below is the R-CNN head from Faster R-CNN,
        # i.e. the ROIPooling stage that follows the RPN (disabled in CTPN)
        (self.feed('rpn_cls_prob').spatial_reshape_layer(len(anchor_scales) * 10 * 2, name='rpn_cls_prob_reshape'))
        self.feed('rpn_cls_prob_reshape', 'rpn_bbox_pred', 'im_info').proposal_layer(
            _feat_stride, anchor_scales, 'TRAIN', name='rpn_rois')
        (self.feed('rpn_rois', 'gt_boxes').proposal_target_layer(n_classes, name='roi-data'))
        # ========= RCNN ============
        (self.feed('conv5_3', 'roi-data').roi_pool(7, 7, 1.0/16, name='pool_5')
         .fc(4096, name='fc6').dropout(0.5, name='drop6')
         .fc(4096, name='fc7').dropout(0.5, name='drop7')
         .fc(n_classes, relu=False, name='cls_score').softmax(name='cls_prob'))
        (self.feed('drop7').fc(n_classes*4, relu=False, name='bbox_pred'))
        '''
```
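As the code shows, CTPN's structure is adapted from Faster R-CNN. A VGG net extracts image features, and the last feature map has shape $[N, H, W, C]$. It is reshaped to $[NH, W, C]$ so that every row becomes a sequence, and a BLSTM produces an output of shape $[NH, W, 2D]$, where $D$ is the number of hidden units of the unidirectional RNN. This is reshaped to $[NHW, 2D]$, mapped by a fully connected layer to $[NHW, C]$, and finally reshaped back to $[N, H, W, C]$. In this step the RNN links the CNN features along the width of the feature map. Next, the `lstm_fc` function predicts the class and the bounding-box regression for the anchors. Every position on this feature map generates $A$ anchors, and each anchor has a class prediction and a box regression prediction: for classification, each grid position produces $2A$ scores; for box regression, each grid position produces $4A$ values.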
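The reshape chain described above can be traced with a small NumPy sketch. This is only a shape walkthrough under stated assumptions: the function name `ctpn_sequence_shapes` is hypothetical, and the BLSTM is replaced by a per-step linear map, so only the tensor shapes (not the recurrent behaviour) match the real network.

```python
import numpy as np


def ctpn_sequence_shapes(N=1, H=4, W=6, C=512, D=128, seed=0):
    """Trace shapes through reshape -> (stand-in) BLSTM -> FC -> reshape."""
    rng = np.random.default_rng(seed)
    feat = rng.standard_normal((N, H, W, C))           # last conv feature map
    seq = feat.reshape(N * H, W, C)                    # every row is a sequence of length W
    # Stand-in for the BLSTM: a linear map to 2*D channels at each step.
    # A real BLSTM also mixes information along W; only the shapes agree here.
    blstm_out = seq @ rng.standard_normal((C, 2 * D))  # (N*H, W, 2D)
    flat = blstm_out.reshape(N * H * W, 2 * D)         # (N*H*W, 2D)
    fc_out = flat @ rng.standard_normal((2 * D, C))    # fully connected: (N*H*W, C)
    out = fc_out.reshape(N, H, W, C)                   # back to a feature map
    return seq.shape, blstm_out.shape, flat.shape, out.shape


shapes = ctpn_sequence_shapes()
print(shapes)
```

With the defaults this prints `((4, 6, 512), (4, 6, 256), (24, 256), (1, 4, 6, 512))`, matching the $[NH, W, C] \to [NH, W, 2D] \to [NHW, 2D] \to [N, H, W, C]$ sequence in the text.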
The network structure is shown below.
Figure: CTPN model structure
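Anchor generation and filtering
Within the whole model, the AnchorGen step needs a detailed explanation. This is the well-known RPN; the code below walks through it: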
```python
# -*- coding:utf-8 -*-
import numpy as np
import numpy.random as npr

from ..fast_rcnn.config import cfg
from bbox import bbox_overlaps, bbox_intersections

DEBUG = False


# Build the basic anchor boxes
def generate_basic_anchors(sizes, base_size=16):
    base_anchor = np.array([0, 0, base_size - 1, base_size - 1], np.int32)
    anchors = np.zeros((len(sizes), 4), np.int32)
    index = 0
    for h, w in sizes:
        anchors[index] = scale_anchor(base_anchor, h, w)
        index += 1
    return anchors


# Resize the base anchor to the given height and width, keeping its centre
def scale_anchor(anchor, h, w):
    x_ctr = (anchor[0] + anchor[2]) * 0.5
    y_ctr = (anchor[1] + anchor[3]) * 0.5
    scaled_anchor = anchor.copy()
    scaled_anchor[0] = x_ctr - w / 2  # xmin
    scaled_anchor[2] = x_ctr + w / 2  # xmax
    scaled_anchor[1] = y_ctr - h / 2  # ymin
    scaled_anchor[3] = y_ctr + h / 2  # ymax
    return scaled_anchor


# Generate the anchor boxes: fixed width, varying heights.
# The arguments are kept for signature compatibility and are not used.
def generate_anchors(base_size=16, ratios=[0.5, 1, 2],
                     scales=2 ** np.arange(3, 6)):
    heights = [11, 16, 23, 33, 48, 68, 97, 139, 198, 283]
    widths = [16]
    sizes = []
    for h in heights:
        for w in widths:
            sizes.append((h, w))
    return generate_basic_anchors(sizes)


# Convert anchors and ground truth into regression targets, as in the paper
def bbox_transform(ex_rois, gt_rois):
    """
    Computes the distance from ground-truth boxes to the given boxes,
    normalised by their size.
    :param ex_rois: n * 4 numpy array, anchor boxes
    :param gt_rois: n * 4 numpy array, ground-truth boxes
    :return: deltas: n * 4 numpy array, regression targets
    """
    ex_widths = ex_rois[:, 2] - ex_rois[:, 0] + 1.0   # anchor width
    ex_heights = ex_rois[:, 3] - ex_rois[:, 1] + 1.0  # anchor height
    ex_ctr_x = ex_rois[:, 0] + 0.5 * ex_widths        # anchor center x
    ex_ctr_y = ex_rois[:, 1] + 0.5 * ex_heights       # anchor center y

    assert np.min(ex_widths) > 0.1 and np.min(ex_heights) > 0.1, \
        'Invalid boxes found: {} {}'.format(
            ex_rois[np.argmin(ex_widths), :], ex_rois[np.argmin(ex_heights), :])

    gt_widths = gt_rois[:, 2] - gt_rois[:, 0] + 1.0   # gt_box width
    gt_heights = gt_rois[:, 3] - gt_rois[:, 1] + 1.0  # gt_box height
    gt_ctr_x = gt_rois[:, 0] + 0.5 * gt_widths        # gt_box center x
    gt_ctr_y = gt_rois[:, 1] + 0.5 * gt_heights       # gt_box center y

    targets_dx = (gt_ctr_x - ex_ctr_x) / ex_widths    # (gt_c_x - a_c_x) / a_w
    targets_dy = (gt_ctr_y - ex_ctr_y) / ex_heights
    targets_dw = np.log(gt_widths / ex_widths)
    targets_dh = np.log(gt_heights / ex_heights)

    targets = np.vstack(
        (targets_dx, targets_dy, targets_dw, targets_dh)).transpose()
    return targets


# Generate anchors and assign training targets to them
def anchor_target_layer(rpn_cls_score, gt_boxes, gt_ishard, dontcare_areas,
                        im_info, _feat_stride=[16, ], anchor_scales=[16, ]):
    """
    Assign anchors to ground-truth targets. Produces anchor classification
    labels and bounding-box regression targets.
    Parameters
    ----------
    rpn_cls_score: (1, H, W, Ax2) bg/fg scores of previous conv layer
    gt_boxes: (G, 5) vstack of [x1, y1, x2, y2, class]
    gt_ishard: (G, 1), 1 or 0 indicates difficult or not
    dontcare_areas: (D, 4), areas that may contain small unlabelled objects.
                    D may be 0
    im_info: a list of [image_height, image_width, scale_ratios]
    _feat_stride: the downsampling ratio of feature map to the original input image
    anchor_scales: the scales to the basic_anchor (basic anchor is [16, 16])
    ----------
    Returns
    ----------
    rpn_labels: (HxWxA, 1), for each anchor, 0 denotes bg, 1 fg, -1 dontcare
    rpn_bbox_targets: (HxWxA, 4), distances of the anchors to the gt_boxes
                      (possibly transformed) that are the regression objectives
    rpn_bbox_inside_weights: (HxWxA, 4) weights of each box, mainly taken from
                             hyperparameters in cfg
    rpn_bbox_outside_weights: (HxWxA, 4) used to balance fg/bg, because the
                              numbers of bgs and fgs may differ significantly
    """
    # anchors are [x_min, y_min, x_max, y_max]
    # Generate the basic anchors, 10 in total
    _anchors = generate_anchors(scales=np.array(anchor_scales))
    _num_anchors = _anchors.shape[0]  # 10 anchors

    # allow boxes to sit over the edge by a small amount
    _allowed_border = 0

    # Image information: [height, width, scale ratio]
    im_info = im_info[0]

    # Place the anchors on the feature map and add the shifts to obtain
    # their coordinates in the real image
    # Algorithm:
    #   for each (H, W) location i
    #     generate A anchor boxes centered on cell i
    #     filter out-of-image anchors
    #     measure GT overlap
    assert rpn_cls_score.shape[0] == 1, \
        'Only single item batches are supported'

    # map of shape (..., H, W)
    height, width = rpn_cls_score.shape[1:3]  # feature-map height and width

    # 1. Generate shifted anchors
    shift_x = np.arange(0, width) * _feat_stride
    shift_y = np.arange(0, height) * _feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)  # in W H order
    # Offsets from feature-map cells to anchor positions in the real image;
    # shifts forms a grid of shape [height*width, 4]
    shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                        shift_x.ravel(), shift_y.ravel())).transpose()
    A = _num_anchors     # 10 anchors
    K = shifts.shape[0]  # feature-map width times height

    # Generate A anchors for every feature-map cell, shape is [K, A, 4]
    all_anchors = (_anchors.reshape((1, A, 4)) +
                   shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
    all_anchors = all_anchors.reshape((K * A, 4))  # shape is (K*A, 4)
    total_anchors = int(K * A)

    # Only keep anchors inside the image.
    # Anchors vary in size, so those near the border may exceed the original
    # image. Drop them and keep the indices of the rest within all_anchors.
    # anchors[:] = [x_min, y_min, x_max, y_max]
    inds_inside = np.where(
        (all_anchors[:, 0] >= -_allowed_border) &
        (all_anchors[:, 1] >= -_allowed_border) &
        (all_anchors[:, 2] < im_info[1] + _allowed_border) &  # width
        (all_anchors[:, 3] < im_info[0] + _allowed_border)    # height
    )[0]

    # keep only inside anchors
    anchors = all_anchors[inds_inside, :]
    # The anchors are now ready
    # --------------------------------------------------------------
    # label: 1 is positive, 0 is negative, -1 is dont care
    labels = np.empty((len(inds_inside),), dtype=np.float32)
    labels.fill(-1)  # initialise all labels to -1

    # Overlaps between the anchors and the gt boxes, used to label anchors:
    # IoU = intersection area / union area.
    # overlaps shape is [anchors.shape[0], gt_boxes.shape[0]]
    overlaps = bbox_overlaps(
        np.ascontiguousarray(anchors, dtype=np.float64),
        np.ascontiguousarray(gt_boxes, dtype=np.float64))
    # For every anchor, the gt box with the largest overlap
    argmax_overlaps = overlaps.argmax(axis=1)
    max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]
    # For every gt box, the anchor(s) with the largest overlap
    gt_argmax_overlaps = overlaps.argmax(axis=0)
    gt_max_overlaps = overlaps[gt_argmax_overlaps,
                               np.arange(overlaps.shape[1])]
    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]

    if not cfg.TRAIN.RPN_CLOBBER_POSITIVES:
        # assign bg labels first so that positive labels can clobber them;
        # anchors with overlap below 0.3 are negatives (label 0)
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

    # Positives: IoU above 0.7, plus the best-matching anchor of each gt box
    # fg label: for each gt, anchor with highest overlap
    labels[gt_argmax_overlaps] = 1
    # fg label: above threshold IoU (0.7)
    labels[max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = 1

    if cfg.TRAIN.RPN_CLOBBER_POSITIVES:
        # assign bg labels last so that negative labels can clobber positives
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

    # preclude dontcare areas (not considered further in this article)
    if dontcare_areas is not None and dontcare_areas.shape[0] > 0:
        # intersec shape is D x A
        intersecs = bbox_intersections(
            np.ascontiguousarray(dontcare_areas, dtype=np.float64),  # D x 4
            np.ascontiguousarray(anchors, dtype=np.float64)          # A x 4
        )
        intersecs_ = intersecs.sum(axis=0)  # A x 1
        labels[intersecs_ > cfg.TRAIN.DONTCARE_AREA_INTERSECTION_HI] = -1

    # Hard samples are not considered further in this article.
    # preclude hard samples that are highly occluded, truncated or difficult to see
    if cfg.TRAIN.PRECLUDE_HARD_SAMPLES and gt_ishard is not None \
            and gt_ishard.shape[0] > 0:
        assert gt_ishard.shape[0] == gt_boxes.shape[0]
        gt_ishard = gt_ishard.astype(int)
        gt_hardboxes = gt_boxes[gt_ishard == 1, :]
        if gt_hardboxes.shape[0] > 0:
            # H x A
            hard_overlaps = bbox_overlaps(
                np.ascontiguousarray(gt_hardboxes, dtype=np.float64),  # H x 4
                np.ascontiguousarray(anchors, dtype=np.float64))       # A x 4
            hard_max_overlaps = hard_overlaps.max(axis=0)  # (A)
            labels[hard_max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = -1
            max_intersec_label_inds = hard_overlaps.argmax(axis=1)  # H x 1
            labels[max_intersec_label_inds] = -1

    # Subsample positive labels if we have too many:
    # cap the positives at 128 and mark the excess as dontcare.
    # TODO: may need revisiting, since text fragments produce many positives.
    num_fg = int(cfg.TRAIN.RPN_FG_FRACTION * cfg.TRAIN.RPN_BATCHSIZE)
    fg_inds = np.where(labels == 1)[0]
    if len(fg_inds) > num_fg:
        disable_inds = npr.choice(
            fg_inds, size=(len(fg_inds) - num_fg), replace=False)
        labels[disable_inds] = -1  # randomly drop some positives

    # Subsample negative labels if we have too many.
    # Positives plus negatives total 256 with at most 128 positives; if there
    # are fewer than 128 positives, negatives fill the remainder.
    num_bg = cfg.TRAIN.RPN_BATCHSIZE - np.sum(labels == 1)
    bg_inds = np.where(labels == 0)[0]
    if len(bg_inds) > num_bg:
        disable_inds = npr.choice(
            bg_inds, size=(len(bg_inds) - num_bg), replace=False)
        labels[disable_inds] = -1

    # Labels are done; now compute the box regression targets
    # --------------------------------------------------------------
    bbox_targets = np.zeros((len(inds_inside), 4), dtype=np.float32)
    # Targets are the offsets between each anchor and its matched gt box
    bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])

    # Inside weights: 1 for foreground, 0 otherwise
    bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    bbox_inside_weights[labels == 1, :] = np.array(
        cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)

    bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    if cfg.TRAIN.RPN_POSITIVE_WEIGHT < 0:
        # Uniform weighting here: positives get 1, negatives get 0
        positive_weights = np.ones((1, 4))
        negative_weights = np.zeros((1, 4))
    else:
        assert ((cfg.TRAIN.RPN_POSITIVE_WEIGHT > 0) &
                (cfg.TRAIN.RPN_POSITIVE_WEIGHT < 1))
        # Note: as written in the source, the `+ 1` is applied after the division
        positive_weights = (cfg.TRAIN.RPN_POSITIVE_WEIGHT /
                            (np.sum(labels == 1)) + 1)
        negative_weights = ((1.0 - cfg.TRAIN.RPN_POSITIVE_WEIGHT) /
                            (np.sum(labels == 0)) + 1)
    # Outside weights: initialised to 0, then set per label
    bbox_outside_weights[labels == 1, :] = positive_weights
    bbox_outside_weights[labels == 0, :] = negative_weights

    # Map back to the original set of anchors: the out-of-image anchors that
    # were dropped earlier are added back. inds_inside indexes the full set.
    labels = _unmap(labels, total_anchors, inds_inside, fill=-1)  # dontcare
    bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
    bbox_inside_weights = _unmap(
        bbox_inside_weights, total_anchors, inds_inside, fill=0)
    bbox_outside_weights = _unmap(
        bbox_outside_weights, total_anchors, inds_inside, fill=0)

    # Reshape everything to the feature-map layout
    rpn_labels = labels.reshape((1, height, width, A))
    rpn_bbox_targets = bbox_targets.reshape((1, height, width, A * 4))
    rpn_bbox_inside_weights = bbox_inside_weights.reshape(
        (1, height, width, A * 4))
    rpn_bbox_outside_weights = bbox_outside_weights.reshape(
        (1, height, width, A * 4))

    rpn_data = (rpn_labels, rpn_bbox_targets,
                rpn_bbox_inside_weights, rpn_bbox_outside_weights)
    return rpn_data


# Restore the full anchor set after out-of-image anchors were excluded
def _unmap(data, count, inds, fill=0):
    """Unmap a subset of items (data) back to the original set of items
    (of size count)."""
    if len(data.shape) == 1:
        ret = np.empty((count,), dtype=np.float32)
        ret.fill(fill)
        ret[inds] = data
    else:
        ret = np.empty((count,) + data.shape[1:], dtype=np.float32)
        ret.fill(fill)
        ret[inds, :] = data
    return ret


# Compute the box offsets between anchors and gt boxes
def _compute_targets(ex_rois, gt_rois):
    """Compute bounding-box regression targets for an image."""
    assert ex_rois.shape[0] == gt_rois.shape[0]
    assert ex_rois.shape[1] == 4
    assert gt_rois.shape[1] == 5
    return bbox_transform(ex_rois, gt_rois[:, :4]).astype(np.float32, copy=False)
```
The bbox routines are written in Cython (a .pyx file):
```cython
import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t


# Compute IoU
def bbox_overlaps(
        np.ndarray[DTYPE_t, ndim=2] boxes,
        np.ndarray[DTYPE_t, ndim=2] query_boxes):
    """
    Parameters
    ----------
    boxes: (N, 4) ndarray of float, the anchor boxes
    query_boxes: (K, 4) ndarray of float, the ground-truth boxes,
                 [x_min, y_min, x_max, y_max, class]
    Returns
    -------
    overlaps: (N, K) ndarray of overlap between boxes and query_boxes
    """
    cdef unsigned int N = boxes.shape[0]
    cdef unsigned int K = query_boxes.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=2] overlaps = np.zeros((N, K), dtype=DTYPE)
    cdef DTYPE_t iw, ih, box_area
    cdef DTYPE_t ua
    cdef unsigned int k, n
    for k in range(K):
        box_area = (
            (query_boxes[k, 2] - query_boxes[k, 0] + 1) *
            (query_boxes[k, 3] - query_boxes[k, 1] + 1)
        )
        for n in range(N):
            # Horizontal intersection; iw is positive if the boxes overlap
            iw = (
                min(boxes[n, 2], query_boxes[k, 2]) -
                max(boxes[n, 0], query_boxes[k, 0]) + 1
            )
            if iw > 0:
                # Vertical intersection
                ih = (
                    min(boxes[n, 3], query_boxes[k, 3]) -
                    max(boxes[n, 1], query_boxes[k, 1]) + 1
                )
                if ih > 0:
                    # The boxes intersect: compute the union area
                    ua = float(
                        (boxes[n, 2] - boxes[n, 0] + 1) *
                        (boxes[n, 3] - boxes[n, 1] + 1) +
                        box_area - iw * ih
                    )
                    # intersection area / union area
                    overlaps[n, k] = iw * ih / ua
    return overlaps


# Ratio of the intersection area to the query-box area
def bbox_intersections(
        np.ndarray[DTYPE_t, ndim=2] boxes,
        np.ndarray[DTYPE_t, ndim=2] query_boxes):
    """
    For each query box compute the intersection ratio covered by boxes
    ----------
    Parameters
    ----------
    boxes: (N, 4) ndarray of float
    query_boxes: (K, 4) ndarray of float
    Returns
    -------
    overlaps: (N, K) ndarray of intersec between boxes and query_boxes
    """
    cdef unsigned int N = boxes.shape[0]
    cdef unsigned int K = query_boxes.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=2] intersec = np.zeros((N, K), dtype=DTYPE)
    cdef DTYPE_t iw, ih, box_area
    cdef DTYPE_t ua
    cdef unsigned int k, n
    for k in range(K):
        box_area = (
            (query_boxes[k, 2] - query_boxes[k, 0] + 1) *
            (query_boxes[k, 3] - query_boxes[k, 1] + 1)
        )
        for n in range(N):
            iw = (
                min(boxes[n, 2], query_boxes[k, 2]) -
                max(boxes[n, 0], query_boxes[k, 0]) + 1
            )
            if iw > 0:
                ih = (
                    min(boxes[n, 3], query_boxes[k, 3]) -
                    max(boxes[n, 1], query_boxes[k, 1]) + 1
                )
                if ih > 0:
                    intersec[n, k] = iw * ih / box_area
    return intersec
```
The comments in the code explain each step. The anchor generation code lives in anchor_target_layer.py.
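For experimenting without compiling the Cython extension, the IoU computation can be approximated in plain NumPy. The sketch below is an illustration, not part of the linked repository: the name `bbox_overlaps_np` is hypothetical, and it assumes the same inclusive-pixel `+ 1` convention as the `.pyx` code.

```python
import numpy as np


def bbox_overlaps_np(boxes, query_boxes):
    """IoU between every box (N, 4+) and every query box (K, 4+), shape (N, K)."""
    b = np.asarray(boxes, dtype=np.float64)[:, None, :4]        # (N, 1, 4)
    q = np.asarray(query_boxes, dtype=np.float64)[None, :, :4]  # (1, K, 4)
    # Intersection width and height, clipped at 0 when the boxes do not overlap
    iw = np.minimum(b[..., 2], q[..., 2]) - np.maximum(b[..., 0], q[..., 0]) + 1
    ih = np.minimum(b[..., 3], q[..., 3]) - np.maximum(b[..., 1], q[..., 1]) + 1
    inter = np.clip(iw, 0, None) * np.clip(ih, 0, None)
    area_b = (b[..., 2] - b[..., 0] + 1) * (b[..., 3] - b[..., 1] + 1)
    area_q = (q[..., 2] - q[..., 0] + 1) * (q[..., 3] - q[..., 1] + 1)
    return inter / (area_b + area_q - inter)


anchors = np.array([[0, 0, 15, 15], [8, 0, 23, 15], [100, 100, 115, 115]])
gt = np.array([[0, 0, 15, 15, 1]])  # last column is the class, ignored here
iou = bbox_overlaps_np(anchors, gt)
print(iou.ravel())
```

The first anchor coincides with the ground-truth box (IoU 1), the second overlaps half of it (IoU 1/3), and the third does not touch it (IoU 0).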
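Figure: Anchors
First, $A$ anchors are generated at every cell of the feature map from the configured anchor heights and width. Some of these anchors extend beyond the border of the original image, as the figure above shows. These out-of-bounds anchors are removed first, and the indices of the kept anchors within the full anchor set are recorded. IoU is then computed between the remaining inside anchors and the ground truth (when an anchor and a GT box intersect, IoU is the intersection area divided by the area of their union). Two rules determine the positive anchors. First, an anchor whose IoU with a GT box exceeds the threshold 0.7 is positive. Second, for each GT box, the anchor with the highest IoU with that box is positive, so this step selects as many anchors as there are GT boxes. This fixes the positive anchors; in the code, anchors whose best IoU is below 0.3 become negatives and the rest are ignored. The configured sample counts then cap the number of positives and negatives, and the offsets between the sampled anchors and their GT boxes are computed and used as the box regression labels. The offset between an anchor and a GT box is computed as shown in the figure below.
Figure: Anchor and ground truth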
In the figure, red marks the ground truth and black marks the anchor box. First compute the centre coordinates, width, and height of both rectangles; the targets are then

$$
\begin{aligned}
target_x &= (GT_x - AN_x) / AN_{width} \\
target_y &= (GT_y - AN_y) / AN_{height} \\
target_w &= \log(GT_{width} / AN_{width}) \\
target_h &= \log(GT_{height} / AN_{height})
\end{aligned}
$$
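A worked example makes the four targets concrete. The sketch below mirrors the `bbox_transform` logic for a single anchor and GT box; the function name `regression_targets` and the sample coordinates are made up for illustration.

```python
import numpy as np


def regression_targets(anchor, gt):
    """anchor, gt: [x_min, y_min, x_max, y_max] with inclusive pixel corners."""
    aw = anchor[2] - anchor[0] + 1.0   # anchor width
    ah = anchor[3] - anchor[1] + 1.0   # anchor height
    ax = anchor[0] + 0.5 * aw          # anchor centre x
    ay = anchor[1] + 0.5 * ah          # anchor centre y
    gw = gt[2] - gt[0] + 1.0
    gh = gt[3] - gt[1] + 1.0
    gx = gt[0] + 0.5 * gw
    gy = gt[1] + 0.5 * gh
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])


anchor = np.array([0.0, 0.0, 15.0, 15.0])  # 16 x 16 anchor
gt = np.array([0.0, 4.0, 15.0, 35.0])      # 16 wide, 32 high, shifted down
targets = regression_targets(anchor, gt)
print(targets)
```

Here the GT box has the same width and horizontal centre as the anchor, so $target_x = target_w = 0$; its centre lies 12 pixels lower, giving $target_y = 12/16 = 0.75$, and it is twice as tall, giving $target_h = \log 2 \approx 0.693$.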
The whole procedure is shown in the figure below.
Figure: CTPN anchor generation
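The labelling rules in this procedure can be tried on a toy IoU matrix. This is a simplified sketch assuming the 0.7 and 0.3 thresholds mentioned in the code comments; `label_anchors` is a hypothetical helper that leaves out the dontcare areas, hard samples, and subsampling handled by `anchor_target_layer`.

```python
import numpy as np


def label_anchors(overlaps, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (text), 0 (background) or -1 (ignored) for each anchor.

    overlaps: (num_anchors, num_gt) IoU matrix.
    """
    labels = np.full(overlaps.shape[0], -1, dtype=np.int64)
    max_per_anchor = overlaps.max(axis=1)
    # Background first, so that positives can overwrite it
    labels[max_per_anchor < neg_thresh] = 0
    # For each GT box, the anchor(s) with the highest IoU become positive
    best_per_gt = overlaps.max(axis=0)
    labels[np.where(overlaps == best_per_gt)[0]] = 1
    # Anchors above the IoU threshold become positive
    labels[max_per_anchor >= pos_thresh] = 1
    return labels


overlaps = np.array([[0.80, 0.00],
                     [0.50, 0.10],
                     [0.10, 0.25],
                     [0.05, 0.00]])
labels = label_anchors(overlaps)
print(labels)
```

The third anchor has a best IoU of only 0.25, yet it is labelled positive because it is the best match for the second GT box. The second anchor falls between the two thresholds and is ignored.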
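This concludes my own reading of the CTPN network structure alongside the code. The model was proposed in 2016 and is clearly heavily influenced by Faster R-CNN. CTPN has the following characteristics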
Predicted boxes are filtered with Locality-Aware NMS. The network structure is shown below.
Figure: EAST model
First, a general-purpose network (PVANet in the paper; VGG16, ResNet, and others can be used in practice) serves as the base net for feature extraction.
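A few notes on PVANet: it mainly improves on VGG and applies the result to object detection, targeting the base network of Faster R-CNN. It includes the mCReLU, Inception, and Hyper-feature structures.
Figure: PVANet
The base network used in the paper is PVANet's base network, with the parameters shown below.
Figure: PVANet parameters
The mCReLU and Inception structures are shown below.
Figure: PVANet mCReLU and Inception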
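From this backbone, feature maps are taken from different layers, with sizes 1/32, 1/16, 1/8, and 1/4 of the input image. This yields feature maps at different scales, which addresses the large variation in text-line size: early-stage maps (larger feature maps) can predict small text lines, and late-stage maps (smaller feature maps) can predict large text lines.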
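As a quick check of these ratios, the spatial size of each extracted map can be computed for a given input. The helper `feature_map_sizes` is hypothetical and assumes the input height and width are divisible by 32.

```python
def feature_map_sizes(height, width, strides=(32, 16, 8, 4)):
    """Map each stride to the (height, width) of its feature map."""
    return {s: (height // s, width // s) for s in strides}


sizes = feature_map_sizes(512, 512)
print(sizes)
```

For a 512 × 512 input this gives 16 × 16, 32 × 32, 64 × 64, and 128 × 128 maps at strides 32, 16, 8, and 4.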
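Feature-merging layer: the extracted features are merged following the U-Net approach. Starting from the top (smallest) feature map of the feature extractor, the features are merged step by step with the larger maps according to the corresponding rule, so the feature map size keeps growing.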
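The merge order can be sketched with NumPy arrays standing in for feature maps. This is a shape-only illustration under stated assumptions: `unpool2x` uses nearest-neighbour upsampling as a stand-in for the unpooling step, the convolutions that normally follow each concatenation are omitted, and both function names are hypothetical.

```python
import numpy as np


def unpool2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def merge_features(feats):
    """Merge (H, W, C) maps ordered from smallest (1/32) to largest (1/4)."""
    h = feats[0]
    for f in feats[1:]:
        # Upsample the running map and concatenate along channels.
        # A real network would reduce the channels with convolutions here.
        h = np.concatenate([unpool2x(h), f], axis=-1)
    return h


feats = [np.zeros((4, 4, 8)), np.zeros((8, 8, 4)),
         np.zeros((16, 16, 2)), np.zeros((32, 32, 1))]
merged = merge_features(feats)
print(merged.shape)
```

The result has the spatial size of the largest map (32 × 32 here) and, because no channel reduction is applied in this sketch, the sum of all channel counts (15).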
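Output layer: it contains the text score and the text geometry. The geometry depends on the text shape, which is either RBOX or QUAD. For RBOX, the network predicts the distances from the current point to the four edges of the GT box plus the angle $\theta$ of the GT box relative to the positive x direction of the image, five values in total: $(d_1, d_2, d_3, d_4, \theta)$. For QUAD, it predicts the coordinates of the four corner points of the GT box, eight values in total. The RBOX case is illustrated below.
Figure: EAST RBOX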
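In the figure, $d_i$ is the distance from the current point to an edge of the GT box. Knowing the distances from a fixed point to the four edges of a rectangle determines the rectangle's position and size, and therefore the rectangle itself.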
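To make this concrete, the sketch below recovers the four corners of a rectangle from a point, four edge distances, and an angle. The ordering of the distances (top, right, bottom, left) and the rotation direction are assumptions made for this illustration, and `rbox_to_corners` is a hypothetical helper.

```python
import numpy as np


def rbox_to_corners(x, y, d, theta=0.0):
    """Corners (TL, TR, BR, BL) of the box around the point (x, y).

    d = (d1, d2, d3, d4) is assumed to be the distance to the
    (top, right, bottom, left) edge; theta rotates the box about (x, y).
    """
    d1, d2, d3, d4 = d
    # Corner offsets from the point for an unrotated box
    rel = np.array([[-d4, -d1], [d2, -d1], [d2, d3], [-d4, d3]],
                   dtype=np.float64)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return rel @ rot.T + np.array([x, y], dtype=np.float64)


corners = rbox_to_corners(50, 40, (10, 30, 5, 20))
rotated = rbox_to_corners(50, 40, (10, 30, 5, 20), theta=0.3)
print(corners)
```

With $\theta = 0$ the box spans x from 30 to 80 and y from 30 to 45. Its width is $d_2 + d_4 = 50$ and its height is $d_1 + d_3 = 15$, and rotating the box leaves both side lengths unchanged.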
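Figure: EAST RBOX and QUAD
As the figure shows, RBOX outputs 5 predicted values and QUAD outputs 8.
Layers $g$ and $h$ are computed by the formulas shown in the figure.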
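Since the figure is not reproduced here, the merging rule can be written out as follows. This is a transcription under the assumption that it follows the EAST paper's notation: $f_i$ is the $i$-th extracted feature map, ordered from smallest to largest, and $[\cdot\,;\cdot]$ denotes concatenation along the channel axis.

$$
g_i =
\begin{cases}
\mathrm{unpool}(h_i) & \text{if } i \le 3 \\
\mathrm{conv}_{3\times3}(h_i) & \text{if } i = 4
\end{cases}
\qquad
h_i =
\begin{cases}
f_i & \text{if } i = 1 \\
\mathrm{conv}_{3\times3}\big(\mathrm{conv}_{1\times1}([g_{i-1}; f_i])\big) & \text{otherwise}
\end{cases}
$$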
After the feature maps are merged, the network produces its predictions: 5 or 8 values, depending on the box format.
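The total loss combines the classification loss and the regression loss:

$$
L = L_s + \lambda_g L_g
$$

For the classification loss, the paper uses a balanced cross-entropy:

$$
L_s = \text{balanced-xent}(\hat{Y}, Y) = -\beta\, Y \log \hat{Y} - (1-\beta)(1-Y)\log(1-\hat{Y}),
\qquad \beta = 1 - \frac{\sum_{y \in Y} y}{|Y|}
$$

Here $\hat{Y}$ is the prediction and $Y$ is the label. Compared with plain cross-entropy, the balanced version balances the positive and negative samples.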
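The balancing effect is easy to see numerically. The NumPy sketch below is a minimal illustration, assuming the per-pixel losses are averaged (the reduction is not specified above); `balanced_xent` and the `eps` clipping are choices made for this example.

```python
import numpy as np


def balanced_xent(y_pred, y_true, eps=1e-7):
    """Balanced cross-entropy averaged over all pixels; also returns beta."""
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    y_true = np.asarray(y_true, dtype=np.float64)
    beta = 1.0 - y_true.mean()  # fraction of negative pixels
    per_pixel = (-beta * y_true * np.log(y_pred)
                 - (1 - beta) * (1 - y_true) * np.log(1 - y_pred))
    return per_pixel.mean(), beta


y_true = np.array([1, 0, 0, 0])          # one text pixel, three background
y_pred = np.array([0.5, 0.5, 0.5, 0.5])  # uninformative prediction
loss, beta = balanced_xent(y_pred, y_true)
print(loss, beta)
```

With one positive and three negatives, $\beta = 0.75$: the single positive pixel contributes $0.75 \log 2$, exactly as much as the three negative pixels together ($3 \times 0.25 \log 2$), so neither class dominates the loss.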
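For the geometry loss $L_g$, the RBOX output consists of the 5 predicted values $(d_1, d_2, d_3, d_4, \theta)$, so the loss is

$$
\begin{aligned}
L_g &= L_{AABB} + \lambda_\theta L_\theta \\
L_{AABB} &= -\log \mathrm{IoU}(\hat{R}, R^*) = -\log \frac{|\hat{R} \cap R^*|}{|\hat{R} \cup R^*|} \\
L_\theta &= 1 - \cos(\hat{\theta} - \theta^*)
\end{aligned}
$$

where a hat marks a predicted quantity and a star marks the ground truth. For the IoU loss, the paper computes the intersection area from its width and height:

$$
\begin{aligned}
w_i &= \min(\hat{d}_2, d_2^*) + \min(\hat{d}_4, d_4^*) \\
h_i &= \min(\hat{d}_1, d_1^*) + \min(\hat{d}_3, d_3^*)
\end{aligned}
$$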
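This way of computing it is in fact problematic, as analysed below.
Figure: EAST IoU
In the figure above, red is the GT box and blue is the prediction. If the angle is ignored, the formula is correct; once the angle is taken into account, the formula for the intersection area no longer holds.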
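The geometry loss can be sketched directly from the distance and angle predictions. This is an illustration only: `rbox_geometry_loss` is a hypothetical name, the default `lambda_theta=1.0` is a placeholder rather than the paper's setting, and the intersection uses the $w_i \cdot h_i$ formula, which is exact only when the two boxes share the same orientation.

```python
import numpy as np


def rbox_geometry_loss(d_pred, d_gt, theta_pred, theta_gt, lambda_theta=1.0):
    """L_g = L_AABB + lambda_theta * L_theta for a single pixel.

    d_pred, d_gt: (d1, d2, d3, d4) distances, assumed (top, right, bottom, left).
    """
    d1p, d2p, d3p, d4p = d_pred
    d1g, d2g, d3g, d4g = d_gt
    area_pred = (d1p + d3p) * (d2p + d4p)
    area_gt = (d1g + d3g) * (d2g + d4g)
    w_i = min(d2p, d2g) + min(d4p, d4g)  # intersection width
    h_i = min(d1p, d1g) + min(d3p, d3g)  # intersection height
    inter = w_i * h_i
    union = area_pred + area_gt - inter
    l_aabb = -np.log(inter / union)
    l_theta = 1.0 - np.cos(theta_pred - theta_gt)
    return l_aabb + lambda_theta * l_theta


# Equal angles: the prediction is shifted horizontally relative to the GT box
loss = rbox_geometry_loss((10, 20, 10, 20), (10, 30, 10, 10), 0.0, 0.0)
perfect = rbox_geometry_loss((10, 30, 10, 10), (10, 30, 10, 10), 0.1, 0.1)
print(loss, perfect)
```

In the first call both boxes are 40 × 20, the intersection is 30 × 20, and the IoU is 600 / 1000 = 0.6, so the loss is $-\log 0.6 \approx 0.51$. A perfect prediction gives a loss of 0.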
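Overviews
Survey of text detection and recognition in natural scenes
Xiang Bai: Annual progress in image OCR | VALSE2018, part 11
Xiang Bai: A light-hearted look at "catching text": scene text detection | VALSE2017, part 10
Research progress in deep-learning-based object detection and scene text detection
Zhihu: text detection survey
Blogs with good paper walkthroughs
Zhihu column: 小石头的码疯窝
OCR_Overview_冠军试炼
Text detection
CTPN
Scene text detection: CTPN principles and implementation
CTPN: Tensorflow
EAST
Blog: EAST
Zhihu: Text detection with EAST
EAST: Tensorflow
EAST: Keras
EAST: Advanced Keras
SegLink
SegLink_Blog
Text detection with SegLink
PixelLink
Text detection with PixelLink
Github: PixelLink
TextBoxes
Paper notes: TextBoxes++: A Single-Shot Oriented Scene Text Detector
Github: TextBoxes++
Corner localization
Based on corner localization and region segmentation
Text recognition
ASTER
Github: ASTER
TextSpotter
Mask TextSpotter
ECCV 2018 OCR paper from Prof. Xiang Bai's team at HUST: Mask TextSpotter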