Object Detection 0-02: YOLO V3 - Network Structure, Input and Output Explained

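    ├── checkpoint // directory where models are saved
    ├── convert_weight.py // converts weights so training can start from a pretrained model
    ├── core // core code folder
    │   ├── backbone.py
    │   ├── common.py
    │   ├── config.py // configuration file
    │   ├── dataset.py // data processing
    │   ├── __init__.py
    │   ├── utils.py
    │   └── yolov3.py // core network structure
    ├── data
    │   ├── anchors // anchor (prior) boxes
    │   │   ├── basline_anchors.txt
    │   │   └── coco_anchors.txt
    │   ├── classes // object classes to train on and predict
    │   │   ├── coco.names
    │   │   └── voc.names
    │   ├── dataset // per-image information: path, boxes, class index
    │   │   ├── voc_test.txt // test data
    │   │   └── voc_train.txt // training data
    ├── docs // miscellaneous
    │   ├── Box-Clustering.ipynb // generates the anchors from the dataset
    │   └── requirements.txt // environment setup
    ├── evaluate.py // model evaluation
    ├── freeze_graph.py // generates the pb file
    ├── image_demo.py // demo that tests a single image
    ├── mAP // stores model-evaluation results
    ├── README.md
    │   └── voc_annotation.py // converts xml into the txt files the network can use
    ├── train.py // model training
    └── video_demo.py // demo that tests on video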


2. YOLO V3 Input


../VOC/train/VOCdevkit/VOC2007\JPEGImages\000005.jpg 263,211,324,339,8 165,264,253,372,8 241,194,295,299,8
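The line above is one entry of the annotation file: an image path followed by space-separated boxes, each written as x_min,y_min,x_max,y_max,class_index. As a rough sketch of how such a line can be split apart (it mirrors what parse_annotation in core/dataset.py does; the helper name parse_line is only an illustration and assumes the path contains no spaces):

```python
def parse_line(annotation):
    """Split one annotation line into (image_path, boxes).

    Each box is a list [x_min, y_min, x_max, y_max, class_index].
    Assumes the image path itself contains no spaces.
    """
    fields = annotation.split()
    image_path = fields[0]
    boxes = [[int(v) for v in box.split(',')] for box in fields[1:]]
    return image_path, boxes


# Raw string so the backslashes in the sample path are kept literally.
line = (r"../VOC/train/VOCdevkit/VOC2007\JPEGImages\000005.jpg "
        "263,211,324,339,8 165,264,253,372,8 241,194,295,299,8")

image_path, boxes = parse_line(line)
print(image_path)
print(len(boxes))   # 3
print(boxes[0])     # [263, 211, 324, 339, 8]
```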


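Before going further, we need to understand the structure of YOLO V3. Unlike YOLO v1 and YOLO v2, YOLO V3 borrows the image-pyramid idea and describes an image at three levels of downsampling: 8x, 16x and 32x. This means the input image size must be a multiple of 32, otherwise the 32x downsampling does not divide evenly. If you are not familiar with YOLO v1 and YOLO v2, the article recommended a little further down is a good primer.
[Figure 1: image from the original post]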
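The image above has size (448, 448) and is divided into a (7, 7) grid, so each grid cell covers a (64, 64) region. YOLO v1 predicts two boxes per grid cell. Each box carries its coordinates (4 numbers) and the confidence that it contains an object (1 number); in addition, each grid cell carries one set of class probabilities (one number per class, so 20 for the VOC data). Describing a whole image therefore takes a matrix of shape (7, 7, (4+1)*2+20) = (7, 7, 30), as illustrated below:
[Figure 2: image from the original post]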
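The YOLO v1 arithmetic above can be checked with a few lines of Python. This is only a sketch of the bookkeeping; the function name and its parameters are made up for illustration:

```python
def yolo_v1_output_shape(image_size=448, grid=7, boxes_per_cell=2, num_classes=20):
    """Return (pixels covered by one grid cell, shape of the output tensor)."""
    cell_pixels = image_size // grid
    # Each box contributes 4 coordinates + 1 confidence; class scores are per cell.
    depth = boxes_per_cell * (4 + 1) + num_classes
    return cell_pixels, (grid, grid, depth)


cell_pixels, shape = yolo_v1_output_shape()
print(cell_pixels)  # 64
print(shape)        # (7, 7, 30)
```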
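That was YOLO v1. YOLO V3 works on a similar principle, except that it adopts the image-pyramid idea and describes the image at three scales. For example, a (416, 416) input downsampled by 8, 16 and 32 gives the scales [(52, 52), (26, 26), (13, 13)]. In other words, the same image is described in three ways: on the 8x downsampled feature map each grid cell corresponds to an 8x8 pixel region of the original image, on the 16x map to a 16x16 region, and on the 32x map to a 32x32 region (apologies for the repetition).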
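To make the three scales concrete, here is a small sketch that computes the grid sizes for a given input size and rejects sizes that are not a multiple of the largest stride. The names STRIDES and grid_sizes are illustrative, not taken from the repository:

```python
STRIDES = (8, 16, 32)


def grid_sizes(input_size, strides=STRIDES):
    """Return the side length of the feature map at each stride."""
    if input_size % max(strides) != 0:
        raise ValueError("input size must be a multiple of %d" % max(strides))
    return [input_size // s for s in strides]


print(grid_sizes(416))  # [52, 26, 13]
print(grid_sizes(608))  # [76, 38, 19]
```

An input of 400, for instance, would raise ValueError because 400 is not divisible by 32.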

Recommended reading: 搞懂YOLO v1看这篇就够了 ("Understanding YOLO v1: this one article is enough")


self.input_data   = tf.placeholder(dtype=tf.float32, name='input_data') # image pixels

self.label_sbbox  = tf.placeholder(dtype=tf.float32, name='label_sbbox') 
self.label_mbbox  = tf.placeholder(dtype=tf.float32, name='label_mbbox') 
self.label_lbbox  = tf.placeholder(dtype=tf.float32, name='label_lbbox') 

self.true_sbboxes = tf.placeholder(dtype=tf.float32, name='sbboxes')
self.true_mbboxes = tf.placeholder(dtype=tf.float32, name='mbboxes')
self.true_lbboxes = tf.placeholder(dtype=tf.float32, name='lbboxes')

self.trainable     = tf.placeholder(dtype=tf.bool, name='training')
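As a rough guide to what ends up in these placeholders, the sketch below lists the array shapes that the Dataset class in core/dataset.py (walked through later in this post) builds for one batch. It assumes batch size 1, a 416x416 input and the 20 VOC classes; the helper feed_shapes is hypothetical and only does the shape arithmetic:

```python
def feed_shapes(batch_size=1, input_size=416, num_classes=20,
                anchors_per_scale=3, max_boxes=150, strides=(8, 16, 32)):
    """Map each placeholder attribute name to the shape of the array fed into it."""
    shapes = {"input_data": (batch_size, input_size, input_size, 3)}
    # s / m / l correspond to strides 8 / 16 / 32 (small, medium, large objects).
    for prefix, stride in zip("sml", strides):
        grid = input_size // stride
        shapes["label_%sbbox" % prefix] = (
            batch_size, grid, grid, anchors_per_scale, 5 + num_classes)
        shapes["true_%sbboxes" % prefix] = (batch_size, max_boxes, 4)
    return shapes


shapes = feed_shapes()
for name, shape in shapes.items():
    print(name, shape)
```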


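A placeholder is just an empty slot, so where does its data actually come from? Starting from pbar = tqdm(self.trainset) in train.py and tracing the calls, we find that the data preprocessing is done in core/dataset.py. The annotated code is as follows: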

#! /usr/bin/env python
# coding=utf-8
#   Copyright (C) 2019 * Ltd. All rights reserved.
#   Editor      : VIM
#   File name   : dataset.py
#   Author      : YunYang1994
#   Created date: 2019-03-15 18:05:03
#   Description :

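import os
import sys
import cv2
import random
import numpy as np
import tensorflow as tf
import core.utils as utils
from core.config import cfg

# Newer NumPy versions reject np.nan as a threshold; sys.maxsize prints arrays in full
np.set_printoptions(suppress=True, threshold=sys.maxsize)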

class Dataset(object):
    """implement Dataset here"""
    def __init__(self, dataset_type):

        # Path of the annotation file, e.g. "./data/dataset/voc_train.txt" for training
        # or "./data/dataset/voc_test.txt" for testing
        self.annot_path  = cfg.TRAIN.ANNOT_PATH if dataset_type == 'train' else cfg.TEST.ANNOT_PATH

        # Input image sizes. To make the network more robust, one of
        # [320, 352, 384, 416, 448, 480, 512, 544, 576, 608] is picked at random; every size must be a multiple of 32
        self.input_sizes = cfg.TRAIN.INPUT_SIZE if dataset_type == 'train' else cfg.TEST.INPUT_SIZE

        # BATCH_SIZE; set to 1 here because my machine is not powerful
        self.batch_size  = cfg.TRAIN.BATCH_SIZE if dataset_type == 'train' else cfg.TEST.BATCH_SIZE

        # Whether data augmentation is enabled; True here
        self.data_aug    = cfg.TRAIN.DATA_AUG   if dataset_type == 'train' else cfg.TEST.DATA_AUG

        # Input sizes used for training
        self.train_input_sizes = cfg.TRAIN.INPUT_SIZE

        # The 3 downsampling strides: [8, 16, 32]
        self.strides = np.array(cfg.YOLO.STRIDES)

        # Class names; the VOC data has 20 of them, read from "./data/classes/voc.names"
        self.classes = utils.read_class_names(cfg.YOLO.CLASSES)

        # Number of classes, 20 for VOC
        self.num_classes = len(self.classes)

        # Anchors from "./data/anchors/basline_anchors.txt", a file generated by docs/Box-Clustering.ipynb
        self.anchors = np.array(utils.get_anchors(cfg.YOLO.ANCHORS))

        # Number of boxes predicted for each grid cell; 3 here
        self.anchor_per_scale = cfg.YOLO.ANCHOR_PER_SCALE

        # Maximum number of boxes kept per scale for one image
        self.max_bbox_per_scale = 150

        # Load the annotation lines, e.g. the contents of "./data/dataset/voc_train.txt" for training
        self.annotations = self.load_annotations(dataset_type)

        # Total number of samples
        self.num_samples = len(self.annotations)

        # Number of batches that make up one epoch
        self.num_batchs = int(np.ceil(self.num_samples / self.batch_size))

        # Batch counter; reaching num_batchs means one epoch has been completed
        self.batch_count = 0

    def load_annotations(self, dataset_type):
        """Read the annotation file and keep only the lines that contain at least one box."""
        with open(self.annot_path, 'r') as f:
            txt = f.readlines()
            annotations = [line.strip() for line in txt if len(line.strip().split()[1:]) != 0]
        return annotations

    def __iter__(self):
        return self

    def __next__(self):
        with tf.device('/cpu:0'):

            # Pick a random size from [320, 352, 384, 416, 448, 480, 512, 544, 576, 608].
            # For ease of explanation, assume [416, 416] is picked every time; in practice it varies
            self.train_input_size = random.choice(self.train_input_sizes)

            # Sizes of the 3 output feature maps after [8, 16, 32] downsampling, i.e. [52, 26, 13] for a 416 input
            self.train_output_sizes = self.train_input_size // self.strides

            # Holds the pixels of one batch of images, assuming a [416, 416] input
            batch_image = np.zeros((self.batch_size, self.train_input_size, self.train_input_size, 3))

            # Per-grid-cell labels at each scale: every cell of the 52x52 (26x26, 13x13) grid has one slot
            # per anchor (3 per cell), and each slot stores box xywh, confidence and class probabilities
            batch_label_sbbox = np.zeros((self.batch_size, self.train_output_sizes[0], self.train_output_sizes[0],
                                          self.anchor_per_scale, 5 + self.num_classes))
            batch_label_mbbox = np.zeros((self.batch_size, self.train_output_sizes[1], self.train_output_sizes[1],
                                          self.anchor_per_scale, 5 + self.num_classes))
            batch_label_lbbox = np.zeros((self.batch_size, self.train_output_sizes[2], self.train_output_sizes[2],
                                          self.anchor_per_scale, 5 + self.num_classes))

            # The ground-truth boxes themselves, kept separately from the per-cell labels above:
            # a flat list of up to max_bbox_per_scale boxes (xywh) for each of the three scales
            batch_sbboxes = np.zeros((self.batch_size, self.max_bbox_per_scale, 4))
            batch_mbboxes = np.zeros((self.batch_size, self.max_bbox_per_scale, 4))
            batch_lbboxes = np.zeros((self.batch_size, self.max_bbox_per_scale, 4))

            num = 0  # index of the current image within the batch
            if self.batch_count < self.num_batchs:
                while num < self.batch_size:  # process each image of the batch
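                    # Index of the image in the annotation list
                    index = self.batch_count * self.batch_size + num
                    if index >= self.num_samples: index -= self.num_samples

                    # Annotation of the image: its path, its boxes and the class of each box
                    annotation = self.annotations[index]

                    # Parse into image pixels plus boxes and classes; with augmentation enabled the image
                    # is randomly flipped, cropped and translated to add variety
                    image, bboxes = self.parse_annotation(annotation)

                    # See the detailed comments inside that function.
                    # label_sbbox, label_mbbox, label_lbbox: a description of every grid cell; only the cell
                    # containing a box center is used for prediction and for the loss
                    # sbboxes, mbboxes, lbboxes: center coordinates and width/height of the ground-truth boxes
                    label_sbbox, label_mbbox, label_lbbox, sbboxes, mbboxes, lbboxes = self.preprocess_true_boxes(bboxes)

                    # Add each processed image to the batch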
                    batch_image[num, :, :, :] = image
                    batch_label_sbbox[num, :, :, :, :] = label_sbbox
                    batch_label_mbbox[num, :, :, :, :] = label_mbbox
                    batch_label_lbbox[num, :, :, :, :] = label_lbbox
                    batch_sbboxes[num, :, :] = sbboxes
                    batch_mbboxes[num, :, :] = mbboxes
                    batch_lbboxes[num, :, :] = lbboxes
                    num += 1
                self.batch_count += 1
                return batch_image, batch_label_sbbox, batch_label_mbbox, batch_label_lbbox, \
                       batch_sbboxes, batch_mbboxes, batch_lbboxes
            else:
                # End of the epoch: reset the counter and stop the iteration
                self.batch_count = 0
                raise StopIteration

    def random_horizontal_flip(self, image, bboxes):

        if random.random() < 0.5:
            _, w, _ = image.shape
            image = image[:, ::-1, :]
            bboxes[:, [0,2]] = w - bboxes[:, [2,0]]

        return image, bboxes

    def random_crop(self, image, bboxes):

        if random.random() < 0.5:
            h, w, _ = image.shape
            max_bbox = np.concatenate([np.min(bboxes[:, 0:2], axis=0), np.max(bboxes[:, 2:4], axis=0)], axis=-1)

            max_l_trans = max_bbox[0]
            max_u_trans = max_bbox[1]
            max_r_trans = w - max_bbox[2]
            max_d_trans = h - max_bbox[3]

            crop_xmin = max(0, int(max_bbox[0] - random.uniform(0, max_l_trans)))
            crop_ymin = max(0, int(max_bbox[1] - random.uniform(0, max_u_trans)))
            # Clamp to the image border with min(); max() would always return w / h and never crop this side
            crop_xmax = min(w, int(max_bbox[2] + random.uniform(0, max_r_trans)))
            crop_ymax = min(h, int(max_bbox[3] + random.uniform(0, max_d_trans)))

            image = image[crop_ymin : crop_ymax, crop_xmin : crop_xmax]

            bboxes[:, [0, 2]] = bboxes[:, [0, 2]] - crop_xmin
            bboxes[:, [1, 3]] = bboxes[:, [1, 3]] - crop_ymin

        return image, bboxes

    def random_translate(self, image, bboxes):

        if random.random() < 0.5:
            h, w, _ = image.shape
            max_bbox = np.concatenate([np.min(bboxes[:, 0:2], axis=0), np.max(bboxes[:, 2:4], axis=0)], axis=-1)

            max_l_trans = max_bbox[0]
            max_u_trans = max_bbox[1]
            max_r_trans = w - max_bbox[2]
            max_d_trans = h - max_bbox[3]

            tx = random.uniform(-(max_l_trans - 1), (max_r_trans - 1))
            ty = random.uniform(-(max_u_trans - 1), (max_d_trans - 1))

            M = np.array([[1, 0, tx], [0, 1, ty]])
            image = cv2.warpAffine(image, M, (w, h))

            bboxes[:, [0, 2]] = bboxes[:, [0, 2]] + tx
            bboxes[:, [1, 3]] = bboxes[:, [1, 3]] + ty

        return image, bboxes

    def parse_annotation(self, annotation):

        line = annotation.split()
        image_path = line[0]
        if not os.path.exists(image_path):
            raise KeyError("%s does not exist ... " %image_path)
        image = np.array(cv2.imread(image_path))
        bboxes = np.array([list(map(int, box.split(','))) for box in line[1:]])

        if self.data_aug:
            image, bboxes = self.random_horizontal_flip(np.copy(image), np.copy(bboxes))
            image, bboxes = self.random_crop(np.copy(image), np.copy(bboxes))
            image, bboxes = self.random_translate(np.copy(image), np.copy(bboxes))

        image, bboxes = utils.image_preporcess(np.copy(image), [self.train_input_size, self.train_input_size], np.copy(bboxes))
        return image, bboxes

    def bbox_iou(self, boxes1, boxes2):
        boxes1 = np.array(boxes1)
        boxes2 = np.array(boxes2)

        boxes1_area = boxes1[..., 2] * boxes1[..., 3]
        boxes2_area = boxes2[..., 2] * boxes2[..., 3]

        boxes1 = np.concatenate([boxes1[..., :2] - boxes1[..., 2:] * 0.5,
                                boxes1[..., :2] + boxes1[..., 2:] * 0.5], axis=-1)
        boxes2 = np.concatenate([boxes2[..., :2] - boxes2[..., 2:] * 0.5,
                                boxes2[..., :2] + boxes2[..., 2:] * 0.5], axis=-1)

        left_up = np.maximum(boxes1[..., :2], boxes2[..., :2])
        right_down = np.minimum(boxes1[..., 2:], boxes2[..., 2:])

        inter_section = np.maximum(right_down - left_up, 0.0)
        inter_area = inter_section[..., 0] * inter_section[..., 1]
        union_area = boxes1_area + boxes2_area - inter_area

        return inter_area / union_area

    def preprocess_true_boxes(self, bboxes):
        # Labels for the three feature maps (8x, 16x, 32x downsampling). For a 416x416 input the shapes are
        # [(52,52,3,5+20),(26,26,3,5+20),(13,13,3,5+20)]
        label = [np.zeros((self.train_output_sizes[i], self.train_output_sizes[i], self.anchor_per_scale,
                           5 + self.num_classes)) for i in range(3)]

        # [(150,4),(150,4),(150,4)]: the ground-truth boxes, at most 150 per scale for one image
        bboxes_xywh = [np.zeros((self.max_bbox_per_scale, 4)) for _ in range(3)]

        # Number of boxes assigned to each scale so far
        bbox_count = np.zeros((3,))

        for bbox in bboxes:  # process each box
            bbox_coor = bbox[:4]  # x_min, y_min, x_max, y_max
            bbox_class_ind = bbox[4]  # class index

            # Convert to one-hot encoding (np.float was removed from recent NumPy, so np.float64 is used)
            onehot = np.zeros(self.num_classes, dtype=np.float64)
            onehot[bbox_class_ind] = 1.0

            # Smooth the one-hot probabilities (label smoothing)
            uniform_distribution = np.full(self.num_classes, 1.0 / self.num_classes)
            deta = 0.01
            smooth_onehot = onehot * (1 - deta) + deta * uniform_distribution

            # Center point and width/height of the box in input-image coordinates
            bbox_xywh = np.concatenate([(bbox_coor[2:] + bbox_coor[:2]) * 0.5, bbox_coor[2:] - bbox_coor[:2]], axis=-1)

            # Add an axis to bbox_xywh and divide by the strides 8, 16, 32:
            # bbox_xywh_scaled has shape (3, 4), the same box expressed in grid units of each scale
            bbox_xywh_scaled = 1.0 * bbox_xywh[np.newaxis, :] / self.strides[:, np.newaxis]

            # Collects the IoUs of the three scales
            iou = []

            # Becomes True once this box matches at least one anchor (IoU > 0.3) on any scale
            exist_positive = False
            for i in range(3):  # handle the 8x, 16x and 32x scales in turn
                # (3, 4): each grid cell makes 3 predictions, one per anchor
                anchors_xywh = np.zeros((self.anchor_per_scale, 4))

                # All three anchors share the same center: the middle of the cell containing the box center
                anchors_xywh[:, 0:2] = np.floor(bbox_xywh_scaled[i, 0:2]).astype(np.int32) + 0.5

                # Assign the anchor widths/heights of this scale, from "./data/anchors/basline_anchors.txt":
                # 1.25,  1.625,     2.0,   3.75,       4.125,    2.875,         i=0   8x downsampling
                # 1.875, 3.8125,    3.875, 2.8125,     3.6875,   7.4375,        i=1   16x downsampling
                # 3.625, 2.8125,    4.875, 6.1875,     11.65625, 10.1875        i=2   32x downsampling
                anchors_xywh[:, 2:4] = self.anchors[i]

                # np.newaxis adds a leading axis, turning the (4,) box into shape (1, 4).
                # This computes the IoU between the ground-truth box and each of the 3 anchors
                iou_scale = self.bbox_iou(bbox_xywh_scaled[i][np.newaxis, :], anchors_xywh)

                iou.append(iou_scale)  # record the IoUs of this scale

                iou_mask = iou_scale > 0.3
                if np.any(iou_mask):  # any anchor with IoU above 0.3 makes this a positive sample
                    # Grid cell that contains the box center
                    xind, yind = np.floor(bbox_xywh_scaled[i, 0:2]).astype(np.int32)
                    # Clear the feature vector of that cell for the matching anchors
                    label[i][yind, xind, iou_mask, :] = 0
                    # Store the ground-truth xywh
                    label[i][yind, xind, iou_mask, 0:4] = bbox_xywh

                    # Set the objectness confidence to 1
                    label[i][yind, xind, iou_mask, 4:5] = 1.0

                    # Store the (smoothed) probability of each class
                    label[i][yind, xind, iou_mask, 5:] = smooth_onehot

                    # Slot for this box in the box list of the current scale
                    bbox_ind = int(bbox_count[i] % self.max_bbox_per_scale)

                    # Store the ground-truth bbox_xywh
                    bboxes_xywh[i][bbox_ind, :4] = bbox_xywh

                    # Increment the box count of this scale
                    bbox_count[i] += 1

                    # Mark the box as matched
                    exist_positive = True

            # If no anchor on any scale exceeded the IoU threshold
            if not exist_positive:
                # Index of the anchor with the best IoU across all scales
                best_anchor_ind = np.argmax(np.array(iou).reshape(-1), axis=-1)

                best_detect = int(best_anchor_ind / self.anchor_per_scale)
                best_anchor = int(best_anchor_ind % self.anchor_per_scale)

                # Grid cell that contains the box center
                xind, yind = np.floor(bbox_xywh_scaled[best_detect, 0:2]).astype(np.int32)

                # Assign the box to the best anchor on the best scale
                label[best_detect][yind, xind, best_anchor, :] = 0
                label[best_detect][yind, xind, best_anchor, 0:4] = bbox_xywh
                label[best_detect][yind, xind, best_anchor, 4:5] = 1.0
                label[best_detect][yind, xind, best_anchor, 5:] = smooth_onehot

                bbox_ind = int(bbox_count[best_detect] % self.max_bbox_per_scale)
                bboxes_xywh[best_detect][bbox_ind, :4] = bbox_xywh
                bbox_count[best_detect] += 1
        # Per-grid-cell labels built from the anchor matching
        label_sbbox, label_mbbox, label_lbbox = label
        # Ground-truth boxes
        sbboxes, mbboxes, lbboxes = bboxes_xywh
        return label_sbbox, label_mbbox, label_lbbox, sbboxes, mbboxes, lbboxes

    def __len__(self):
        return self.num_batchs
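The anchor matching in preprocess_true_boxes relies on bbox_iou, which takes boxes as (center_x, center_y, width, height). As a plain-Python sketch of the same computation for a single pair of boxes (the name iou_xywh is made up here; the listing above works on NumPy arrays instead):

```python
def iou_xywh(box1, box2):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    def corners(box):
        cx, cy, w, h = box
        return cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0

    x1a, y1a, x2a, y2a = corners(box1)
    x1b, y1b, x2b, y2b = corners(box2)

    # Width and height of the overlap, clipped at zero when the boxes do not touch.
    inter_w = max(min(x2a, x2b) - max(x1a, x1b), 0.0)
    inter_h = max(min(y2a, y2b) - max(y1a, y1b), 0.0)
    inter_area = inter_w * inter_h

    union_area = box1[2] * box1[3] + box2[2] * box2[3] - inter_area
    return inter_area / union_area


same = iou_xywh((0.5, 0.5, 1.0, 1.0), (0.5, 0.5, 1.0, 1.0))
half = iou_xywh((0.5, 0.5, 1.0, 1.0), (1.0, 0.5, 1.0, 1.0))
apart = iou_xywh((0.5, 0.5, 1.0, 1.0), (5.0, 5.0, 1.0, 1.0))
print(same, half, apart)
```

Two unit squares shifted by half a width overlap in an area of 0.5 and have a union of 1.5, so their IoU is one third, just above the 0.3 threshold used in the listing.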

From the annotated source we can tell what label_sbbox, label_mbbox, label_lbbox, sbboxes, mbboxes and lbboxes each are:

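# For each of the 3 scales, the box information of every grid cell of the image (only the cell containing a box center is labeled and predicted)
label_sbbox, label_mbbox, label_lbbox 

sbboxes, mbboxes, lbboxes # for each of the 3 scales, the positions of the ground-truth boxes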
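As a worked example of how one ground-truth box turns into label entries, take the first box of the sample annotation line (263,211,324,339, class 8) and assume, purely for illustration, that these coordinates are already in the 416x416 input frame. The sketch below finds the grid cell holding the box center at each stride and builds the smoothed one-hot class vector with the same 0.01 smoothing factor as the listing; the function names are hypothetical:

```python
import math


def center_cells(box, strides=(8, 16, 32)):
    """Return the (x_index, y_index) of the cell containing the box center, per stride."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    return [(math.floor(cx / s), math.floor(cy / s)) for s in strides]


def smooth_onehot(class_index, num_classes=20, delta=0.01):
    """One-hot vector blended with a uniform distribution (label smoothing)."""
    uniform = 1.0 / num_classes
    return [(1.0 if i == class_index else 0.0) * (1 - delta) + delta * uniform
            for i in range(num_classes)]


cells = center_cells((263, 211, 324, 339))
probs = smooth_onehot(8)
print(cells)  # [(36, 34), (18, 17), (9, 8)]
print(probs[8], probs[0])
```

Note that the label arrays in the listing are indexed as [yind, xind], so the cell (36, 34) at stride 8 is written to row 34, column 36.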



YOLO v3 Network Structure Analysis
[Figure 3: image from the original post]

# 3 means 3 boxes are predicted for each cell
# 4 means the xywh of each predicted box, and 1 means its confidence
# 20 is the number of classes
[52,52,3*((4+1)+20)] = [52,52,75]
[26,26,3*((4+1)+20)] = [26,26,75]
[13,13,3*((4+1)+20)] = [13,13,75]
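The channel count above can be reproduced, and one cell's 75-value vector split back into its three boxes, with the sketch below. It assumes the channels are laid out anchor by anchor as [x, y, w, h, confidence, class scores], which matches the label layout built in core/dataset.py; the function names are illustrative only:

```python
def head_channels(num_classes=20, anchors_per_cell=3):
    """Number of output channels per grid cell."""
    return anchors_per_cell * (4 + 1 + num_classes)


def split_cell_vector(vector, anchors_per_cell=3):
    """Split one cell's channel vector into per-anchor fields (assumed layout)."""
    per_anchor = len(vector) // anchors_per_cell
    boxes = []
    for a in range(anchors_per_cell):
        chunk = vector[a * per_anchor:(a + 1) * per_anchor]
        boxes.append({
            "xywh": chunk[:4],
            "confidence": chunk[4],
            "class_scores": chunk[5:],
        })
    return boxes


print(head_channels())      # 75
print(head_channels(80))    # 255

# Use the channel indices themselves as dummy values to show where each field sits.
parts = split_cell_vector(list(range(head_channels())))
print(parts[1]["xywh"], parts[1]["confidence"])  # [25, 26, 27, 28] 29
```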

In the next section we will go through the loss function of YOLO V3; the loss function is at the core of any network.
