YOLO Series Notes -- V1

Reference:
https://blog.csdn.net/qq_38232598/article/details/88695454

Core Idea of YOLO

· End-to-end: object localization and class prediction are cast as a single regression problem and handled by one neural network
· The bounding box locations and class probabilities are regressed directly at the output layer -- fast!
1. The image is divided into an S×S grid. If an object's center (determined from its ground-truth box) falls inside a cell, that cell is responsible for predicting the object.
2. Besides its own coordinates, each predicted bounding box carries a confidence value, defined as the product of the probability that the box contains an object and how accurate the predicted box is: confidence = Pr(Object) × IOU_pred^truth.
3. Each grid cell also predicts, for the 20 classes, the conditional probability Pr(Class_i | Object) that a contained object belongs to each class; the B bboxes of the same cell share one set of conditional probabilities. At test time, the conditional probabilities are multiplied by each box's confidence:
Pr(Class_i | Object) × Pr(Object) × IOU_pred^truth = Pr(Class_i) × IOU_pred^truth
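
A small numeric sketch of these definitions (which cell is responsible, and how the class-specific score is formed); all concrete numbers below are made up for illustration:

# Toy illustration of cell responsibility and the class-specific score.
S = 7                       # grid size
img_w, img_h = 448, 448     # network input resolution

# Ground-truth object center in pixels: the cell containing it is responsible.
xc, yc = 200.0, 300.0
col = int(xc * S / img_w)   # -> 3
row = int(yc * S / img_h)   # -> 4
print('responsible cell:', (row, col))

# Confidence of a predicted box: Pr(Object) * IOU with the ground truth.
pr_object, iou = 1.0, 0.6
confidence = pr_object * iou

# Conditional class probability predicted by the cell, e.g. Pr(dog | Object).
pr_dog_given_object = 0.8

# Class-specific confidence score used at test time.
score = pr_dog_given_object * confidence   # = Pr(dog) * IOU = 0.48
print('class-specific score:', score)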

YOLO Network Architecture

YOLO uses 24 cascaded convolutional layers followed by 2 fully connected layers; the 1×1 convolutions reduce the channel dimension of the preceding feature maps. The convolutional layers are first pretrained on ImageNet classification with 224×224 inputs, and the input resolution is then doubled to 448×448 for detection.
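
A minimal plain-numpy sketch (not the slim layers used in the code below) of why a 1×1 convolution reduces channel depth: it is a per-pixel linear map across channels, so it shrinks C without touching H and W. The shapes are illustrative:

import numpy as np

h, w, c_in, c_out = 14, 14, 512, 256
feature_map = np.random.randn(h, w, c_in).astype(np.float32)
kernel_1x1 = np.random.randn(c_in, c_out).astype(np.float32) * 0.01

reduced = feature_map.reshape(-1, c_in) @ kernel_1x1   # (h*w, c_out)
reduced = reduced.reshape(h, w, c_out)
print(reduced.shape)   # (14, 14, 256)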

The five values predicted for each bounding box are: the box center coordinates x and y, the box width and height, and a confidence value.
The output tensor therefore has shape S×S×(5×B+C). (x, y) is the offset of the box center relative to the bounds of its grid cell, normalized to the range 0-1; w and h are the width and height relative to the whole image, so all four coordinates lie in 0-1. The fifth value, c, is the confidence that the bounding box contains an object.
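
A quick sketch of that tensor size and the conventional split of the flat output vector; this mirrors the boundary1/boundary2 split used in yolo_net.py below:

S, B, C = 7, 2, 20
output_size = S * S * (B * 5 + C)          # 7*7*30 = 1470 values per image

# The flat 1470-vector is split into three consecutive parts
# (exactly the boundary1 / boundary2 split in yolo_net.py below):
boundary1 = S * S * C                      # 980 class probabilities
boundary2 = boundary1 + S * S * B          # 980 + 98 = 1078, box confidences
# the remaining 1470 - 1078 = 392 values are the B*4 box coordinates per cell
print(output_size, boundary1, boundary2)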

Normalization

Assume the image is divided into a 7×7 grid and has width w_i and height h_i.
1. x, y -- the offset of the bbox center relative to its grid cell
Suppose the responsible cell has grid coordinates (X_col = 1, Y_row = 4) and the predicted bounding box has center (x_c, y_c). The normalized (x, y) is the offset of that center relative to the cell:
x = x_c · S / w_i − X_col
y = y_c · S / h_i − Y_row
2. w, h -- the bbox size as a fraction of the whole image:
w = w_box / w_i,  h = h_box / h_i
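
A short sketch of encoding a ground-truth box into this parameterization, assuming a 448×448 input and S = 7 (it mirrors how load_pascal_annotation locates the responsible cell and how loss_layer converts the label to cell-relative offsets); the pixel values are illustrative:

S, img_size = 7, 448.0
xc, yc, bw, bh = 100.0, 260.0, 80.0, 120.0   # box center and size in pixels

col = int(xc * S / img_size)                 # responsible cell column -> 1
row = int(yc * S / img_size)                 # responsible cell row    -> 4

x = xc * S / img_size - col                  # center offset within the cell, in [0, 1)
y = yc * S / img_size - row
w = bw / img_size                            # size relative to the whole image
h = bh / img_size
print(row, col, (x, y, w, h))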

Inference

The five values of each of a cell's two bounding boxes occupy the first ten dimensions of that cell's output vector. Each grid cell is responsible for predicting one object, but it predicts two bounding boxes for it.
The first 10 dimensions of a cell's output encode the bounding box information; the remaining 20 dimensions are class-related: the value in dimension i is the probability that the object in this cell belongs to class_i.

Each grid cell predicts one conditional probability Pr(Class_i | Object) per class. Note that the two bboxes belonging to the same cell share the same set of conditional probabilities; combining them with each box's confidence gives the final per-class value:
Pr(Class_i | Object) × Pr(Object) × IOU_pred^truth = Pr(Class_i) × IOU_pred^truth
Using the formula above, a class-specific confidence score can be computed for every bounding box. Doing this for both boxes of every cell gives 7×7×2 = 98 scores per class (see the sketch below). A threshold is applied to drop low-scoring boxes, and the remaining bboxes go through NMS (Non-Maximum Suppression) to produce the final detections.
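
A sketch of that score computation and the thresholding step (it mirrors interpret_output in test.py below); random numbers stand in for a real network output:

import numpy as np

S, B, C, threshold = 7, 2, 20, 0.2
class_probs = np.random.rand(S, S, C)        # Pr(Class_i | Object) per cell
confidences = np.random.rand(S, S, B)        # Pr(Object) * IOU per box

# score[i, j, b, c] = Pr(Class_c | Object) * confidence of box b in cell (i, j)
scores = class_probs[:, :, None, :] * confidences[:, :, :, None]   # (7, 7, 2, 20)

keep = scores >= threshold                   # low-scoring boxes are filtered out
print(scores.shape, int(keep.sum()), 'box/class pairs above the threshold')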

Selecting the predicted boxes (NMS)

The previous step gives every bbox a class-specific confidence score.
· First, a threshold is applied to discard low-scoring bboxes.
· The remaining scores are sorted in descending order.
· After sorting, the highest-scoring box is marked as bbox_max.
· bbox_max is compared against every other box with a lower but non-zero score (bbox_cur) by measuring their overlap, i.e. computing the IOU. If IOU(bbox_max, bbox_cur) > 0.5, the overlap is considered too large and bbox_cur's score is set to 0.
· The next score is then visited: if it is neither the maximum nor zero, its box is compared against bbox_max with the same IOU test; if the overlap does not exceed the threshold, the score is left unchanged and the next bbox_cur is processed.
· After one pass, a new sequence is obtained. The box with the second-highest remaining score becomes the next bbox_max, and the IOU of every later bbox_cur against this new bbox_max is computed as before. This repeats until the last score has been processed; in the illustrated example the final scores end up as (0.5, 0.2, 0, ..., 0).
· The other classes are handled the same way (a compact sketch of the whole procedure follows below).
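
A minimal NMS sketch over one class's scores, matching the procedure above and the iou()/NMS loop in test.py below; boxes are assumed to be (xc, yc, w, h) and the demo values are made up:

import numpy as np

def iou(b1, b2):
    # boxes are (xc, yc, w, h); intersection over union
    tb = min(b1[0] + b1[2] / 2, b2[0] + b2[2] / 2) - max(b1[0] - b1[2] / 2, b2[0] - b2[2] / 2)
    lr = min(b1[1] + b1[3] / 2, b2[1] + b2[3] / 2) - max(b1[1] - b1[3] / 2, b2[1] - b2[3] / 2)
    inter = 0.0 if tb < 0 or lr < 0 else tb * lr
    return inter / (b1[2] * b1[3] + b2[2] * b2[3] - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4), scores: (N,) for a single class; suppressed scores are set to 0
    order = np.argsort(scores)[::-1]
    boxes, scores = boxes[order], scores[order].copy()
    for i in range(len(scores)):
        if scores[i] == 0:
            continue
        for j in range(i + 1, len(scores)):
            if scores[j] > 0 and iou(boxes[i], boxes[j]) > iou_threshold:
                scores[j] = 0.0          # too much overlap with a higher-scoring box
    return boxes, scores

# two heavily-overlapping boxes: the second one gets suppressed
demo_boxes = np.array([[100., 100., 80., 60.], [102., 98., 82., 58.], [300., 300., 40., 40.]])
demo_scores = np.array([0.9, 0.75, 0.6])
print(nms(demo_boxes, demo_scores)[1])   # -> [0.9, 0.0, 0.6] (highest score first)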

Final filtering of the bounding boxes

After the NMS pass, many scores are 0.
Finally, for the 20 class scores of each bbox:
1. Find the class with the highest score and record its index as class.
2. Record that highest score as score.
3. If score is greater than 0, draw the box on the image and label it with class; otherwise discard the bbox (a short sketch of this step follows).
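
A small sketch of that final selection (one best class per box, kept only if its post-NMS score is still positive); the demo scores are made up:

import numpy as np

def select_detections(scores):
    # scores: (num_boxes, 20) class scores after thresholding + NMS
    detections = []
    for b in range(scores.shape[0]):
        cls = int(np.argmax(scores[b]))      # index of the best class for this box
        score = float(scores[b, cls])
        if score > 0:
            detections.append((b, cls, score))
    return detections

demo = np.zeros((3, 20))
demo[0, 11] = 0.5          # box 0: class 11 with score 0.5
demo[2, 7] = 0.2           # box 2: class 7 with score 0.2; box 1 was zeroed out by NMS
print(select_detections(demo))   # -> [(0, 11, 0.5), (2, 7, 0.2)]
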
· Final result:

Loss Function

loss = λ_coord · Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
     + λ_coord · Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
     + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
     + λ_noobj · Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
     + Σ_{i=0}^{S²} 1_{i}^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²
In the loss, a grid cell's classification loss is only penalized when an object's center falls into that cell, and a bbox's coordinate error is only penalized when that bbox of cell i is responsible for some ground-truth box. Which bbox is responsible for a ground-truth box is decided by the highest IOU.

· Loss parameters explained

1. In practice, localization error and classification error are not equally important. To increase the influence of the localization error on the total loss, a weight λ_coord is introduced; it is set to 5 for Pascal VOC training.
2. Bboxes that contain no object are given a smaller loss weight λ_noobj, set to 0.5 for Pascal VOC. If no object center falls into a cell, its confidence target is 0. The 0.5 weight counteracts the imbalance that would otherwise arise because most cells contain no object.
3. The confidence loss for bboxes that do contain an object, and the classification loss, use a weight of 1. Each cell is designed to contain objects of only one class. If a cell contains an object, we want the predicted probability of the correct class to be as close to 1 as possible and the probabilities of the other classes to be as close to 0 as possible. In the 4th term of the loss, the target p̂_i(c) is 1 when c is the correct class and 0 otherwise.
4. The square roots of w_i and h_i are used because the same absolute deviation matters more for a small box than for a large one; taking the square root increases the penalty on deviations of small boxes relative to large ones (see the numeric check below).
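
Point 4 in numbers: the same absolute width error (10 px) costs much more for a small box than for a large one once the square root is applied; without it, both boxes would be penalized equally. The widths are illustrative:

import math

small_true, small_pred = 10.0, 20.0
large_true, large_pred = 200.0, 210.0

plain_small = (small_pred - small_true) ** 2                       # 100.0
plain_large = (large_pred - large_true) ** 2                       # 100.0 (same penalty)

sqrt_small = (math.sqrt(small_pred) - math.sqrt(small_true)) ** 2  # ~1.72
sqrt_large = (math.sqrt(large_pred) - math.sqrt(large_true)) ** 2  # ~0.12 (much smaller)
print(plain_small, plain_large, sqrt_small, sqrt_large)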

Strengths and Weaknesses of YOLO V1

Strengths

1. Few false positives (FP) -- YOLO sees the entire image as context, so it uses global information and distinguishes objects from background better.
2. End-to-end and fast (the tiny version reaches 155 fps).
3. Generalizes well.

Weaknesses

1. Accuracy is below that of the state-of-the-art detectors.
2. Localization is imprecise; small objects and dense groups are handled poorly (each cell can only predict one object).
3. Errors across very different object sizes remain hard to balance.
4. The detection input size and resolution are fixed.
5. Multiple downsampling layers mean the learned object features are fairly coarse.

Code Walkthrough

pascal_voc.py -- loading the VOC dataset

import os
import xml.etree.ElementTree as ET
import numpy as np
import cv2
import pickle
import copy
import yolo.config as cfg


class pascal_voc(object):
    def __init__(self, phase, rebuild=False):
        #locate the VOC dataset
        self.devkil_path = os.path.join(cfg.PASCAL_PATH, 'VOCdevkit')
        self.data_path = os.path.join(self.devkil_path, 'VOC2007')
        #parameters defined in yolo/config.py
        self.cache_path = cfg.CACHE_PATH    #cache directory
        self.batch_size = cfg.BATCH_SIZE    #45 images per batch
        self.image_size = cfg.IMAGE_SIZE    #input image size -- 448
        self.cell_size = cfg.CELL_SIZE      #grid size per image -- 7*7
        self.classes = cfg.CLASSES          #list of class names
        self.class_to_ind = dict(zip(self.classes, range(len(self.classes))))
        self.flipped = cfg.FLIPPED          #whether to augment with horizontal flips
        self.phase = phase                  #which split to use ('train' or 'test')
        self.rebuild = rebuild              #whether to rebuild the cached label file
        self.cursor = 0
        self.epoch = 1                      #current training epoch
        self.gt_labels = None               #list of per-image label dicts
        self.prepare()                      #build the prepared data

    def get(self):      #generator: fetch one batch of images and their labels
        images = np.zeros(
            (self.batch_size, self.image_size, self.image_size, 3))
        labels = np.zeros(
            (self.batch_size, self.cell_size, self.cell_size, 25))
        count = 0
        while count < self.batch_size:
            imname = self.gt_labels[self.cursor]['imname']
            flipped = self.gt_labels[self.cursor]['flipped']
            images[count, :, :, :] = self.image_read(imname, flipped)
            labels[count, :, :, :] = self.gt_labels[self.cursor]['label']
            count += 1
            self.cursor += 1
            if self.cursor >= len(self.gt_labels):  #after a full pass over gt_labels, shuffle and start over
                np.random.shuffle(self.gt_labels)
                self.cursor = 0
                self.epoch += 1 #next epoch
        return images, labels

    def image_read(self, imname, flipped=False):    #read an image from disk
        image = cv2.imread(imname)
        image = cv2.resize(image, (self.image_size, self.image_size))   #resize to the network input size
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)   #convert BGR to RGB
        image = (image / 255.0) * 2.0 - 1.0 #normalize pixels to (-1, 1) as float
        if flipped:
            image = image[:, ::-1, :] #horizontal flip for augmentation
        return image

    def prepare(self):      #prepare the encoded label data
        gt_labels = self.load_labels()  #extract the info of every image
        if self.flipped:    #if flip augmentation is on, the label coordinates must be flipped as well
            print('Appending horizontally-flipped training examples ...') #horizontal flip
            gt_labels_cp = copy.deepcopy(gt_labels)
            for idx in range(len(gt_labels_cp)):        #iterate over all label dicts
                gt_labels_cp[idx]['flipped'] = True
                gt_labels_cp[idx]['label'] =\
                    gt_labels_cp[idx]['label'][:, ::-1, :]      #flip the label grid horizontally (the cell axis)
                for i in range(self.cell_size):
                    for j in range(self.cell_size):
                        if gt_labels_cp[idx]['label'][i, j, 0] == 1:  #only cells that contain an object
                            gt_labels_cp[idx]['label'][i, j, 1] = \
                                self.image_size - 1 -\
                                gt_labels_cp[idx]['label'][i, j, 1]     #flip xc, the x coordinate of the box center
            gt_labels += gt_labels_cp
        np.random.shuffle(gt_labels)        #shuffle along axis 0
        self.gt_labels = gt_labels
        return gt_labels

    def load_labels(self):      #load the labels => a list of dicts
        cache_file = os.path.join(
            self.cache_path, 'pascal_' + self.phase + '_gt_labels.pkl')

        if os.path.isfile(cache_file) and not self.rebuild:
            print('Loading gt_labels from: ' + cache_file)
            with open(cache_file, 'rb') as f:
                gt_labels = pickle.load(f)      #unpickle the cached label objects
            return gt_labels        #if not rebuilding, return the cache directly

        print('Processing gt_labels from: ' + self.data_path)

        if not os.path.exists(self.cache_path):
            os.makedirs(self.cache_path)

        if self.phase == 'train':
            txtname = os.path.join(
                self.data_path, 'ImageSets', 'Main', 'trainval.txt')
        else:
            txtname = os.path.join(
                self.data_path, 'ImageSets', 'Main', 'test.txt')
        with open(txtname, 'r') as f:
            self.image_index = [x.strip() for x in f.readlines()] #read the image ids; strip() removes leading/trailing whitespace

        #build a label for every image id and store it in gt_labels
        gt_labels = []
        for index in self.image_index:
            label, num = self.load_pascal_annotation(index)
            if num == 0:
                continue
            imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
            gt_labels.append({'imname': imname,
                              'label': label,
                              'flipped': False})        #collect into a list
        print('Saving gt_labels to: ' + cache_file)
        with open(cache_file, 'wb') as f:
            pickle.dump(gt_labels, f)       #pickle the list of label dicts
        return gt_labels

    def load_pascal_annotation(self, index):    #parse the bbox coordinates from the PASCAL VOC xml annotation and fill the label tensor
        """
        Load image and bounding boxes info from XML file in the PASCAL VOC
        format.
        """

        imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
        im = cv2.imread(imname)
        h_ratio = 1.0 * self.image_size / im.shape[0]  #the image is resized, so the coordinates must be rescaled accordingly
        w_ratio = 1.0 * self.image_size / im.shape[1]
        # im = cv2.resize(im, [self.image_size, self.image_size])

        label = np.zeros((self.cell_size, self.cell_size, 25))
        filename = os.path.join(self.data_path, 'Annotations', index + '.xml')  #locate the xml annotation file
        tree = ET.parse(filename)
        objs = tree.findall('object')

        for obj in objs:
            bbox = obj.find('bndbox')
            # Make pixel indexes 0-based
            '''
                min() -- clamp the box inside image_size, pulling coordinates beyond the border back in
                max() -- when a coordinate is below 0, pull it back inside the image
                -1 -- make pixel indexes 0-based, like array indexes
            '''
            x1 = max(min((float(bbox.find('xmin').text) - 1) * w_ratio, self.image_size - 1), 0)
            y1 = max(min((float(bbox.find('ymin').text) - 1) * h_ratio, self.image_size - 1), 0)
            x2 = max(min((float(bbox.find('xmax').text) - 1) * w_ratio, self.image_size - 1), 0)
            y2 = max(min((float(bbox.find('ymax').text) - 1) * h_ratio, self.image_size - 1), 0)
            cls_ind = self.class_to_ind[obj.find('name').text.lower().strip()] #map each class name to a numeric index
            boxes = [(x2 + x1) / 2.0, (y2 + y1) / 2.0, x2 - x1, y2 - y1] #(x.center, y.center, width, height)
            x_ind = int(boxes[0] * self.cell_size / self.image_size) #locate the responsible cell in the 7*7 grid
            y_ind = int(boxes[1] * self.cell_size / self.image_size)
            #(confidence, x, y, width, height, one-hot class)
            if label[y_ind, x_ind, 0] == 1:     #if several objects share a cell, the cell only keeps the first one
                continue
            label[y_ind, x_ind, 0] = 1      #confidence
            label[y_ind, x_ind, 1:5] = boxes
            label[y_ind, x_ind, 5 + cls_ind] = 1

        return label, len(objs)  #return the label and the number of annotated objects
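
A minimal usage sketch of this class (train.py below consumes it the same way); it assumes the paths in yolo/config.py point at an existing VOCdevkit/VOC2007 directory:

from utils.pascal_voc import pascal_voc

data = pascal_voc('train')            # parses the annotations and builds gt_labels
images, labels = data.get()           # one batch: (45, 448, 448, 3) and (45, 7, 7, 25)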

yolo_net.py -- YOLO network definition + loss function

import numpy as np
import tensorflow as tf
import yolo.config as cfg

slim = tf.contrib.slim


class YOLONet(object):

    def __init__(self, is_training=True):
        self.classes = cfg.CLASSES  #class names
        self.num_class = len(self.classes)
        self.image_size = cfg.IMAGE_SIZE
        self.cell_size = cfg.CELL_SIZE
        self.boxes_per_cell = cfg.BOXES_PER_CELL
        self.output_size = (self.cell_size * self.cell_size) *\
            (self.num_class + self.boxes_per_cell * 5)
        self.scale = 1.0 * self.image_size / self.cell_size
        self.boundary1 = self.cell_size * self.cell_size * self.num_class
        self.boundary2 = self.boundary1 +\
            self.cell_size * self.cell_size * self.boxes_per_cell

        self.object_scale = cfg.OBJECT_SCALE
        self.noobject_scale = cfg.NOOBJECT_SCALE
        self.class_scale = cfg.CLASS_SCALE
        self.coord_scale = cfg.COORD_SCALE

        self.learning_rate = cfg.LEARNING_RATE
        self.batch_size = cfg.BATCH_SIZE
        self.alpha = cfg.ALPHA

        self.offset = np.transpose(np.reshape(np.array(
            [np.arange(self.cell_size)] * self.cell_size * self.boxes_per_cell),
            (self.boxes_per_cell, self.cell_size, self.cell_size)), (1, 2, 0))  #offset -- per-cell column indices, used to map cell-relative coordinates back onto the grid

        self.images = tf.placeholder(
            tf.float32, [None, self.image_size, self.image_size, 3],
            name='images')
        self.logits = self.build_network(
            self.images, num_outputs=self.output_size, alpha=self.alpha,
            is_training=is_training)

        if is_training:
            self.labels = tf.placeholder(
                tf.float32,
                [None, self.cell_size, self.cell_size, 5 + self.num_class])
            self.loss_layer(self.logits, self.labels)
            self.total_loss = tf.losses.get_total_loss()
            tf.summary.scalar('total_loss', self.total_loss)

    #define the yolo network architecture
    def build_network(self,
                      images,
                      num_outputs,
                      alpha,
                      keep_prob=0.5,
                      is_training=True,
                      scope='yolo'):
        with tf.variable_scope(scope):
            with slim.arg_scope(
                #layers these arg_scope defaults apply to
                [slim.conv2d, slim.fully_connected],
                activation_fn=leaky_relu(alpha),
                weights_regularizer=slim.l2_regularizer(0.0005),
                weights_initializer=tf.truncated_normal_initializer(0.0, 0.01)
            ):
                net = tf.pad(       #pad the input so that the final feature map is 7*7
                    images, np.array([[0, 0], [3, 3], [3, 3], [0, 0]]),
                    name='pad_1')
                net = slim.conv2d(
                    net, 64, 7, 2, padding='VALID', scope='conv_2')  #7*7*64-s-2
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_3')
                net = slim.conv2d(net, 192, 3, scope='conv_4')  #3*3*192
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_5')
                net = slim.conv2d(net, 128, 1, scope='conv_6')
                net = slim.conv2d(net, 256, 3, scope='conv_7')
                net = slim.conv2d(net, 256, 1, scope='conv_8')
                net = slim.conv2d(net, 512, 3, scope='conv_9')
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_10')
                net = slim.conv2d(net, 256, 1, scope='conv_11')
                net = slim.conv2d(net, 512, 3, scope='conv_12')
                net = slim.conv2d(net, 256, 1, scope='conv_13')
                net = slim.conv2d(net, 512, 3, scope='conv_14')
                net = slim.conv2d(net, 256, 1, scope='conv_15')
                net = slim.conv2d(net, 512, 3, scope='conv_16')
                net = slim.conv2d(net, 256, 1, scope='conv_17')
                net = slim.conv2d(net, 512, 3, scope='conv_18')
                net = slim.conv2d(net, 512, 1, scope='conv_19')
                net = slim.conv2d(net, 1024, 3, scope='conv_20')
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_21')
                net = slim.conv2d(net, 512, 1, scope='conv_22')
                net = slim.conv2d(net, 1024, 3, scope='conv_23')
                net = slim.conv2d(net, 512, 1, scope='conv_24')
                net = slim.conv2d(net, 1024, 3, scope='conv_25')
                net = slim.conv2d(net, 1024, 3, scope='conv_26')
                net = tf.pad(
                    net, np.array([[0, 0], [1, 1], [1, 1], [0, 0]]),
                    name='pad_27')
                net = slim.conv2d(
                    net, 1024, 3, 2, padding='VALID', scope='conv_28')
                net = slim.conv2d(net, 1024, 3, scope='conv_29')
                net = slim.conv2d(net, 1024, 3, scope='conv_30')
                net = tf.transpose(net, [0, 3, 1, 2], name='trans_31')  #transpose to [batch, channels, height, width]
                net = slim.flatten(net, scope='flat_32')    #flatten, keeping the batch dimension (axis 0)
                net = slim.fully_connected(net, 512, scope='fc_33')
                net = slim.fully_connected(net, 4096, scope='fc_34')
                net = slim.dropout(
                    net, keep_prob=keep_prob, is_training=is_training,
                    scope='dropout_35')
                net = slim.fully_connected(
                    net, num_outputs, activation_fn=None, scope='fc_36')    #the last layer uses a linear activation
        return net      #7*7*30 = 1470 output values per image, matching the cell x cell x 30 prediction layout

    def calc_iou(self, boxes1, boxes2, scope='iou'):    #compute IOU
                                                        #first converts center/size boxes to top-left and bottom-right corner coordinates
        """calculate ious
        Args:
          boxes1: 5-D tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL, 4]  ====> (x_center, y_center, w, h)
          boxes2: 5-D tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL, 4] ===> (x_center, y_center, w, h)
        Return:
          iou: 4-D tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
        """
        with tf.variable_scope(scope):
            # transform (x_center, y_center, w, h) to (x1, y1, x2, y2); 1 -- top-left corner, 2 -- bottom-right corner
            boxes1_t = tf.stack([boxes1[..., 0] - boxes1[..., 2] / 2.0,
                                 boxes1[..., 1] - boxes1[..., 3] / 2.0,
                                 boxes1[..., 0] + boxes1[..., 2] / 2.0,
                                 boxes1[..., 1] + boxes1[..., 3] / 2.0],
                                axis=-1)

            boxes2_t = tf.stack([boxes2[..., 0] - boxes2[..., 2] / 2.0,
                                 boxes2[..., 1] - boxes2[..., 3] / 2.0,
                                 boxes2[..., 0] + boxes2[..., 2] / 2.0,
                                 boxes2[..., 1] + boxes2[..., 3] / 2.0],
                                axis=-1)

            # calculate the left up point & right down point  --  the intersection rectangle
            lu = tf.maximum(boxes1_t[..., :2], boxes2_t[..., :2])   #take the larger of the two top-left corners
            rd = tf.minimum(boxes1_t[..., 2:], boxes2_t[..., 2:])   #take the smaller of the two bottom-right corners

            # intersection  --  size of the overlapping region
            intersection = tf.maximum(0.0, rd - lu)     #width and height of the intersection
            inter_square = intersection[..., 0] * intersection[..., 1]  #area of the intersection

            # calculate the boxs1 square and boxs2 square
            square1 = boxes1[..., 2] * boxes1[..., 3]
            square2 = boxes2[..., 2] * boxes2[..., 3]

            union_square = tf.maximum(square1 + square2 - inter_square, 1e-10) #union area: sum of the two box areas minus the overlap counted once

        return tf.clip_by_value(inter_square / union_square, 0.0, 1.0)  #intersection over union, clipped to [0, 1]

    def loss_layer(self, predicts, labels, scope='loss_layer'):
        with tf.variable_scope(scope):
            predict_classes = tf.reshape(
                predicts[:, :self.boundary1],
                [self.batch_size, self.cell_size, self.cell_size, self.num_class])  #predicted class probabilities
            predict_scales = tf.reshape(
                predicts[:, self.boundary1:self.boundary2],
                [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell])     #predicted confidences
            predict_boxes = tf.reshape(
                predicts[:, self.boundary2:],
                [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell, 4])  #predicted box coordinates, 2 boxes per cell so 8 values

            #take the ground-truth values, to compute the loss against
            response = tf.reshape(                              #response -- the confidence part of the label, i.e. whether the cell contains a box
                labels[..., 0],
                [self.batch_size, self.cell_size, self.cell_size, 1])
            boxes = tf.reshape(
                labels[..., 1:5],
                [self.batch_size, self.cell_size, self.cell_size, 1, 4])    #add a box axis; the label box sits in dims 1:5
            # tf.tile copies boxes along axis 3 (twice, because each cell has two boxes)
            boxes = tf.tile(
                boxes, [1, 1, 1, self.boxes_per_cell, 1]) / self.image_size #normalize xc, yc, w, h by image_size
                                                                            # so they match the scale of predict_boxes
            classes = labels[..., 5:]   #the one-hot class part of the label (after dim 5)

            offset = tf.reshape(        #offset aligns each cell's coordinates with its grid position
                tf.constant(self.offset, dtype=tf.float32),
                [1, self.cell_size, self.cell_size, self.boxes_per_cell])   #offset shape: 1, cell, cell, per_cell; values range 0-6
            offset = tf.tile(offset, [self.batch_size, 1, 1, 1])    #tile to batch_size, one copy per image
            offset_tran = tf.transpose(offset, (0, 2, 1, 3))    #transposed offset, used for the y coordinate
            predict_boxes_tran = tf.stack(  #stack back into box tensors
                [(predict_boxes[..., 0] + offset) / self.cell_size,
                 (predict_boxes[..., 1] + offset_tran) / self.cell_size,
                 tf.square(predict_boxes[..., 2]),                  #xc = (x_rel + x_col) / S; w, h are squared back
                 tf.square(predict_boxes[..., 3])], axis=-1)        #predicted boxes in image-relative coordinates

            iou_predict_truth = self.calc_iou(predict_boxes_tran, boxes)  #iou_predict_truth: [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]

            # calculate I tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]: the boxes responsible for an object
            object_mask = tf.reduce_max(iou_predict_truth, 3, keep_dims=True)   #the larger IOU of the two boxes in each cell
            object_mask = tf.cast(      #keep only the box whose IOU equals the per-cell maximum, then mask by response
                (iou_predict_truth >= object_mask), tf.float32) * response

            # calculate no_I tensor [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]: the no-object region
            noobject_mask = tf.ones_like(
                object_mask, dtype=tf.float32) - object_mask    #all ones minus the object mask gives the boxes without objects

            boxes_tran = tf.stack(      #ground-truth boxes in the same cell-relative parameterization as the predictions
                [boxes[..., 0] * self.cell_size - offset,   #x = xc*S - x_col => cell-relative offset (xc was normalized by image_size above)
                 boxes[..., 1] * self.cell_size - offset_tran,  #same for y; w, h were already normalized
                 tf.sqrt(boxes[..., 2]),
                 tf.sqrt(boxes[..., 3])], axis=-1)  #the loss uses the square roots of w and h

            #compute the losses
            # class_loss
            class_delta = response * (predict_classes - classes) #cell response * class difference
            class_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(class_delta), axis=[1, 2, 3]),
                name='class_loss') * self.class_scale

            # object_loss
            object_delta = object_mask * (predict_scales - iou_predict_truth)
            object_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(object_delta), axis=[1, 2, 3]),
                name='object_loss') * self.object_scale

            # noobject_loss
            noobject_delta = noobject_mask * predict_scales
            noobject_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(noobject_delta), axis=[1, 2, 3]),
                name='noobject_loss') * self.noobject_scale

            # coord_loss
            coord_mask = tf.expand_dims(object_mask, 4)
            boxes_delta = coord_mask * (predict_boxes - boxes_tran)
            coord_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(boxes_delta), axis=[1, 2, 3, 4]),
                name='coord_loss') * self.coord_scale

            tf.losses.add_loss(class_loss)
            tf.losses.add_loss(object_loss)
            tf.losses.add_loss(noobject_loss)
            tf.losses.add_loss(coord_loss)

            tf.summary.scalar('class_loss', class_loss)
            tf.summary.scalar('object_loss', object_loss)
            tf.summary.scalar('noobject_loss', noobject_loss)
            tf.summary.scalar('coord_loss', coord_loss)

            tf.summary.histogram('boxes_delta_x', boxes_delta[..., 0])
            tf.summary.histogram('boxes_delta_y', boxes_delta[..., 1])
            tf.summary.histogram('boxes_delta_w', boxes_delta[..., 2])
            tf.summary.histogram('boxes_delta_h', boxes_delta[..., 3])
            tf.summary.histogram('iou', iou_predict_truth)


def leaky_relu(alpha):
    def op(inputs):
        return tf.nn.leaky_relu(inputs, alpha=alpha, name='leaky_relu')
    return op

train.py -- training setup

import os
import argparse
import datetime
import tensorflow as tf
import yolo.config as cfg
from yolo.yolo_net import YOLONet
from utils.timer import Timer
from utils.pascal_voc import pascal_voc

slim = tf.contrib.slim


class Solver(object):

    def __init__(self, net, data):
        self.net = net
        self.data = data
        self.weights_file = cfg.WEIGHTS_FILE
        self.max_iter = cfg.MAX_ITER
        self.initial_learning_rate = cfg.LEARNING_RATE
        self.decay_steps = cfg.DECAY_STEPS
        self.decay_rate = cfg.DECAY_RATE
        self.staircase = cfg.STAIRCASE
        self.summary_iter = cfg.SUMMARY_ITER
        self.save_iter = cfg.SAVE_ITER
        self.output_dir = os.path.join(
            cfg.OUTPUT_DIR, datetime.datetime.now().strftime('%Y_%m_%d_%H_%M'))
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)
        self.save_cfg()

        #write the graph and summaries for tensorboard
        self.variable_to_restore = tf.global_variables()    #list of all variables
        self.saver = tf.train.Saver(self.variable_to_restore, max_to_keep=None) #saver over all those variables
        self.ckpt_file = os.path.join(self.output_dir, 'yolo')      #checkpoint file prefix
        self.summary_op = tf.summary.merge_all()        #merge all summaries for tensorboard display
        self.writer = tf.summary.FileWriter(self.output_dir, flush_secs=60)     #summary writer

        self.global_step = tf.train.create_global_step()                #global training step, incremented once per train op
        self.learning_rate = tf.train.exponential_decay(
            self.initial_learning_rate, self.global_step, self.decay_steps,
            self.decay_rate, self.staircase, name='learning_rate')      #learning-rate decay driven by global_step and decay_steps
        self.optimizer = tf.train.GradientDescentOptimizer(
            learning_rate=self.learning_rate)
        self.train_op = slim.learning.create_train_op(
            self.net.total_loss, self.optimizer, global_step=self.global_step)#create the training op that minimizes the total loss

        gpu_options = tf.GPUOptions()           #GPU resource options
        config = tf.ConfigProto(gpu_options=gpu_options)    #session configuration with the GPU options
        self.sess = tf.Session(config=config)               #create the session
        self.sess.run(tf.global_variables_initializer())    #initialize all variables

        #restore pretrained weights if provided
        if self.weights_file is not None:
            print('Restoring weights from: ' + self.weights_file)
            self.saver.restore(self.sess, self.weights_file)

        self.writer.add_graph(self.sess.graph)  #add the computation graph to the summary file

    def train(self):

        train_timer = Timer()
        load_timer = Timer()

        for step in range(1, self.max_iter + 1):
            #timers used to profile data loading and training speed
            load_timer.tic()#start timing
            images, labels = self.data.get()
            load_timer.toc()#stop timing
            feed_dict = {self.net.images: images,
                         self.net.labels: labels}  #one step consumes one batch of batch_size (45) images

            if step % self.summary_iter == 0:       #record a summary every summary_iter steps
                if step % (self.summary_iter * 10) == 0:    #print a log line every summary_iter * 10 steps

                    train_timer.tic()
                    summary_str, loss, _ = self.sess.run(
                        [self.summary_op, self.net.total_loss, self.train_op],
                        feed_dict=feed_dict)
                    train_timer.toc()

                    log_str = ('{} Epoch: {}, Step: {}, Learning rate: {}, '
                               'Loss: {:5.3f}\nSpeed: {:.3f}s/iter, '
                               'Load: {:.3f}s/iter, Remain: {}').format(
                        datetime.datetime.now().strftime('%m-%d %H:%M:%S'),
                        self.data.epoch,
                        int(step),
                        round(self.learning_rate.eval(session=self.sess), 6),
                        loss,
                        train_timer.average_time,
                        load_timer.average_time,
                        train_timer.remain(step, self.max_iter))
                    print(log_str)

                else:
                    train_timer.tic()
                    summary_str, _ = self.sess.run(
                        [self.summary_op, self.train_op],
                        feed_dict=feed_dict)
                    train_timer.toc()

                self.writer.add_summary(summary_str, step)

            else:
                train_timer.tic()
                self.sess.run(self.train_op, feed_dict=feed_dict)
                train_timer.toc()

            if step % self.save_iter == 0:      #save a checkpoint every save_iter steps
                print('{} Saving checkpoint file to: {}'.format(
                    datetime.datetime.now().strftime('%m-%d %H:%M:%S'),
                    self.output_dir))
                self.saver.save(
                    self.sess, self.ckpt_file, global_step=self.global_step)

    def save_cfg(self):

        with open(os.path.join(self.output_dir, 'config.txt'), 'w') as f:
            cfg_dict = cfg.__dict__
            for key in sorted(cfg_dict.keys()):
                if key[0].isupper():
                    cfg_str = '{}: {}\n'.format(key, cfg_dict[key])
                    f.write(cfg_str)


def update_config_paths(data_dir, weights_file):
    cfg.DATA_PATH = data_dir
    cfg.PASCAL_PATH = os.path.join(data_dir, 'pascal_voc')
    cfg.CACHE_PATH = os.path.join(cfg.PASCAL_PATH, 'cache')
    cfg.OUTPUT_DIR = os.path.join(cfg.PASCAL_PATH, 'output')
    cfg.WEIGHTS_DIR = os.path.join(cfg.PASCAL_PATH, 'weights')

    cfg.WEIGHTS_FILE = os.path.join(cfg.WEIGHTS_DIR, weights_file)


def main():
    parser = argparse.ArgumentParser()      #command-line interface
    parser.add_argument('--weights', default="YOLO_small.ckpt", type=str)
    parser.add_argument('--data_dir', default="data", type=str)
    parser.add_argument('--threshold', default=0.2, type=float)
    parser.add_argument('--iou_threshold', default=0.5, type=float)
    parser.add_argument('--gpu', default='', type=str)
    args = parser.parse_args()

    if args.gpu is not None:
        cfg.GPU = args.gpu

    if args.data_dir != cfg.DATA_PATH:
        update_config_paths(args.data_dir, args.weights)

    os.environ['CUDA_VISIBLE_DEVICES'] = cfg.GPU

    yolo = YOLONet()
    pascal = pascal_voc('train')

    solver = Solver(yolo, pascal)

    print('Start training ...')
    solver.train()
    print('Done training.')


if __name__ == '__main__':

    # python train.py --weights YOLO_small.ckpt --gpu 0
    main()

test.py -- inference: draw the boxes and assign classes

import os
import cv2
import argparse
import numpy as np
import tensorflow as tf
import yolo.config as cfg
from yolo.yolo_net import YOLONet
from utils.timer import Timer


class Detector(object):

    def __init__(self, net, weight_file):
        self.net = net
        self.weights_file = weight_file

        self.classes = cfg.CLASSES
        self.num_class = len(self.classes)
        self.image_size = cfg.IMAGE_SIZE
        self.cell_size = cfg.CELL_SIZE
        self.boxes_per_cell = cfg.BOXES_PER_CELL
        self.threshold = cfg.THRESHOLD
        self.iou_threshold = cfg.IOU_THRESHOLD
        self.boundary1 = self.cell_size * self.cell_size * self.num_class
        self.boundary2 = self.boundary1 +\
            self.cell_size * self.cell_size * self.boxes_per_cell

        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

        print('Restoring weights from: ' + self.weights_file)
        self.saver = tf.train.Saver()
        self.saver.restore(self.sess, self.weights_file)

    def draw_result(self, img, result): #draw the predicted bboxes
        for i in range(len(result)):
            x = int(result[i][1])
            y = int(result[i][2])
            w = int(result[i][3] / 2)
            h = int(result[i][4] / 2)
            cv2.rectangle(img, (x - w, y - h), (x + w, y + h), (0, 255, 0), 2)
            cv2.rectangle(img, (x - w, y - h - 20),
                          (x + w, y - h), (125, 125, 125), -1)
            lineType = cv2.LINE_AA if cv2.__version__ > '3' else cv2.CV_AA
            cv2.putText(
                img, result[i][0] + ' : %.2f' % result[i][5],
                (x - w + 5, y - h - 7), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                (0, 0, 0), 1, lineType)     #write the class name above the box

    def detect(self, img):  #run detection and map the predictions back onto the original image
        #preprocess the image
        img_h, img_w, _ = img.shape
        inputs = cv2.resize(img, (self.image_size, self.image_size))
        inputs = cv2.cvtColor(inputs, cv2.COLOR_BGR2RGB).astype(np.float32)
        inputs = (inputs / 255.0) * 2.0 - 1.0 #normalize to (-1, 1)
        inputs = np.reshape(inputs, (1, self.image_size, self.image_size, 3))#add a batch dimension (4-D)

        result = self.detect_from_cvmat(inputs)[0]      #only one image in the batch

        for i in range(len(result)):                    #there may be several detected objects
            result[i][1] *= (1.0 * img_w / self.image_size)     #scale the coordinates from the 448x448 input back to the original image size
            result[i][2] *= (1.0 * img_h / self.image_size)
            result[i][3] *= (1.0 * img_w / self.image_size)
            result[i][4] *= (1.0 * img_h / self.image_size)

        return result

    def detect_from_cvmat(self, inputs):        #run the network on a batch of preprocessed images
        net_output = self.sess.run(self.net.logits,
                                   feed_dict={self.net.images: inputs})
        results = []
        for i in range(net_output.shape[0]):        #one prediction per image; net_output has shape batch x 1470
            results.append(self.interpret_output(net_output[i]))    #decode one prediction at a time

        return results      #(batch x num) x (class, xc, yc, w, h, score)

    def interpret_output(self, output): #decode one prediction: convert the boxes back to image scale and return the detected classes
        #split the flat output vector into its parts
        probs = np.zeros((self.cell_size, self.cell_size,
                          self.boxes_per_cell, self.num_class))
        class_probs = np.reshape(
            output[0:self.boundary1],
            (self.cell_size, self.cell_size, self.num_class))
        scales = np.reshape(
            output[self.boundary1:self.boundary2],
            (self.cell_size, self.cell_size, self.boxes_per_cell))
        boxes = np.reshape(
            output[self.boundary2:],
            (self.cell_size, self.cell_size, self.boxes_per_cell, 4))
        offset = np.array(
            [np.arange(self.cell_size)] * self.cell_size * self.boxes_per_cell)
        offset = np.transpose(
            np.reshape(
                offset,
                [self.boxes_per_cell, self.cell_size, self.cell_size]),
            (1, 2, 0))

        boxes[:, :, :, 0] += offset     #add the cell column index, i.e. undo the cell-relative x offset
        boxes[:, :, :, 1] += np.transpose(offset, (1, 0, 2))    #same for y with the row index
        boxes[:, :, :, :2] = 1.0 * boxes[:, :, :, 0:2] / self.cell_size #(xc, yc) relative to the whole image
        boxes[:, :, :, 2:] = np.square(boxes[:, :, :, 2:])  #w and h were square-rooted, square them back

        boxes *= self.image_size

        for i in range(self.boxes_per_cell):
            for j in range(self.num_class):
                probs[:, :, i, j] = np.multiply(
                    class_probs[:, :, j], scales[:, :, i])  #class-specific score: P(class_j | object) * confidence of box i
        #apply the threshold: keep only scores >= threshold
        filter_mat_probs = np.array(probs >= self.threshold, dtype='bool')#boolean mask of scores above the threshold
        filter_mat_boxes = np.nonzero(filter_mat_probs) #indices of the non-zero entries
        boxes_filtered = boxes[filter_mat_boxes[0],
                               filter_mat_boxes[1], filter_mat_boxes[2]]    #select the surviving boxes as a 2-D array
        probs_filtered = probs[filter_mat_probs]       #their scores, as a 1-D array
        classes_num_filtered = np.argmax(   #class index with the highest score for each surviving box
            filter_mat_probs, axis=3)[
            filter_mat_boxes[0], filter_mat_boxes[1], filter_mat_boxes[2]]

        argsort = np.array(np.argsort(probs_filtered))[::-1] #sort by score, highest first
        boxes_filtered = boxes_filtered[argsort]
        probs_filtered = probs_filtered[argsort]
        classes_num_filtered = classes_num_filtered[argsort]

        for i in range(len(boxes_filtered)):    #non-maximum suppression
            if probs_filtered[i] == 0:
                continue
            for j in range(i + 1, len(boxes_filtered)):
                if self.iou(boxes_filtered[i], boxes_filtered[j]) > self.iou_threshold: #if two boxes overlap too much, drop the lower-scoring one
                    probs_filtered[j] = 0.0 #zero it out so it is skipped later

        filter_iou = np.array(probs_filtered > 0.0, dtype='bool')   #the final survivors; one non-zero score per detected object
        boxes_filtered = boxes_filtered[filter_iou]                 #their boxes
        probs_filtered = probs_filtered[filter_iou]                 #their scores
        classes_num_filtered = classes_num_filtered[filter_iou]     #their classes
        '''
            Example of how the non-zero elements of a higher-dimensional array are pulled out as a 1-D array above:
            import numpy as np
            a = np.array([[0,0,3],[0,0,0],[0,0,9]])
            b = np.nonzero(a)  #output: (array([0, 2], dtype=int64), array([2, 2], dtype=int64)) -- the indices of the non-zero elements
            b[0]               #output: array([0, 2], dtype=int64)
            c = a > 0          #output: array([[False, False, True],
                               #               [False, False, False],
                               #               [False, False, True]])
            a[c]               #indexing with the mask flattens to a 1-D array: array([3, 9])
            a[b]               #same result as a[c]: array([3, 9])

        '''

        result = []
        for i in range(len(boxes_filtered)):
            result.append(
                [self.classes[classes_num_filtered[i]],
                 boxes_filtered[i][0],
                 boxes_filtered[i][1],
                 boxes_filtered[i][2],
                 boxes_filtered[i][3],
                 probs_filtered[i]])

        return result   #returns num x (class, xc, yc, w, h, score)

    def iou(self, box1, box2):      #intersection over union of two boxes
        tb = min(box1[0] + 0.5 * box1[2], box2[0] + 0.5 * box2[2]) - \
            max(box1[0] - 0.5 * box1[2], box2[0] - 0.5 * box2[2])
        lr = min(box1[1] + 0.5 * box1[3], box2[1] + 0.5 * box2[3]) - \
            max(box1[1] - 0.5 * box1[3], box2[1] - 0.5 * box2[3])
        inter = 0 if tb < 0 or lr < 0 else tb * lr
        return inter / (box1[2] * box1[3] + box2[2] * box2[3] - inter)

    def camera_detector(self, cap, wait=10):    #detect from a camera stream and time the detection
        detect_timer = Timer()
        ret, _ = cap.read()

        while ret:
            ret, frame = cap.read()
            detect_timer.tic()
            result = self.detect(frame)
            detect_timer.toc()
            print('Average detecting time: {:.3f}s'.format(
                detect_timer.average_time))

            self.draw_result(frame, result)
            cv2.imshow('Camera', frame)
            cv2.waitKey(wait)

            ret, frame = cap.read()

    def image_detector(self, imname, wait=0):   #detect on a single image file and time the detection
        detect_timer = Timer()
        image = cv2.imread(imname)

        detect_timer.tic()
        result = self.detect(image)
        detect_timer.toc()
        print('Average detecting time: {:.3f}s'.format(
            detect_timer.average_time))

        self.draw_result(image, result)
        cv2.imshow('Image', image)
        cv2.waitKey(wait)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', default="YOLO_small.ckpt", type=str)
    parser.add_argument('--weight_dir', default='weights', type=str)
    parser.add_argument('--data_dir', default="data", type=str)
    parser.add_argument('--gpu', default='', type=str)
    args = parser.parse_args()

    os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu

    yolo = YOLONet(False)   #False -- inference, True -- training
    weight_file = os.path.join(args.data_dir, args.weight_dir, args.weights)    #path to the trained weights
    detector = Detector(yolo, weight_file)

    # detect from camera    #uncomment for real-time camera detection
    # cap = cv2.VideoCapture(-1)
    # detector.camera_detector(cap)

    # detect from image file  #run on a test image
    imname = 'test/person.jpg'
    detector.image_detector(imname)


if __name__ == '__main__':
    main()
