Reference article:
https://blog.csdn.net/qq_38232598/article/details/88695454
· End-to-end: using a direct-regression approach, object localization and class prediction are integrated into a single neural network model
· The bbox positions and classes are regressed directly at the output layer – fast!
1. Divide the image into S×S grid cells. If an object's center (determined from its ground-truth box) falls into a cell, that cell is responsible for predicting the object.
2. For each predicted bounding box, the network regresses not only its position but also a confidence value, which is the product of the probability that the box contains an object and how accurately the box is predicted (Pr(Object) × IOU).
3. For the 20 classes, each grid cell predicts the conditional probability Pr(Class_i|Object) that its bboxes belong to a single class; the B bboxes of the same cell share one set of conditional probabilities. At test time, the conditional probabilities are multiplied by each individual bbox's confidence prediction:
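This multiplication yields the class-specific confidence score Pr(Class_i) × IOU for every (bbox, class) pair. A minimal NumPy sketch under the S=7, B=2, C=20 setting used below (the probability values are random stand-ins):

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, number of classes
class_probs = np.random.rand(S, S, C)    # Pr(Class_i | Object), shared by the B boxes of a cell
confidence = np.random.rand(S, S, B)     # Pr(Object) * IOU, one value per box

# class-specific confidence score: one value per (cell, box, class)
scores = class_probs[:, :, None, :] * confidence[:, :, :, None]
print(scores.shape)  # (7, 7, 2, 20)
```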
YOLO uses 24 stacked convolutional layers followed by 2 fully connected layers; the 1×1 convolutional layers reduce the feature space of the preceding layers. The convolutional layers are pretrained on the ImageNet task with 224×224 input images, and the resolution is then doubled for detection.
The five values predicted per bounding box are: the box coordinates x and y, the box width and height, and a confidence value.
The output tensor size is therefore S×S×(5×B+C). (x, y) is the offset of the bounding box relative to the grid-cell boundary, normalized to the range 0-1, while w and h are the predicted width and height relative to the whole image; the first four values are all normalized to 0-1. c is the confidence that an object is in the bounding box.
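For the Pascal VOC setting used throughout this article (S=7, B=2, C=20), the output size works out as:

```python
S, B, C = 7, 2, 20
depth = 5 * B + C      # 30: (x, y, w, h, confidence) for each of B boxes, plus C class probabilities
print(S * S * depth)   # 1470 output values per image
```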
Assume the image is divided into a 7×7 grid and has width wi and height hi.
1. x, y – the offset of the bbox center relative to its grid cell
As shown in the figure below, the blue cell has coordinates (Xcol = 1, Yrow = 4). Suppose the predicted red bbox has center (xc, yc); the final predicted (x, y), after the normalization formula, is the offset relative to that cell.
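The normalization can be sketched as follows; the center coordinates and the 448×448 input size are hypothetical example values:

```python
S = 7                          # grid size
wi, hi = 448.0, 448.0          # image width and height (assumed 448x448 input)
xc, yc = 149.0, 280.0          # hypothetical bbox center in pixels

# the cell the center falls into, and the offset of the center inside that cell
x_col, y_row = int(xc * S / wi), int(yc * S / hi)
x = xc * S / wi - x_col        # offset in [0, 1) relative to the cell
y = yc * S / hi - y_row
print((x_col, y_row), (x, y))  # the cell, then the offsets inside it
```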
2. w, h – the bbox size as a proportion of the whole image
The 5 values of each grid cell's two bounding boxes occupy the first ten dimensions of the tensor. Each cell is responsible for predicting one object, but it predicts two bounding boxes.
The first 10 dimensions of the tensor hold the bounding-box information; the remaining 20 dimensions hold the class information, where the value in dimension i is the probability that the object, given that it is in this bbox, belongs to class_i.
Each cell predicts the conditional probability Pr(Class_i|Object) of belonging to class_i. Note that the two bboxes of one cell share a single set of conditional probabilities, so each cell ultimately predicts one object class.
From the formula above, each bbox's class-specific confidence score can be computed. Doing this for every bounding box of every cell yields 7×7×2 = 98 scores (two boxes per cell). A threshold is then set to filter out low-scoring boxes, NMS (Non-Maximum Suppression) is applied to the remaining bboxes, and the detections are obtained.
From the previous step we have each bbox's class-specific confidence score.
· First, set a threshold and filter out low-scoring bboxes
· Sort the remaining scores from high to low
· After sorting, mark the highest-scoring box as bbox_max
· Compare bbox_max with every other lower-scoring but nonzero box (bbox_cur) by measuring their overlap, i.e. computing the IOU; if IOU(bbox_max, bbox_cur) > 0.5, the overlap is too large and bbox_cur's score is set to 0
· Then move to the next score: if it is not the maximum and is nonzero, compute its IOU with the max-score bbox as above; if they do not overlap too much, its score stays unchanged, and the next bbox_cur is processed
· After one full pass, take the box with the second-highest remaining score (0.2 in the figure above) as the next round's bbox_max, and compute the IOU of each later bbox_cur with this new bbox_max as before, looping until the last score. In the figure, the final scores come out as (0.5, 0.2, 0, …, 0)
· The other classes are processed the same way
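The steps above can be sketched as a small NumPy routine; the boxes (given as (xc, yc, w, h)) and scores below are made-up values:

```python
import numpy as np

def iou(b1, b2):
    # boxes are (xc, yc, w, h); overlap width/height, then intersection over union
    tb = min(b1[0] + b1[2] / 2, b2[0] + b2[2] / 2) - max(b1[0] - b1[2] / 2, b2[0] - b2[2] / 2)
    lr = min(b1[1] + b1[3] / 2, b2[1] + b2[3] / 2) - max(b1[1] - b1[3] / 2, b2[1] - b2[3] / 2)
    inter = 0.0 if tb < 0 or lr < 0 else tb * lr
    return inter / (b1[2] * b1[3] + b2[2] * b2[3] - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(scores)[::-1]            # sort scores high to low
    boxes, scores = boxes[order], scores[order].copy()
    for i in range(len(boxes)):
        if scores[i] == 0:                      # already suppressed
            continue
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > iou_threshold:
                scores[j] = 0.0                 # suppress the overlapping lower-score box
    return boxes[scores > 0], scores[scores > 0]

boxes = np.array([[100, 100, 60, 60], [105, 105, 60, 60], [300, 300, 40, 40]], float)
scores = np.array([0.5, 0.4, 0.2])
kept_boxes, kept_scores = nms(boxes, scores)
print(kept_scores)  # the 0.4 box overlaps the 0.5 box heavily and is suppressed
```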
After the NMS step, many scores are 0.
Finally, for the 20 class scores of each bbox:
1. Find the index of the bbox's highest-scoring class, denoted class
2. Find that maximum score, denoted score
3. If score is greater than 0, draw the bbox in the image labeled with class; otherwise discard the bbox
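For one bbox, this final selection looks like the following (the 20-class score vector, with only one surviving class, is hypothetical):

```python
import numpy as np

scores = np.zeros(20)           # the bbox's 20 class scores after thresholding and NMS
scores[11] = 0.6                # hypothetical: only class 11 kept a nonzero score

cls = int(np.argmax(scores))    # 1. index of the highest-scoring class
score = scores[cls]             # 2. the bbox's maximum score
if score > 0:                   # 3. keep the bbox and draw it, otherwise discard it
    print('draw box: class', cls, 'score', score)
```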
· Final result:
In the loss function, a cell is penalized for classification loss only when an object's center falls into it, and a bbox in cell i is penalized for coordinate error only when it is responsible for some ground-truth box. Which bbox is responsible for a ground-truth box is decided by the largest IOU.
· LOSS parameter explanation
1. In practice, localization error and classification error are not equally important. To increase the influence of localization error on the result, a weight λcoord is added, set to 5 for Pascal VOC training.
2. Bboxes containing no object get a smaller loss weight, written λnoobj and set to 0.5 on Pascal VOC. If no object center falls in a box, its confidence target is 0. The 0.5 weight counteracts the imbalance that would otherwise arise, since most boxes contain no object.
3. The confidence loss of bboxes containing an object and the class loss use a constant weight of 1. By design, each cell can contain only one kind of object. When a cell contains an object, we want the predicted probability of the correct class to be as close to 1 as possible and the probabilities of the wrong classes as close to 0 as possible. In the 4th part of the loss, the target p̂i(c) is 1 when c is the correct class and 0 otherwise.
4. The square roots of wi and hi are used in the error because the same deviation of the predicted box matters more for a small object than for a large one; taking the square root increases the penalty on small-object deviations.
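The effect of the square root can be checked on two hypothetical boxes whose predicted widths are both off by 10 pixels:

```python
import math

def sq_err(w_true, w_pred):
    # squared error on the raw width
    return (w_true - w_pred) ** 2

def sqrt_err(w_true, w_pred):
    # squared error on the square-rooted width, as in the YOLO loss
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2

big_plain, small_plain = sq_err(300, 290), sq_err(30, 20)    # without sqrt: identical penalties
big_root, small_root = sqrt_err(300, 290), sqrt_err(30, 20)  # with sqrt: the small box costs far more
print(big_plain, small_plain)  # 100 100
print(big_root < small_root)   # True
```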
Advantages:
1. Few false positives (FP) – YOLO sees the entire image as context, combining global information, so it distinguishes objects from background regions better
2. End-to-end and fast (the tiny version reaches 155 fps)
3. Generalizes well
Disadvantages:
1. Accuracy is below that of the state-of-the-art detectors
2. Localization is imprecise; small objects and dense groups are detected poorly (each cell can only produce one detection)
3. The imbalance between errors on large and small objects is still hard to handle
4. Fixed detection size and fixed resolution
5. With several downsampling layers, the learned object features are not fine-grained enough
# utils/pascal_voc.py
import os
import xml.etree.ElementTree as ET
import numpy as np
import cv2
import pickle
import copy
import yolo.config as cfg


class pascal_voc(object):
    def __init__(self, phase, rebuild=False):
        # locate the VOC dataset
        self.devkil_path = os.path.join(cfg.PASCAL_PATH, 'VOCdevkit')
        self.data_path = os.path.join(self.devkil_path, 'VOC2007')
        # read the parameters set in yolo/config.py
        self.cache_path = cfg.CACHE_PATH    # cache directory
        self.batch_size = cfg.BATCH_SIZE    # 45 images per training batch
        self.image_size = cfg.IMAGE_SIZE    # image size -- 448
        self.cell_size = cfg.CELL_SIZE      # grid cells per image -- 7*7
        self.classes = cfg.CLASSES          # list of all class names
        self.class_to_ind = dict(zip(self.classes, range(len(self.classes))))
        self.flipped = cfg.FLIPPED          # whether to augment with horizontal flips
        self.phase = phase                  # which data split to use
        self.rebuild = rebuild              # whether to rebuild the parsed-label cache file
        self.cursor = 0
        self.epoch = 1                      # current training epoch
        self.gt_labels = None               # list of label dicts
        self.prepare()                      # prepare the data

    def get(self):
        # fetch images and their labels one batch at a time (generator-style)
        images = np.zeros(
            (self.batch_size, self.image_size, self.image_size, 3))
        labels = np.zeros(
            (self.batch_size, self.cell_size, self.cell_size, 25))
        count = 0
        while count < self.batch_size:
            imname = self.gt_labels[self.cursor]['imname']
            flipped = self.gt_labels[self.cursor]['flipped']
            images[count, :, :, :] = self.image_read(imname, flipped)
            labels[count, :, :, :] = self.gt_labels[self.cursor]['label']
            count += 1
            self.cursor += 1
            if self.cursor >= len(self.gt_labels):
                # after a full pass over gt_labels, shuffle and start over
                np.random.shuffle(self.gt_labels)
                self.cursor = 0
                self.epoch += 1  # next epoch
        return images, labels

    def image_read(self, imname, flipped=False):
        # read one image
        image = cv2.imread(imname)
        image = cv2.resize(image, (self.image_size, self.image_size))  # resize the image
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)  # BGR to 3-channel RGB
        image = (image / 255.0) * 2.0 - 1.0  # normalize pixels to (-1, 1) as floats
        if flipped:
            image = image[:, ::-1, :]  # horizontal-flip augmentation
        return image

    def prepare(self):
        # finish preparing the encoded label data
        gt_labels = self.load_labels()  # extract the info of all images
        if self.flipped:
            # when flipping is enabled, the coordinates must be adjusted accordingly
            print('Appending horizontally-flipped training examples ...')
            gt_labels_cp = copy.deepcopy(gt_labels)
            for idx in range(len(gt_labels_cp)):
                gt_labels_cp[idx]['flipped'] = True
                gt_labels_cp[idx]['label'] =\
                    gt_labels_cp[idx]['label'][:, ::-1, :]  # flip the label grid horizontally (the cell-column axis)
                for i in range(self.cell_size):
                    for j in range(self.cell_size):
                        if gt_labels_cp[idx]['label'][i, j, 0] == 1:  # only cells containing an object
                            gt_labels_cp[idx]['label'][i, j, 1] = \
                                self.image_size - 1 -\
                                gt_labels_cp[idx]['label'][i, j, 1]  # flip xc, the x coordinate of the object center
            gt_labels += gt_labels_cp
        np.random.shuffle(gt_labels)  # shuffle along axis 0
        self.gt_labels = gt_labels
        return gt_labels

    def load_labels(self):
        # parse the labels into a list of dicts
        cache_file = os.path.join(
            self.cache_path, 'pascal_' + self.phase + '_gt_labels.pkl')
        if os.path.isfile(cache_file) and not self.rebuild:
            print('Loading gt_labels from: ' + cache_file)
            with open(cache_file, 'rb') as f:
                gt_labels = pickle.load(f)  # unpickle the cached labels
            return gt_labels  # when not rebuilding, return the cache directly
        print('Processing gt_labels from: ' + self.data_path)
        if not os.path.exists(self.cache_path):
            os.makedirs(self.cache_path)
        if self.phase == 'train':
            txtname = os.path.join(
                self.data_path, 'ImageSets', 'Main', 'trainval.txt')
        else:
            txtname = os.path.join(
                self.data_path, 'ImageSets', 'Main', 'test.txt')
        with open(txtname, 'r') as f:
            self.image_index = [x.strip() for x in f.readlines()]  # read image ids; strip() removes surrounding whitespace
        # build a label for each image and collect them in gt_labels
        gt_labels = []
        for index in self.image_index:
            label, num = self.load_pascal_annotation(index)
            if num == 0:
                continue
            imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
            gt_labels.append({'imname': imname,
                              'label': label,
                              'flipped': False})  # one dict per image
        print('Saving gt_labels to: ' + cache_file)
        with open(cache_file, 'wb') as f:
            pickle.dump(gt_labels, f)  # pickle the list of dicts
        return gt_labels

    def load_pascal_annotation(self, index):
        """
        Load image and bounding boxes info from XML file in the PASCAL VOC
        format.
        """
        # parse the bbox coordinates from the PASCAL VOC xml and fill the label tensor
        imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
        im = cv2.imread(imname)
        h_ratio = 1.0 * self.image_size / im.shape[0]  # the image is resized, so coordinates must be rescaled too
        w_ratio = 1.0 * self.image_size / im.shape[1]
        # im = cv2.resize(im, [self.image_size, self.image_size])
        label = np.zeros((self.cell_size, self.cell_size, 25))
        filename = os.path.join(self.data_path, 'Annotations', index + '.xml')  # the xml annotation file
        tree = ET.parse(filename)
        objs = tree.findall('object')
        for obj in objs:
            bbox = obj.find('bndbox')
            # Make pixel indexes 0-based
            '''
            min() -- keep the box inside image_size, pulling overflowing edges back in
            max() -- when a coordinate is below 0, pull it back into the image
            -1   -- make pixel indexes start at 0, like array indexes
            '''
            x1 = max(min((float(bbox.find('xmin').text) - 1) * w_ratio, self.image_size - 1), 0)
            y1 = max(min((float(bbox.find('ymin').text) - 1) * h_ratio, self.image_size - 1), 0)
            x2 = max(min((float(bbox.find('xmax').text) - 1) * w_ratio, self.image_size - 1), 0)
            y2 = max(min((float(bbox.find('ymax').text) - 1) * h_ratio, self.image_size - 1), 0)
            cls_ind = self.class_to_ind[obj.find('name').text.lower().strip()]  # map each class name to a numeric index
            boxes = [(x2 + x1) / 2.0, (y2 + y1) / 2.0, x2 - x1, y2 - y1]  # (x_center, y_center, width, height)
            x_ind = int(boxes[0] * self.cell_size / self.image_size)  # which of the 7*7 cells the center falls in
            y_ind = int(boxes[1] * self.cell_size / self.image_size)
            # (confidence, x, y, width, height, class)
            if label[y_ind, x_ind, 0] == 1:  # when several objects share a cell, the cell keeps only one
                continue
            label[y_ind, x_ind, 0] = 1  # confidence
            label[y_ind, x_ind, 1:5] = boxes
            label[y_ind, x_ind, 5 + cls_ind] = 1
        return label, len(objs)  # return the label and the number of objects
# yolo/yolo_net.py
import numpy as np
import tensorflow as tf
import yolo.config as cfg

slim = tf.contrib.slim


class YOLONet(object):
    def __init__(self, is_training=True):
        self.classes = cfg.CLASSES  # class names
        self.num_class = len(self.classes)
        self.image_size = cfg.IMAGE_SIZE
        self.cell_size = cfg.CELL_SIZE
        self.boxes_per_cell = cfg.BOXES_PER_CELL
        self.output_size = (self.cell_size * self.cell_size) *\
            (self.num_class + self.boxes_per_cell * 5)
        self.scale = 1.0 * self.image_size / self.cell_size
        self.boundary1 = self.cell_size * self.cell_size * self.num_class
        self.boundary2 = self.boundary1 +\
            self.cell_size * self.cell_size * self.boxes_per_cell
        self.object_scale = cfg.OBJECT_SCALE
        self.noobject_scale = cfg.NOOBJECT_SCALE
        self.class_scale = cfg.CLASS_SCALE
        self.coord_scale = cfg.COORD_SCALE
        self.learning_rate = cfg.LEARNING_RATE
        self.batch_size = cfg.BATCH_SIZE
        self.alpha = cfg.ALPHA
        # per-cell column indices, later used to turn cell-relative coordinates into image-relative ones
        self.offset = np.transpose(np.reshape(np.array(
            [np.arange(self.cell_size)] * self.cell_size * self.boxes_per_cell),
            (self.boxes_per_cell, self.cell_size, self.cell_size)), (1, 2, 0))
        self.images = tf.placeholder(
            tf.float32, [None, self.image_size, self.image_size, 3],
            name='images')
        self.logits = self.build_network(
            self.images, num_outputs=self.output_size, alpha=self.alpha,
            is_training=is_training)
        if is_training:
            self.labels = tf.placeholder(
                tf.float32,
                [None, self.cell_size, self.cell_size, 5 + self.num_class])
            self.loss_layer(self.logits, self.labels)
            self.total_loss = tf.losses.get_total_loss()
            tf.summary.scalar('total_loss', self.total_loss)

    # define the yolo network structure
    def build_network(self,
                      images,
                      num_outputs,
                      alpha,
                      keep_prob=0.5,
                      is_training=True,
                      scope='yolo'):
        with tf.variable_scope(scope):
            with slim.arg_scope(
                # default settings applied to these layer types
                [slim.conv2d, slim.fully_connected],
                activation_fn=leaky_relu(alpha),
                weights_regularizer=slim.l2_regularizer(0.0005),
                weights_initializer=tf.truncated_normal_initializer(0.0, 0.01)
            ):
                net = tf.pad(  # padding so that the final feature map comes out 7*7
                    images, np.array([[0, 0], [3, 3], [3, 3], [0, 0]]),
                    name='pad_1')
                net = slim.conv2d(
                    net, 64, 7, 2, padding='VALID', scope='conv_2')  # 7*7*64-s-2
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_3')
                net = slim.conv2d(net, 192, 3, scope='conv_4')  # 3*3*192
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_5')
                net = slim.conv2d(net, 128, 1, scope='conv_6')
                net = slim.conv2d(net, 256, 3, scope='conv_7')
                net = slim.conv2d(net, 256, 1, scope='conv_8')
                net = slim.conv2d(net, 512, 3, scope='conv_9')
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_10')
                net = slim.conv2d(net, 256, 1, scope='conv_11')
                net = slim.conv2d(net, 512, 3, scope='conv_12')
                net = slim.conv2d(net, 256, 1, scope='conv_13')
                net = slim.conv2d(net, 512, 3, scope='conv_14')
                net = slim.conv2d(net, 256, 1, scope='conv_15')
                net = slim.conv2d(net, 512, 3, scope='conv_16')
                net = slim.conv2d(net, 256, 1, scope='conv_17')
                net = slim.conv2d(net, 512, 3, scope='conv_18')
                net = slim.conv2d(net, 512, 1, scope='conv_19')
                net = slim.conv2d(net, 1024, 3, scope='conv_20')
                net = slim.max_pool2d(net, 2, padding='SAME', scope='pool_21')
                net = slim.conv2d(net, 512, 1, scope='conv_22')
                net = slim.conv2d(net, 1024, 3, scope='conv_23')
                net = slim.conv2d(net, 512, 1, scope='conv_24')
                net = slim.conv2d(net, 1024, 3, scope='conv_25')
                net = slim.conv2d(net, 1024, 3, scope='conv_26')
                net = tf.pad(
                    net, np.array([[0, 0], [1, 1], [1, 1], [0, 0]]),
                    name='pad_27')
                net = slim.conv2d(
                    net, 1024, 3, 2, padding='VALID', scope='conv_28')
                net = slim.conv2d(net, 1024, 3, scope='conv_29')
                net = slim.conv2d(net, 1024, 3, scope='conv_30')
                net = tf.transpose(net, [0, 3, 1, 2], name='trans_31')  # reorder axes to [batch, channels, height, width]
                net = slim.flatten(net, scope='flat_32')  # flatten, keeping axis 0 (the batch) intact
                net = slim.fully_connected(net, 512, scope='fc_33')
                net = slim.fully_connected(net, 4096, scope='fc_34')
                net = slim.dropout(
                    net, keep_prob=keep_prob, is_training=is_training,
                    scope='dropout_35')
                net = slim.fully_connected(
                    net, num_outputs, activation_fn=None, scope='fc_36')  # the last layer uses a linear activation
        return net  # 7*7*30 = 1470 output values, mirroring the cell*cell*25 label layout

    def calc_iou(self, boxes1, boxes2, scope='iou'):
        # convert center/size boxes to corner boxes, then compute the IOU
        """calculate ious
        Args:
          boxes1: 5-D tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL, 4] ====> (x_center, y_center, w, h)
          boxes2: 5-D tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL, 4] ===> (x_center, y_center, w, h)
        Return:
          iou: 4-D tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
        """
        with tf.variable_scope(scope):
            # transform (x_center, y_center, w, h) to (x1, y1, x2, y2); 1 -- top-left corner, 2 -- bottom-right corner
            boxes1_t = tf.stack([boxes1[..., 0] - boxes1[..., 2] / 2.0,
                                 boxes1[..., 1] - boxes1[..., 3] / 2.0,
                                 boxes1[..., 0] + boxes1[..., 2] / 2.0,
                                 boxes1[..., 1] + boxes1[..., 3] / 2.0],
                                axis=-1)
            boxes2_t = tf.stack([boxes2[..., 0] - boxes2[..., 2] / 2.0,
                                 boxes2[..., 1] - boxes2[..., 3] / 2.0,
                                 boxes2[..., 0] + boxes2[..., 2] / 2.0,
                                 boxes2[..., 1] + boxes2[..., 3] / 2.0],
                                axis=-1)
            # calculate the left-up point & right-down point of the intersection rectangle
            lu = tf.maximum(boxes1_t[..., :2], boxes2_t[..., :2])  # the larger of the two top-left corners
            rd = tf.minimum(boxes1_t[..., 2:], boxes2_t[..., 2:])  # the smaller of the two bottom-right corners
            # intersection -- size of the overlapping region
            intersection = tf.maximum(0.0, rd - lu)  # width and height of the intersection
            inter_square = intersection[..., 0] * intersection[..., 1]  # intersection area
            # calculate the boxs1 square and boxs2 square
            square1 = boxes1[..., 2] * boxes1[..., 3]
            square2 = boxes2[..., 2] * boxes2[..., 3]
            union_square = tf.maximum(square1 + square2 - inter_square, 1e-10)  # union area (overlap counted once)
        return tf.clip_by_value(inter_square / union_square, 0.0, 1.0)  # IOU, clipped to [0, 1]

    def loss_layer(self, predicts, labels, scope='loss_layer'):
        with tf.variable_scope(scope):
            predict_classes = tf.reshape(
                predicts[:, :self.boundary1],
                [self.batch_size, self.cell_size, self.cell_size, self.num_class])  # predicted classes
            predict_scales = tf.reshape(
                predicts[:, self.boundary1:self.boundary2],
                [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell])  # predicted confidences
            predict_boxes = tf.reshape(
                predicts[:, self.boundary2:],
                [self.batch_size, self.cell_size, self.cell_size, self.boxes_per_cell, 4])  # predicted boxes; two per cell, so 8 values
            # pull out the ground-truth values to compute the loss against
            response = tf.reshape(  # response -- the label's confidence bit: whether the cell contains an object
                labels[..., 0],
                [self.batch_size, self.cell_size, self.cell_size, 1])
            boxes = tf.reshape(
                labels[..., 1:5],
                [self.batch_size, self.cell_size, self.cell_size, 1, 4])  # add a box axis; dims 1-4 hold the gt box
            # tf.tile replicates boxes along axis 3 (each cell has two boxes);
            # dividing by image_size normalizes xc, yc, w, h to match the predictions
            boxes = tf.tile(
                boxes, [1, 1, 1, self.boxes_per_cell, 1]) / self.image_size
            classes = labels[..., 5:]  # dims from 5 on are the class one-hot
            offset = tf.reshape(  # offset aligns each cell's coordinates
                tf.constant(self.offset, dtype=tf.float32),
                [1, self.cell_size, self.cell_size, self.boxes_per_cell])  # shape [1, cell, cell, per_cell]; values run 0-6
            offset = tf.tile(offset, [self.batch_size, 1, 1, 1])  # expand to the whole batch
            offset_tran = tf.transpose(offset, (0, 2, 1, 3))  # transposed copy, used as the y-direction offsets
            predict_boxes_tran = tf.stack(  # convert predictions to image-normalized boxes
                [(predict_boxes[..., 0] + offset) / self.cell_size,  # xc = (x + x_col) / S
                 (predict_boxes[..., 1] + offset_tran) / self.cell_size,
                 tf.square(predict_boxes[..., 2]),  # the network predicts sqrt(w) and sqrt(h)
                 tf.square(predict_boxes[..., 3])], axis=-1)
            iou_predict_truth = self.calc_iou(predict_boxes_tran, boxes)  # [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
            # calculate I tensor [BATCH_SIZE, CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]: the boxes responsible for objects
            object_mask = tf.reduce_max(iou_predict_truth, 3, keep_dims=True)  # the larger IOU in each cell
            object_mask = tf.cast(  # cast the boolean mask to float, then gate by response (elementwise multiply)
                (iou_predict_truth >= object_mask), tf.float32) * response
            # calculate no_I tensor [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]: the no-object regions
            noobject_mask = tf.ones_like(
                object_mask, dtype=tf.float32) - object_mask  # all ones minus the object mask
            boxes_tran = tf.stack(  # ground-truth boxes in cell-relative form
                [boxes[..., 0] * self.cell_size - offset,  # x = xc*S - x_col => cell-relative offset (xc was already divided by image_size above)
                 boxes[..., 1] * self.cell_size - offset_tran,  # same for y; w and h were normalized above too
                 tf.sqrt(boxes[..., 2]),
                 tf.sqrt(boxes[..., 3])], axis=-1)  # in the loss, w and h are square-rooted
            # compute the losses
            # class_loss
            class_delta = response * (predict_classes - classes)  # class error, only where an object is present
            class_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(class_delta), axis=[1, 2, 3]),
                name='class_loss') * self.class_scale
            # object_loss
            object_delta = object_mask * (predict_scales - iou_predict_truth)
            object_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(object_delta), axis=[1, 2, 3]),
                name='object_loss') * self.object_scale
            # noobject_loss
            noobject_delta = noobject_mask * predict_scales
            noobject_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(noobject_delta), axis=[1, 2, 3]),
                name='noobject_loss') * self.noobject_scale
            # coord_loss
            coord_mask = tf.expand_dims(object_mask, 4)
            boxes_delta = coord_mask * (predict_boxes - boxes_tran)
            coord_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(boxes_delta), axis=[1, 2, 3, 4]),
                name='coord_loss') * self.coord_scale
            tf.losses.add_loss(class_loss)
            tf.losses.add_loss(object_loss)
            tf.losses.add_loss(noobject_loss)
            tf.losses.add_loss(coord_loss)
            tf.summary.scalar('class_loss', class_loss)
            tf.summary.scalar('object_loss', object_loss)
            tf.summary.scalar('noobject_loss', noobject_loss)
            tf.summary.scalar('coord_loss', coord_loss)
            tf.summary.histogram('boxes_delta_x', boxes_delta[..., 0])
            tf.summary.histogram('boxes_delta_y', boxes_delta[..., 1])
            tf.summary.histogram('boxes_delta_w', boxes_delta[..., 2])
            tf.summary.histogram('boxes_delta_h', boxes_delta[..., 3])
            tf.summary.histogram('iou', iou_predict_truth)


def leaky_relu(alpha):
    def op(inputs):
        return tf.nn.leaky_relu(inputs, alpha=alpha, name='leaky_relu')
    return op
# train.py
import os
import argparse
import datetime
import tensorflow as tf
import yolo.config as cfg
from yolo.yolo_net import YOLONet
from utils.timer import Timer
from utils.pascal_voc import pascal_voc

slim = tf.contrib.slim


class Solver(object):
    def __init__(self, net, data):
        self.net = net
        self.data = data
        self.weights_file = cfg.WEIGHTS_FILE
        self.max_iter = cfg.MAX_ITER
        self.initial_learning_rate = cfg.LEARNING_RATE
        self.decay_steps = cfg.DECAY_STEPS
        self.decay_rate = cfg.DECAY_RATE
        self.staircase = cfg.STAIRCASE
        self.summary_iter = cfg.SUMMARY_ITER
        self.save_iter = cfg.SAVE_ITER
        self.output_dir = os.path.join(
            cfg.OUTPUT_DIR, datetime.datetime.now().strftime('%Y_%m_%d_%H_%M'))
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)
        self.save_cfg()
        # set up checkpoint saving and writing the graph/summaries to tensorboard
        self.variable_to_restore = tf.global_variables()  # list of all variables
        self.saver = tf.train.Saver(self.variable_to_restore, max_to_keep=None)  # saver over all of them
        self.ckpt_file = os.path.join(self.output_dir, 'yolo')  # checkpoint path
        self.summary_op = tf.summary.merge_all()  # bundle all summaries for tensorboard display
        self.writer = tf.summary.FileWriter(self.output_dir, flush_secs=60)  # summary writer
        self.global_step = tf.train.create_global_step()  # incremented once per training step; drives the lr schedule
        self.learning_rate = tf.train.exponential_decay(
            self.initial_learning_rate, self.global_step, self.decay_steps,
            self.decay_rate, self.staircase, name='learning_rate')  # decay the lr by decay_rate every decay_steps
        self.optimizer = tf.train.GradientDescentOptimizer(
            learning_rate=self.learning_rate)
        self.train_op = slim.learning.create_train_op(
            self.net.total_loss, self.optimizer, global_step=self.global_step)  # training op defined from the loss
        gpu_options = tf.GPUOptions()  # used to limit GPU resources
        config = tf.ConfigProto(gpu_options=gpu_options)  # GPU resource configuration
        self.sess = tf.Session(config=config)  # apply the configuration
        self.sess.run(tf.global_variables_initializer())  # global initialization
        # restore pretrained weights
        if self.weights_file is not None:
            print('Restoring weights from: ' + self.weights_file)
            self.saver.restore(self.sess, self.weights_file)
        self.writer.add_graph(self.sess.graph)  # save the computation graph

    def train(self):
        train_timer = Timer()
        load_timer = Timer()
        for step in range(1, self.max_iter + 1):
            # timers used to profile how fast the model runs
            load_timer.tic()  # start timing
            images, labels = self.data.get()
            load_timer.toc()  # stop timing
            feed_dict = {self.net.images: images,
                         self.net.labels: labels}  # one batch of 45 per step; 1 epoch = sum(img)/batch_size = 17125/45 ≈ 381 batches
            if step % self.summary_iter == 0:  # record a summary every 10 steps
                if step % (self.summary_iter * 10) == 0:  # print a log line every 100 steps
                    train_timer.tic()
                    summary_str, loss, _ = self.sess.run(
                        [self.summary_op, self.net.total_loss, self.train_op],
                        feed_dict=feed_dict)
                    train_timer.toc()
                    log_str = '{} Epoch: {}, Step: {}, Learning rate: {},'\
                        ' Loss: {:5.3f}\nSpeed: {:.3f}s/iter,'\
                        ' Load: {:.3f}s/iter, Remain: {}'.format(
                            datetime.datetime.now().strftime('%m-%d %H:%M:%S'),
                            self.data.epoch,
                            int(step),
                            round(self.learning_rate.eval(session=self.sess), 6),
                            loss,
                            train_timer.average_time,
                            load_timer.average_time,
                            train_timer.remain(step, self.max_iter))
                    print(log_str)
                else:
                    train_timer.tic()
                    summary_str, _ = self.sess.run(
                        [self.summary_op, self.train_op],
                        feed_dict=feed_dict)
                    train_timer.toc()
                self.writer.add_summary(summary_str, step)
            else:
                train_timer.tic()
                self.sess.run(self.train_op, feed_dict=feed_dict)
                train_timer.toc()
            if step % self.save_iter == 0:  # save a checkpoint every 1000 steps
                print('{} Saving checkpoint file to: {}'.format(
                    datetime.datetime.now().strftime('%m-%d %H:%M:%S'),
                    self.output_dir))
                self.saver.save(
                    self.sess, self.ckpt_file, global_step=self.global_step)

    def save_cfg(self):
        with open(os.path.join(self.output_dir, 'config.txt'), 'w') as f:
            cfg_dict = cfg.__dict__
            for key in sorted(cfg_dict.keys()):
                if key[0].isupper():
                    cfg_str = '{}: {}\n'.format(key, cfg_dict[key])
                    f.write(cfg_str)


def update_config_paths(data_dir, weights_file):
    cfg.DATA_PATH = data_dir
    cfg.PASCAL_PATH = os.path.join(data_dir, 'pascal_voc')
    cfg.CACHE_PATH = os.path.join(cfg.PASCAL_PATH, 'cache')
    cfg.OUTPUT_DIR = os.path.join(cfg.PASCAL_PATH, 'output')
    cfg.WEIGHTS_DIR = os.path.join(cfg.PASCAL_PATH, 'weights')
    cfg.WEIGHTS_FILE = os.path.join(cfg.WEIGHTS_DIR, weights_file)


def main():
    parser = argparse.ArgumentParser()  # command-line interface
    parser.add_argument('--weights', default="YOLO_small.ckpt", type=str)
    parser.add_argument('--data_dir', default="data", type=str)
    parser.add_argument('--threshold', default=0.2, type=float)
    parser.add_argument('--iou_threshold', default=0.5, type=float)
    parser.add_argument('--gpu', default='', type=str)
    args = parser.parse_args()
    if args.gpu is not None:
        cfg.GPU = args.gpu
    if args.data_dir != cfg.DATA_PATH:
        update_config_paths(args.data_dir, args.weights)
    os.environ['CUDA_VISIBLE_DEVICES'] = cfg.GPU
    yolo = YOLONet()
    pascal = pascal_voc('train')
    solver = Solver(yolo, pascal)
    print('Start training ...')
    solver.train()
    print('Done training.')


if __name__ == '__main__':
    # python train.py --weights YOLO_small.ckpt --gpu 0
    main()
# detection / test script
import os
import cv2
import argparse
import numpy as np
import tensorflow as tf
import yolo.config as cfg
from yolo.yolo_net import YOLONet
from utils.timer import Timer


class Detector(object):
    def __init__(self, net, weight_file):
        self.net = net
        self.weights_file = weight_file
        self.classes = cfg.CLASSES
        self.num_class = len(self.classes)
        self.image_size = cfg.IMAGE_SIZE
        self.cell_size = cfg.CELL_SIZE
        self.boxes_per_cell = cfg.BOXES_PER_CELL
        self.threshold = cfg.THRESHOLD
        self.iou_threshold = cfg.IOU_THRESHOLD
        self.boundary1 = self.cell_size * self.cell_size * self.num_class
        self.boundary2 = self.boundary1 +\
            self.cell_size * self.cell_size * self.boxes_per_cell
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())
        print('Restoring weights from: ' + self.weights_file)
        self.saver = tf.train.Saver()
        self.saver.restore(self.sess, self.weights_file)

    def draw_result(self, img, result):
        # draw the bboxes
        for i in range(len(result)):
            x = int(result[i][1])
            y = int(result[i][2])
            w = int(result[i][3] / 2)
            h = int(result[i][4] / 2)
            cv2.rectangle(img, (x - w, y - h), (x + w, y + h), (0, 255, 0), 2)
            cv2.rectangle(img, (x - w, y - h - 20),
                          (x + w, y - h), (125, 125, 125), -1)
            lineType = cv2.LINE_AA if cv2.__version__ > '3' else cv2.CV_AA
            cv2.putText(
                img, result[i][0] + ' : %.2f' % result[i][5],
                (x - w + 5, y - h - 7), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                (0, 0, 0), 1, lineType)  # write the class name above the box

    def detect(self, img):
        # map detections made at network resolution back onto the original image
        # image preprocessing
        img_h, img_w, _ = img.shape
        inputs = cv2.resize(img, (self.image_size, self.image_size))
        inputs = cv2.cvtColor(inputs, cv2.COLOR_BGR2RGB).astype(np.float32)
        inputs = (inputs / 255.0) * 2.0 - 1.0  # normalize to (-1, 1)
        inputs = np.reshape(inputs, (1, self.image_size, self.image_size, 3))  # add the batch axis
        result = self.detect_from_cvmat(inputs)[0]  # a single image
        for i in range(len(result)):  # possibly several objects of different classes
            result[i][1] *= (1.0 * img_w / self.image_size)  # rescale coordinates from network size back to the original size
            result[i][2] *= (1.0 * img_h / self.image_size)
            result[i][3] *= (1.0 * img_w / self.image_size)
            result[i][4] *= (1.0 * img_h / self.image_size)
        return result

    def detect_from_cvmat(self, inputs):
        # run the network on a batch of images
        net_output = self.sess.run(self.net.logits,
                                   feed_dict={self.net.images: inputs})
        results = []
        for i in range(net_output.shape[0]):  # net_output is batch x 1470
            results.append(self.interpret_output(net_output[i]))  # decode one output vector at a time
        return results  # (batch x num) x (class, xc, yc, w, h, score)

    def interpret_output(self, output):
        # decode the raw output: recover real positions/sizes on the image and the detected classes
        # split the output vector into its parts
        probs = np.zeros((self.cell_size, self.cell_size,
                          self.boxes_per_cell, self.num_class))
        class_probs = np.reshape(
            output[0:self.boundary1],
            (self.cell_size, self.cell_size, self.num_class))
        scales = np.reshape(
            output[self.boundary1:self.boundary2],
            (self.cell_size, self.cell_size, self.boxes_per_cell))
        boxes = np.reshape(
            output[self.boundary2:],
            (self.cell_size, self.cell_size, self.boxes_per_cell, 4))
        offset = np.array(
            [np.arange(self.cell_size)] * self.cell_size * self.boxes_per_cell)
        offset = np.transpose(
            np.reshape(
                offset,
                [self.boxes_per_cell, self.cell_size, self.cell_size]),
            (1, 2, 0))
        boxes[:, :, :, 0] += offset  # add the x cell offsets to recover image positions
        boxes[:, :, :, 1] += np.transpose(offset, (1, 0, 2))  # same for y
        boxes[:, :, :, :2] = 1.0 * boxes[:, :, :, 0:2] / self.cell_size  # real (xc, yc) in image-normalized units
        boxes[:, :, :, 2:] = np.square(boxes[:, :, :, 2:])  # w, h were square-rooted; square them back
        boxes *= self.image_size
        for i in range(self.boxes_per_cell):
            for j in range(self.num_class):
                probs[:, :, i, j] = np.multiply(
                    class_probs[:, :, j], scales[:, :, i])  # class-specific box scores: Pr(Class_i|Object) * confidence
        # threshold: keep only probabilities at or above self.threshold
        filter_mat_probs = np.array(probs >= self.threshold, dtype='bool')
        filter_mat_boxes = np.nonzero(filter_mat_probs)  # indices of the surviving entries
        boxes_filtered = boxes[filter_mat_boxes[0],
                               filter_mat_boxes[1], filter_mat_boxes[2]]  # boxes of the surviving entries, as a 2-D array
        probs_filtered = probs[filter_mat_probs]  # surviving scores, as a 1-D array
        classes_num_filtered = np.argmax(  # class index (argmax over axis 3) at each surviving position
            filter_mat_probs, axis=3)[
            filter_mat_boxes[0], filter_mat_boxes[1], filter_mat_boxes[2]]
        argsort = np.array(np.argsort(probs_filtered))[::-1]  # sort by score, high to low
        boxes_filtered = boxes_filtered[argsort]
        probs_filtered = probs_filtered[argsort]
        classes_num_filtered = classes_num_filtered[argsort]
        for i in range(len(boxes_filtered)):  # non-maximum suppression
            if probs_filtered[i] == 0:
                continue
            for j in range(i + 1, len(boxes_filtered)):
                if self.iou(boxes_filtered[i], boxes_filtered[j]) > self.iou_threshold:  # overlap too large: drop the lower-scoring box
                    probs_filtered[j] = 0.0
        filter_iou = np.array(probs_filtered > 0.0, dtype='bool')  # each remaining nonzero score is one detected object
        boxes_filtered = boxes_filtered[filter_iou]  # the final boxes
        probs_filtered = probs_filtered[filter_iou]  # the final scores
        classes_num_filtered = classes_num_filtered[filter_iou]  # the final classes
        '''
        Example of how indexing with a nonzero/boolean mask flattens a matrix:
        import numpy as np
        a = np.array([[0,0,3],[0,0,0],[0,0,9]])
        b = np.nonzero(a)  # (array([0, 2]), array([2, 2])): indices of the nonzero elements
        c = a > 0          # boolean mask of the same shape as a
        a[c]               # array([3, 9]) -- flattened to 1-D
        a[b]               # same result as a[c]: array([3, 9])
        '''
        result = []
        for i in range(len(boxes_filtered)):
            result.append(
                [self.classes[classes_num_filtered[i]],
                 boxes_filtered[i][0],
                 boxes_filtered[i][1],
                 boxes_filtered[i][2],
                 boxes_filtered[i][3],
                 probs_filtered[i]])
        return result  # num x (class, xc, yc, w, h, score)

    def iou(self, box1, box2):
        # overlap of two (xc, yc, w, h) boxes
        tb = min(box1[0] + 0.5 * box1[2], box2[0] + 0.5 * box2[2]) - \
            max(box1[0] - 0.5 * box1[2], box2[0] - 0.5 * box2[2])
        lr = min(box1[1] + 0.5 * box1[3], box2[1] + 0.5 * box2[3]) - \
            max(box1[1] - 0.5 * box1[3], box2[1] - 0.5 * box2[3])
        inter = 0 if tb < 0 or lr < 0 else tb * lr
        return inter / (box1[2] * box1[3] + box2[2] * box2[3] - inter)

    def camera_detector(self, cap, wait=10):
        # detect from a camera stream, timing each detection
        detect_timer = Timer()
        ret, _ = cap.read()
        while ret:
            ret, frame = cap.read()
            detect_timer.tic()
            result = self.detect(frame)
            detect_timer.toc()
            print('Average detecting time: {:.3f}s'.format(
                detect_timer.average_time))
            self.draw_result(frame, result)
            cv2.imshow('Camera', frame)
            cv2.waitKey(wait)
            ret, frame = cap.read()

    def image_detector(self, imname, wait=0):
        # detect on a single image file, timing the detection
        detect_timer = Timer()
        image = cv2.imread(imname)
        detect_timer.tic()
        result = self.detect(image)
        detect_timer.toc()
        print('Average detecting time: {:.3f}s'.format(
            detect_timer.average_time))
        self.draw_result(image, result)
        cv2.imshow('Image', image)
        cv2.waitKey(wait)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', default="YOLO_small.ckpt", type=str)
    parser.add_argument('--weight_dir', default='weights', type=str)
    parser.add_argument('--data_dir', default="data", type=str)
    parser.add_argument('--gpu', default='', type=str)
    args = parser.parse_args()
    os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu
    yolo = YOLONet(False)  # False -- inference, True -- training
    weight_file = os.path.join(args.data_dir, args.weight_dir, args.weights)  # the trained weight file
    detector = Detector(yolo, weight_file)
    # detect from camera  # for live camera detection
    # cap = cv2.VideoCapture(-1)
    # detector.camera_detector(cap)
    # detect from image file  # test on a still image
    imname = 'test/person.jpg'
    detector.image_detector(imname)


if __name__ == '__main__':
    main()