Use the command below to train, in one go, an object detection network whose performance is roughly on par with YOLO.
CUDA_VISIBLE_DEVICES='0' python train.py
Young people these days just don't play by the rules.
Of course, don't just sit around: the content for Task 4 corresponds to Section 3.6 of 《动手学CV-Pytorch》:
Training and Testing.
Learning tasks:
Complete the training of our simplified object detection network Tiny_Detector
Evaluate the trained network
While training runs, read through the code to reinforce what this chapter introduced
If you have the energy, think about how the network could be improved and try it out.
The blue portions are my own notes.
In the previous sections we covered all the important concepts behind training an object detector; now we string the whole pipeline together and actually train the model.
Training an object detection network roughly follows this workflow:
First, we import the necessary libraries and set the various hyperparameters.
import time
import torch.backends.cudnn as cudnn
import torch.optim
import torch.utils.data
from model import tiny_detector, MultiBoxLoss
from datasets import PascalVOCDataset
from utils import *
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cudnn.benchmark = True
# Data parameters
data_folder = '../../../dataset/VOCdevkit' # data files root path
keep_difficult = True # use objects considered difficult to detect?
n_classes = len(label_map) # number of different types of objects
# Learning parameters
total_epochs = 230 # number of epochs to train
batch_size = 32 # batch size
workers = 4 # number of workers for loading data in the DataLoader
print_freq = 100 # print training status every __ batches
lr = 1e-3 # learning rate
decay_lr_at = [150, 190] # decay learning rate after these many epochs
decay_lr_to = 0.1 # decay learning rate to this fraction of the existing learning rate
momentum = 0.9 # momentum
weight_decay = 5e-4 # weight decay
Note on decay_lr_at = [150, 190]: the learning rate is multiplied by decay_lr_to at epoch 150 and again at epoch 190. [150, 190] are two separate epochs, not a range, so after epoch 190 the learning rate is 0.1 × 0.1 = 0.01 of its original value.
weight_decay is the weight-decay coefficient, i.e. L2 regularization.
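For reference, a minimal sketch of what the adjust_learning_rate helper called in main() below (it lives in utils.py) might look like, assuming it simply scales every parameter group's learning rate by the given factor:

def adjust_learning_rate(optimizer, scale):
    # scale the learning rate of every parameter group by `scale`
    for param_group in optimizer.param_groups:
        param_group['lr'] = param_group['lr'] * scale
    print("DECAYING learning rate: the new LR is %f" % (optimizer.param_groups[0]['lr'],))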
Following the workflow outlined above, the training code is written as follows:
def main():
"""
Training.
"""
# Initialize model and optimizer
model = tiny_detector(n_classes=n_classes)
criterion = MultiBoxLoss(priors_cxcy=model.priors_cxcy)
optimizer = torch.optim.SGD(params=model.parameters(),
lr=lr,
momentum=momentum,
weight_decay=weight_decay)
# Move to default device
model = model.to(device)
criterion = criterion.to(device)
# Custom dataloaders
train_dataset = PascalVOCDataset(data_folder,
split='train',
keep_difficult=keep_difficult)
train_loader = torch.utils.data.DataLoader(train_dataset,
batch_size=batch_size,
shuffle=True,
collate_fn=train_dataset.collate_fn,
num_workers=workers,
pin_memory=True)
# Epochs
for epoch in range(total_epochs):
# Decay learning rate at particular epochs
if epoch in decay_lr_at:
adjust_learning_rate(optimizer, decay_lr_to)
# One epoch's training
train(train_loader=train_loader,
model=model,
criterion=criterion,
optimizer=optimizer,
epoch=epoch)
# Save checkpoint
save_checkpoint(epoch, model, optimizer)
A very standard training loop: it decays the learning rate at the scheduled epochs and saves the model and its parameters after every epoch.
The training logic for a single epoch is wrapped in its own function, implemented as follows:
def train(train_loader, model, criterion, optimizer, epoch):
"""
One epoch's training.
:param train_loader: DataLoader for training data
:param model: model
:param criterion: MultiBox loss
:param optimizer: optimizer
:param epoch: epoch number
"""
model.train() # training mode enables dropout
batch_time = AverageMeter() # forward prop. + back prop. time
data_time = AverageMeter() # data loading time
losses = AverageMeter() # loss
start = time.time()
# Batches
for i, (images, boxes, labels, _) in enumerate(train_loader):
data_time.update(time.time() - start)
# Move to default device
images = images.to(device) # (batch_size (N), 3, 224, 224)
boxes = [b.to(device) for b in boxes]
labels = [l.to(device) for l in labels]
# Forward prop.
predicted_locs, predicted_scores = model(images) # (N, 441, 4), (N, 441, n_classes)
# Loss
loss = criterion(predicted_locs, predicted_scores, boxes, labels) # scalar
# Backward prop.
optimizer.zero_grad()
loss.backward()
# Update model
optimizer.step()
losses.update(loss.item(), images.size(0))
batch_time.update(time.time() - start)
start = time.time()
# Print status
if i % print_freq == 0:
print('Epoch: [{0}][{1}/{2}]\t'
'Batch Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
'Data Time {data_time.val:.3f} ({data_time.avg:.3f})\t'
'Loss {loss.val:.4f} ({loss.avg:.4f})\t'.format(epoch,
i,
len(train_loader),
batch_time=batch_time,
data_time=data_time,
loss=losses))
del predicted_locs, predicted_scores, images, boxes, labels # free some memory since their histories may be stored
Let's take a look at AverageMeter().
This small class is used for bookkeeping; here it records timing and the loss.
class AverageMeter(object):
"""
Keeps track of most recent, average, sum, and count of a metric.
"""
def __init__(self):
self.reset()
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
def update(self, val, n=1):
self.val = val
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
When recording time, update is called with n=1; when recording the loss, it is called with n=images.size(0). Very concise: one small class handles the bookkeeping for both time and loss.
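A quick illustrative usage (the numbers are made up):

losses = AverageMeter()
losses.update(2.5, n=32)          # a batch of 32 images with mean loss 2.5
losses.update(2.1, n=32)
print(losses.val, losses.avg)     # 2.1 (most recent value), 2.3 (running average)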
Next, let's look at how the model checkpoint is saved.
def save_checkpoint(epoch, model, optimizer):
"""
Save model checkpoint.
:param epoch: epoch number
:param model: model
:param optimizer: optimizer
"""
state = {
'epoch': epoch,
'model': model,
'optimizer': optimizer}
filename = 'checkpoint.pth.tar'
torch.save(state, filename)
This approach saves not only the model but also the epoch and the optimizer, bundled together in one dictionary. Knowing this structure, we can make use of it.
Sometimes the machine can't run for long stretches, so training has to be saved and resumed later. I added the following snippet for my own laptop (running this on a laptop feels a bit like machine abuse):
import os  # explicit import; os may already be available via `from utils import *`

start_epoch = 0
if os.path.exists('checkpoint.pth.tar'):
checkpoint = torch.load('checkpoint.pth.tar')
model = checkpoint["model"]
start_epoch = checkpoint["epoch"]+1
optimizer = checkpoint["optimizer"]
Insert this snippet into main() right after the dataloaders are created, and change the epoch loop to for epoch in range(start_epoch, total_epochs); training can then be interrupted and resumed at will.
Why epoch + 1 for start_epoch?
Because if, say, epoch 5 finished and the run was interrupted during epoch 6, the checkpoint stores epoch = 5, but we actually want to resume from epoch 6, hence the +1.
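As a side note, saving the whole model and optimizer objects works, but it ties the checkpoint to the exact class definitions. A more portable variant (a sketch, assuming you are willing to rebuild the model and optimizer before loading) stores only the state dicts:

def save_checkpoint_state(epoch, model, optimizer, filename='checkpoint_state.pth.tar'):
    state = {'epoch': epoch,
             'model_state': model.state_dict(),
             'optimizer_state': optimizer.state_dict()}
    torch.save(state, filename)

# resuming: rebuild the objects first, then restore their states
# model = tiny_detector(n_classes=n_classes).to(device)
# optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum, weight_decay=weight_decay)
# checkpoint = torch.load('checkpoint_state.pth.tar')
# model.load_state_dict(checkpoint['model_state'])
# optimizer.load_state_dict(checkpoint['optimizer_state'])
# start_epoch = checkpoint['epoch'] + 1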
With the code in place we can start training; the training log looks roughly like the figure below:
0.399 s per batch refers to each i in the loop; a line is printed only when i is a multiple of 100, so consecutive printouts are actually about 39.9 s apart.
By my estimate the full run takes about 13.5 hours to finish.
All that's left now is to wait~
As mentioned earlier, the model does not predict box coordinates directly; it predicts encoded offsets relative to the anchors. So the first post-processing step is to decode the output of the regression head into actual predicted boxes.
What else does post-processing need to do? Because we laid down a large number of prior boxes, prediction produces many highly overlapping boxes around each object, while we only want to keep one sufficiently accurate box per object. We therefore need an algorithm to de-duplicate the detections. That algorithm is NMS (Non-Maximum Suppression); let's go through it in detail.
The rough steps of NMS are:
First, use a confidence threshold to discard boxes with very low confidence (this removes most irrelevant boxes). Then take the remaining box with the highest score as the candidate, and delete every other box whose IoU with this candidate exceeds a threshold.
Boxes with high mutual IoU form clusters. As in the figure below, there are 3 clusters; within each cluster the boxes overlap heavily, so we keep only the highest-confidence one (the black box) and delete the rest, since they are essentially the same detection.
Then sort the remaining boxes, take the one with the highest score, suppress its heavy overlaps again, and repeat until no unprocessed boxes are left.
The boxes kept this way are the final detections; multiple kept boxes mean multiple objects.
Doing this for every cluster gives the result shown in the next figure.
That completes the NMS procedure.
Of course, if the boxes are all very close to each other, only one box may survive, in which case we conclude the image contains a single object of that class.
Remember the first step, filtering out boxes below the confidence threshold? What happens if nothing survives the filter?
We simply assign label 0 and treat it as background; detect_objects below does exactly that.
But for our multi-class detector, is that all? For a task like face detection, yes. But remember we have 20 classes (background aside): so far we have only done NMS for one class, and the same loop has to run for the remaining 19 classes (19 more rounds of NMS)... and finally all kept boxes are drawn back onto the original image.
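In practice you rarely need to hand-roll NMS; torchvision ships an implementation. A minimal per-class usage sketch (the boxes and scores below are made up; boxes are (x1, y1, x2, y2) tensors):

import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 98., 102.],
                      [200., 200., 260., 280.]])
scores = torch.tensor([0.90, 0.80, 0.75])
keep = nms(boxes, scores, iou_threshold=0.5)   # indices of the boxes that survive suppression
print(keep)                                    # tensor([0, 2]): box 1 overlaps box 0 too much and is removed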
The full post-processing implementation lives in the detect_objects function of the tiny_detector class in model.py.
First be clear about the goal: given some images, draw boxes around the objects in them; that is the end purpose.
To draw boxes we first have to find them, and detect_objects is the step that finds the objects in a given image together with their labels and confidences.
The code takes some effort to read; it is shown in full first, with explanations afterwards.
def detect_objects(self, predicted_locs, predicted_scores, min_score, max_overlap, top_k):
"""
Decipher the 441 locations and class scores (output of the tiny_detector) to detect objects.
For each class, perform Non-Maximum Suppression (NMS) on boxes that are above a minimum threshold.
:param predicted_locs: predicted locations/boxes w.r.t the 441 prior boxes, a tensor of dimensions (N, 441, 4)
:param predicted_scores: class scores for each of the encoded locations/boxes, a tensor of dimensions (N, 441, n_classes)
:param min_score: minimum threshold for a box to be considered a match for a certain class
:param max_overlap: maximum overlap two boxes can have so that the one with the lower score is not suppressed via NMS
:param top_k: if there are a lot of resulting detection across all classes, keep only the top 'k'
:return: detections (boxes, labels, and scores), lists of length batch_size
"""
batch_size = predicted_locs.size(0)
n_priors = self.priors_cxcy.size(0)
predicted_scores = F.softmax(predicted_scores, dim=2) # (N, 441, n_classes)
# Lists to store final predicted boxes, labels, and scores for all images in batch
all_images_boxes = list()
all_images_labels = list()
all_images_scores = list()
assert n_priors == predicted_locs.size(1) == predicted_scores.size(1)
for i in range(batch_size):
# Decode object coordinates from the form we regressed predicted boxes to
decoded_locs = cxcy_to_xy(
gcxgcy_to_cxcy(predicted_locs[i], self.priors_cxcy)) # (441, 4), these are fractional pt. coordinates
# Lists to store boxes and scores for this image
image_boxes = list()
image_labels = list()
image_scores = list()
max_scores, best_label = predicted_scores[i].max(dim=1) # (441)
# Check for each class
for c in range(1, self.n_classes):
# Keep only predicted boxes and scores where scores for this class are above the minimum score
class_scores = predicted_scores[i][:, c] # (441)
score_above_min_score = class_scores > min_score # torch.uint8 (byte) tensor, for indexing
n_above_min_score = score_above_min_score.sum().item()
if n_above_min_score == 0:
continue
class_scores = class_scores[score_above_min_score] # (n_qualified), n_min_score <= 441
class_decoded_locs = decoded_locs[score_above_min_score] # (n_qualified, 4)
# Sort predicted boxes and scores by scores
class_scores, sort_ind = class_scores.sort(dim=0, descending=True) # (n_qualified), (n_min_score)
class_decoded_locs = class_decoded_locs[sort_ind] # (n_min_score, 4)
# Find the overlap between predicted boxes
overlap = find_jaccard_overlap(class_decoded_locs, class_decoded_locs) # (n_qualified, n_min_score)
# Non-Maximum Suppression (NMS)
# A torch.uint8 (byte) tensor to keep track of which predicted boxes to suppress
# 1 implies suppress, 0 implies don't suppress
suppress = torch.zeros((n_above_min_score), dtype=torch.uint8).to(device) # (n_qualified)
# Consider each box in order of decreasing scores
for box in range(class_decoded_locs.size(0)):
# If this box is already marked for suppression
if suppress[box] == 1:
continue
# Suppress boxes whose overlaps (with current box) are greater than maximum overlap
# Find such boxes and update suppress indices
suppress = torch.max(suppress, (overlap[box] > max_overlap).to(torch.uint8))
# The max operation retains previously suppressed boxes, like an 'OR' operation
# Don't suppress this box, even though it has an overlap of 1 with itself
suppress[box] = 0
# Store only unsuppressed boxes for this class
image_boxes.append(class_decoded_locs[1 - suppress])
image_labels.append(torch.LongTensor((1 - suppress).sum().item() * [c]).to(device))
image_scores.append(class_scores[1 - suppress])
# If no object in any class is found, store a placeholder for 'background'
if len(image_boxes) == 0:
image_boxes.append(torch.FloatTensor([[0., 0., 1., 1.]]).to(device))
image_labels.append(torch.LongTensor([0]).to(device))
image_scores.append(torch.FloatTensor([0.]).to(device))
# Concatenate into single tensors
image_boxes = torch.cat(image_boxes, dim=0) # (n_objects, 4)
image_labels = torch.cat(image_labels, dim=0) # (n_objects)
image_scores = torch.cat(image_scores, dim=0) # (n_objects)
n_objects = image_scores.size(0)
# Keep only the top k objects
if n_objects > top_k:
image_scores, sort_ind = image_scores.sort(dim=0, descending=True)
image_scores = image_scores[:top_k] # (top_k)
image_boxes = image_boxes[sort_ind][:top_k] # (top_k, 4)
image_labels = image_labels[sort_ind][:top_k] # (top_k)
# Append to lists that store predicted boxes and scores for all images
all_images_boxes.append(image_boxes)
all_images_labels.append(image_labels)
all_images_scores.append(image_scores)
return all_images_boxes, all_images_labels, all_images_scores # lists of length batch_size
The NMS part of our post-processing code is admittedly a bit convoluted; for reference, the NMS implementation from Fast R-CNN is cleaner and more concise:
# --------------------------------------------------------
# Fast R-CNN
# Copyright (c) 2015 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Written by Ross Girshick
# --------------------------------------------------------
import numpy as np
# dets: detected boxes together with their scores
# thresh: IoU threshold
def nms(dets, thresh):
    # box coordinates
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    # box scores
    scores = dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)  # area of each box
    order = scores.argsort()[::-1]         # indices sorted by descending score
    keep = []                              # indices of the boxes we keep
    while order.size > 0:
        i = order[0]       # index of the box with the highest remaining score
        keep.append(i)     # keep it
        # compute the IoU of the remaining boxes with the current box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        # keep only the boxes whose IoU with the current box is below the threshold
        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]
    return keep
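A quick usage sketch of the function above (made-up numbers; it relies on the numpy import from the snippet): each row of dets stacks a box's (x1, y1, x2, y2) with its score in the last column.

dets = np.array([[10, 10, 100, 100, 0.90],
                 [12, 12, 98, 102, 0.80],
                 [200, 200, 260, 280, 0.70]], dtype=np.float32)
print(nms(dets, thresh=0.5))   # [0, 2]: box 1 is suppressed by box 0, box 2 stands alone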
Line-by-line walkthrough of detect_objects. First, what exactly are the input parameters?
predicted_locs
a tensor of shape (N, 441, 4): every prior box gets one predicted box. N is the number of images (we feed a batch), 441 is the number of priors, and 4 are the regressed coordinates in the encoded format $(g_{cx}, g_{cy}, g_{w}, g_{h})$, which still need to be decoded.
predicted_scores
a tensor of shape (N, 441, 21): the 21 entries are the per-class confidences, from which each prior box gets its class.
min_score
only boxes scoring above this are considered to belong to a class; effectively the confidence threshold.
max_overlap
boxes overlapping more than this threshold get suppressed by NMS.
top_k
if an image produces many detections, keep only the top k.
batch_size = predicted_locs.size(0)
n_priors = self.priors_cxcy.size(0)
predicted_scores = F.softmax(predicted_scores, dim=2) # (N, 441, n_classes)
predicted_locs has shape (N, 441, 4), so batch_size equals N (the number of images; they are processed one at a time in the loop below).
self.priors_cxcy has shape (441, 4), so n_priors is 441.
The third line applies softmax along the class dimension, turning the raw class scores into per-class probabilities (summing to 1), which is what the subsequent confidence filtering operates on.
assert n_priors == predicted_locs.size(1) == predicted_scores.size(1)
A sanity check on the shapes; if they don't match, an error is raised.
Marker 1
for i in range(batch_size):
# Decode object coordinates from the form we regressed predicted boxes to (decode the offsets)
decoded_locs = cxcy_to_xy(
gcxgcy_to_cxcy(predicted_locs[i], self.priors_cxcy)) # (441, 4), these are fractional pt. coordinates
# Lists to store boxes and scores for this image
image_boxes = list()
image_labels = list()
image_scores = list()
This is the outer loop: each iteration handles one image, and the box coordinates get decoded into (x_min, y_min, x_max, y_max) form.
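For reference, a minimal sketch of what the two decoding helpers from utils.py typically look like in SSD-style code, assuming the common variance factors of 10 (for centers) and 5 (for sizes); the exact constants in the tutorial's utils.py may differ:

import torch

def gcxgcy_to_cxcy(gcxgcy, priors_cxcy):
    # undo the encoding: offsets are relative to each prior's center and size
    return torch.cat([gcxgcy[:, :2] * priors_cxcy[:, 2:] / 10 + priors_cxcy[:, :2],  # c_x, c_y
                      torch.exp(gcxgcy[:, 2:] / 5) * priors_cxcy[:, 2:]], dim=1)      # w, h

def cxcy_to_xy(cxcy):
    # (c_x, c_y, w, h) -> (x_min, y_min, x_max, y_max)
    return torch.cat([cxcy[:, :2] - cxcy[:, 2:] / 2,
                      cxcy[:, :2] + cxcy[:, 2:] / 2], dim=1)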
max_scores, best_label = predicted_scores[i].max(dim=1)
This line appears to be leftover, unused code; you can simply ignore it. (It shows up greyed out in the IDE and is never used afterwards.)
Marker 2
for c in range(1, self.n_classes):
Everything below is done per class: we take one class at a time, because the confidence filtering and sorting are carried out separately for each class.
class_scores = predicted_scores[i][:, c] # (441)
predicted_scores has shape (N, 441, 21); predicted_scores[i][:, c] picks out, for all 441 priors, the confidence for class c. The result therefore has shape (441,), each entry being that prior's confidence for class c.
score_above_min_score = class_scores > min_score
n_above_min_score = score_above_min_score.sum().item()
if n_above_min_score == 0:
continue
The confidence threshold filters out the priors with low confidence for this class c; n_above_min_score is the number of priors whose confidence exceeds the threshold (remember we operate on 441 priors, so after filtering at most 441 remain).
If no prior qualifies for class c, we skip straight to the next class (back to Marker 2).
score_above_min_score is a boolean/byte mask, used for indexing below.
class_scores = class_scores[score_above_min_score] # (n_qualified), n_min_score <= 441
class_decoded_locs = decoded_locs[score_above_min_score] # (n_qualified, 4)
# Sort predicted boxes and scores by scores
class_scores, sort_ind = class_scores.sort(dim=0, descending=True) # (n_qualified),
class_decoded_locs = class_decoded_locs[sort_ind] # (n_min_score, 4)
The mask indexes out the confidences class_scores and the boxes class_decoded_locs of the priors that pass the threshold for class c.
The confidences are then sorted in descending order, returning both the sorted scores and the sort indices (the indices are used to reorder the box coordinates accordingly).
overlap = find_jaccard_overlap(class_decoded_locs, class_decoded_locs)# (n_qualified, n_min_score)
In the code comments, n_qualified (boxes above the confidence threshold) and n_min_score are the same quantity. Note that each prior corresponds to one decoded box, and the IoU is computed between the decoded boxes, not between the priors.
This line computes the pairwise IoU among all of class c's candidate boxes.
Next comes non-maximum suppression to remove duplicate boxes; in suppress (shape (n_qualified,)), a value of 1 means that box is suppressed.
suppress = torch.zeros((n_above_min_score), dtype=torch.uint8).to(device) # (n_qualified)
suppress is created first, initialized to all zeros; an entry set to 1 means the corresponding box gets suppressed.
for box in range(class_decoded_locs.size(0)): # class_decoded_locs.size(0) is n_qualified
# If this box is already marked for suppression
if suppress[box] == 1:
continue
# Suppress boxes whose overlaps (with current box) are greater than maximum overlap
# Find such boxes and update suppress indices
suppress = torch.max(suppress, (overlap[box] > max_overlap).to(torch.uint8))
# The max operation retains previously suppressed boxes, like an 'OR' operation
# Don't suppress this box, even though it has an overlap of 1 with itself
suppress[box] = 0
Let's walk through the process with the figure. Suppose the confidence-threshold filtering has already been applied and only 3 boxes remain, numbered 1, 2 and 3. for box in range(class_decoded_locs.size(0)) takes one box at a time and runs the logic against the others.
Box 1 first (box = 0):
Initially suppress = [0, 0, 0], and say overlap[box] = [1, 0.6, 0.55] (the IoU with itself is 1).
(overlap[box] > max_overlap).to(torch.uint8)
then evaluates to [1, 1, 1], so after
suppress = torch.max(suppress, (overlap[box] > max_overlap).to(torch.uint8))
we have suppress = [1, 1, 1].
But a box's IoU with itself is 1 and we must not suppress the current box itself, so its own entry is reset with
suppress[box] = 0
The first iteration ends with suppress = [0, 1, 1].
Box 2 (box = 1):
Since suppress = [0, 1, 1], the check
if suppress[box] == 1:
continue
fires, and box 2 is skipped.
Box 3 (box = 2): also skipped.
So only one box remains unsuppressed; the others have all been suppressed.
Of course NMS can also keep two (or more) boxes: the image above, after NMS, keeps 2 boxes (2 objects),
in which case suppress = [0, 1, 0].
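A tiny check of the "max as OR" trick on made-up values:

import torch

suppress = torch.tensor([0, 1, 0], dtype=torch.uint8)
new_hits = torch.tensor([1, 0, 0], dtype=torch.uint8)
print(torch.max(suppress, new_hits))   # tensor([1, 1, 0], dtype=torch.uint8): element-wise OR of the two masks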
Suppose n objects are found for the current class c; n is used in the shape comments below.
image_boxes.append(class_decoded_locs[1 - suppress])#(n,4)
image_labels.append(torch.LongTensor((1 - suppress).sum().item() * [c]).to(device))#(n)
image_scores.append(class_scores[1 - suppress])#(n)
Only now is detection finished for a single class c; there are 20 classes in total (background is not detected separately).
We go back to Marker 2 and run detection for the next class.
torch.LongTensor((1 - suppress).sum().item() * [c]).to(device)
produces the list [c, c, c, ...] with n copies of c, turned into a tensor.
# If no object in any class is found, store a placeholder for 'background'
if len(image_boxes) == 0:
image_boxes.append(torch.FloatTensor([[0., 0., 1., 1.]]).to(device))
image_labels.append(torch.LongTensor([0]).to(device))
image_scores.append(torch.FloatTensor([0.]).to(device))
If no object was found for any class, a placeholder box covering the whole image is stored with class 0 (background).
# Concatenate into single tensors
image_boxes = torch.cat(image_boxes, dim=0) # (n_objects, 4)
image_labels = torch.cat(image_labels, dim=0) # (n_objects)
image_scores = torch.cat(image_scores, dim=0) # (n_objects)
n_objects = image_scores.size(0)
Note that torch.cat here is used differently from the torch.stack we used before (different from the usage in the Task01 notes, 两个年轻人-目标检测基础和VOC数据集).
When a list holds N tensors of identical shape, say (n, 4), torch.stack along a new dimension 0 produces an (N, n, 4) tensor; that is what we did previously.
Here, however, the number of detections differs per class c, so the list holds tensors of shape $(n_c, 4)$ where $n_c$ depends on the class; torch.cat along dimension 0 joins them into a single tensor of shape (n_objects, 4),
where $n\_objects = \sum_{c=1}^{20} n_c$.
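A tiny illustration of the difference, with made-up shapes:

import torch

a = torch.zeros(3, 4)    # e.g. 3 detections of one class
b = torch.zeros(5, 4)    # 5 detections of another class
print(torch.cat([a, b], dim=0).shape)    # torch.Size([8, 4]): lengths along dim 0 may differ
print(torch.stack([a, a], dim=0).shape)  # torch.Size([2, 3, 4]): requires identical shapes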
if n_objects > top_k:
image_scores, sort_ind = image_scores.sort(dim=0, descending=True)
image_scores = image_scores[:top_k] # (top_k)
image_boxes = image_boxes[sort_ind][:top_k] # (top_k, 4)
image_labels = image_labels[sort_ind][:top_k] # (top_k)
If an image has more detections than the limit, only the top_k highest-scoring ones are kept.
all_images_boxes.append(image_boxes)
all_images_labels.append(image_labels)
all_images_scores.append(image_scores)
Store this image's boxes, labels and confidences.
That finally finishes detection for one image; we then return to Marker 1 and process the second, ..., N-th image.
return all_images_boxes, all_images_labels, all_images_scores # lists of length batch_size
In the end we get what we wanted.
Note:
all_images_boxes is a list of length N containing tensors of shape (n_objects, 4), where n_objects varies from image to image rather than being fixed.
Working through detect_objects was hard going; the code is indeed much more involved than the NumPy implementation, but it is also a good opportunity to sharpen one's code-reading skills.
Next, let's see how to run inference on a single image and obtain its detection result.
First we import the necessary Python packages and load the trained model weights.
Then we define the preprocessing. For the best results, the test-time preprocessing must stay consistent with training, simply with the data-augmentation transforms removed.
The preprocessing we need here is therefore:
# Set detect transforms (It's important to be consistent with training)
resize = transforms.Resize((224, 224))
to_tensor = transforms.ToTensor()
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
Next comes the inference itself, which is straightforward; the core steps are summarized below, and the core code is as follows:
# Transform the image
image = normalize(to_tensor(resize(original_image)))
# Move to default device
image = image.to(device)
# Forward prop.
predicted_locs, predicted_scores = model(image.unsqueeze(0))
# Post process, get the final detect objects from our tiny detector output
det_boxes, det_labels, det_scores = model.detect_objects(predicted_locs, predicted_scores, min_score=min_score, max_overlap=max_overlap, top_k=top_k)
The detect_objects function here performs the post-processing of the model's predictions. It does two main things: first it decodes the model output into predicted boxes carrying concrete position information, then it runs per-class NMS over all predicted boxes to filter out redundant detections, exactly what the previous subsection described.
detect.py
def detect(original_image, min_score, max_overlap, top_k):
"""
Detect objects in an image with a trained tiny object detector, and visualize the results.
:param original_image: image, a PIL Image
:param min_score: minimum threshold for a detected box to be considered a match for a certain class
:param max_overlap: maximum overlap two boxes can have so that the one with the lower score is not suppressed via Non-Maximum Suppression (NMS)
:param top_k: if there are a lot of resulting detection across all classes, keep only the top 'k'
:return: annotated image, a PIL Image
"""
# Transform the image
image = normalize(to_tensor(resize(original_image)))
# Move to default device
image = image.to(device)
# Forward prop.
predicted_locs, predicted_scores = model(image.unsqueeze(0))
# Post process, get the final detect objects from our tiny detector output#
det_boxes, det_labels, det_scores = model.detect_objects(predicted_locs, predicted_scores, min_score=min_score,
max_overlap=max_overlap, top_k=top_k)
# Move detect results to the CPU
det_boxes = det_boxes[0].to('cpu')
det_labels = det_labels[0].to('cpu').tolist()
# Transform det_boxes to original image dimensions
original_dims = torch.FloatTensor(
[original_image.width, original_image.height, original_image.width, original_image.height]).unsqueeze(0)
det_boxes = det_boxes * original_dims  # rescale the normalized (x1, y1, x2, y2) back to real coordinates in the original image
# Decode class integer labels, for example: 12 -> dog, 15 -> person (the rev_label_map dict lives in utils.py)
det_labels = [rev_label_map[l] for l in det_labels]
# If no objects found, the detected labels will be set to ['0.']
# you can find detail in tiny_detector.detect_objects() in model.py
if det_labels == ['background']:
# Just return original image
return original_image
# Annotate detect result on original image
annotated_image = original_image  # from here on we draw boxes and labels on the original image; everything below is PIL drawing
draw = ImageDraw.Draw(annotated_image)
font = ImageFont.load_default()
for i in range(det_boxes.size(0)):
box_location = det_boxes[i].tolist()
# draw detect box
draw.rectangle(xy=box_location, outline=label_color_map[det_labels[i]])
draw.rectangle(xy=[l + 1. for l in box_location], outline=label_color_map[det_labels[i]])
# a second rectangle at an offset of 1 pixel to increase line thickness
# draw label Text
text_size = font.getsize(det_labels[i].upper())
text_location = [box_location[0] + 2., box_location[1] - text_size[1]]
textbox_location = [box_location[0], box_location[1] - text_size[1],
box_location[0] + text_size[0] + 4., box_location[1]]
draw.rectangle(xy=textbox_location, fill=label_color_map[det_labels[i]])
draw.text(xy=text_location, text=det_labels[i].upper(), fill='white', font=font)
del draw
return annotated_image
First, what are the inputs?
def detect(original_image, min_score, max_overlap, top_k):
original_image: the original image (full size, unprocessed, as opened directly with PIL)
min_score: the minimum confidence
max_overlap: the maximum IoU allowed during NMS (above it, the lower-scoring box is suppressed)
top_k: the maximum number of objects to keep/display
# Transform the image
image = normalize(to_tensor(resize(original_image)))
# Move to default device
image = image.to(device)
# Forward prop.
predicted_locs, predicted_scores = model(image.unsqueeze(0))
The model expects a batch, but here we detect one image at a time, so unsqueeze(0) adds a batch dimension and the shape becomes (1, C, H, W).
det_boxes, det_labels, det_scores = model.detect_objects(predicted_locs, predicted_scores, min_score=min_score,
max_overlap=max_overlap, top_k=top_k)
The returned det_boxes is a list of length N containing tensors of shape (n_objects, 4);
det_labels is a list of length N of tensors of shape (n_objects,) whose values are integers 0-20;
det_scores is a list of length N of tensors of shape (n_objects,) whose values are confidences between 0 and 1.
Before reading the code below, keep two things in mind:
Since we are detecting on a single image, indexing with [0] pulls out the first image's results (in the if __name__ == '__main__': block we also pass in just one image).
Also be clear about what values these boxes hold. In
decoded_locs = cxcy_to_xy(gcxgcy_to_cxcy(predicted_locs[i], self.priors_cxcy))
gcxgcy_to_cxcy decodes the offsets into normalized boxes on the image, and cxcy_to_xy then gives normalized boxes in (x1, y1, x2, y2) form. Remember that the resize step in the transform function normalized the box coordinates by old_dim; now we have to undo that normalization.
det_boxes = det_boxes[0].to('cpu')
det_labels = det_labels[0].to('cpu').tolist()
# Transform det_boxes to original image dimensions
original_dims = torch.FloatTensor(
[original_image.width, original_image.height, original_image.width, original_image.height]).unsqueeze(0)
det_boxes = det_boxes * original_dims  # rescale the normalized (x1, y1, x2, y2) back to real coordinates in the original image
# Decode class integer labels, for example: 12 -> dog, 15 -> person (the rev_label_map dict lives in utils.py)
det_labels = [rev_label_map[l] for l in det_labels]  # convert the integers to English labels (now it's clear why label_map was inverted back then)
After that, if no boxes were found the original image is returned as-is.
What remains is to iterate over the boxes detected in this single image and draw the boxes and labels with PIL.
With detect.py understood, the whole object detection pipeline should now be clear.
Finally, we draw the resulting detection boxes, obtaining results like the figure below:
The full code is in the detect.py script; below are more prediction results on images from the VOC test set:
As you can see, our tiny_detector model does reasonably well on some simple test images. Predictions on harder images look like this:
You can see that, faced with slightly more challenging images, our detector starts to expose all sorts of problems, including but not limited to:
Go run detect.py and see how your own trained model does. What problems do you observe, and what ideas do you have for improving it?
So I tried the freshly trained model with the original parameters, no tuning at all.
One of the dogs went undetected (somewhat short of expectations).
Overall impression: boxes on simple objects are fairly accurate, but on complex scenes with multiple objects the accuracy is not high enough and there is clearly room for improvement.
Take binary classification, the simplest classification setting, as an example. The model has to decide whether each sample is 0 or 1, i.e. positive or negative. From the collected samples we know the ground truth: which are actually positive and which negative. Running the samples through the model also tells us which ones the model calls positive and which negative. We therefore obtain four basic counts, which I call the first-level metrics (the bottom layer):
1) Ground truth positive, predicted positive (True Positive, TP)
2) Ground truth positive, predicted negative (False Negative, FN): the Type II error in statistics
3) Ground truth negative, predicted positive (False Positive, FP): the Type I error in statistics
4) Ground truth negative, predicted negative (True Negative, TN)
In machine learning, the confusion matrix (also called a contingency table or error matrix) is a specific matrix used to visualize an algorithm's performance, typically in supervised learning (for unsupervised learning the analogue is usually the matching matrix). Each column represents the predicted class and each row the actual class. The name comes from how easily it shows whether classes are being confused, i.e. one class being predicted as another.
Second-level metrics: a confusion matrix stores raw counts, and with large amounts of data raw counts alone make it hard to judge a model's quality. So the confusion matrix is extended with four further metrics, which I call the second-level metrics (obtained from the first-level ones by basic arithmetic):
1) Accuracy: over the whole model
2) Precision
3) Sensitivity: the same thing as Recall
4) Specificity
A table summarizing the definition, formula and interpretation of these four metrics:
These four second-level metrics convert the raw counts of the confusion matrix into ratios between 0 and 1, which makes standardized comparison possible.
Third-level metric: the F1 Score, computed as:
F1 Score = 2PR / (P + R)
where P is Precision and R is Recall. The F1 score combines precision and recall into a single number; it ranges from 0 to 1, with 1 being the best possible output and 0 the worst.
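A quick made-up example to make the formulas concrete, assuming TP = 8, FP = 2, FN = 4:

tp, fp, fn = 8, 2, 4                                 # made-up counts
precision = tp / (tp + fp)                           # 0.8
recall = tp / (tp + fn)                              # ~0.667
f1 = 2 * precision * recall / (precision + recall)   # ~0.727
print(precision, recall, f1)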
AP stands for Average Precision.
mAP, Mean Average Precision, is the mean of the per-class AP values, and it is the standard metric for measuring detection accuracy in object detection.
How is AP computed in a detection setting? This is where the P-R curve comes in: a 2D curve with precision on the vertical axis and recall on the horizontal axis, drawn by taking the precision and recall obtained at different confidence thresholds, as shown below:
The overall trend of a P-R curve is that higher precision comes with lower recall. When recall reaches 1, the threshold has dropped to the score of the lowest-scoring positive sample, and precision is at its lowest: the number of positives divided by the number of all samples scoring at or above that threshold. The area enclosed under the P-R curve is the AP value; generally, the better the classifier, the higher the AP.
Summary: in object detection, a P-R curve can be drawn for every class from its recall and precision; AP is the area under that curve, and mAP is the average of the APs over all classes. (This describes the VOC-style mAP; the COCO calculation differs slightly.)
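As a sanity check, here is a tiny sketch of VOC2007-style 11-point AP on made-up cumulative precision/recall arrays; it mirrors the interpolation used inside calculate_mAP further below:

import numpy as np

def voc07_ap(recall, precision):
    # average, over the recall thresholds 0, 0.1, ..., 1.0, of the best precision reached at recall >= threshold
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        mask = recall >= t
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap

recall = np.array([0.2, 0.4, 0.4, 0.6, 0.8])        # made-up cumulative recall
precision = np.array([1.0, 1.0, 0.67, 0.75, 0.8])   # made-up cumulative precision
print(voc07_ap(recall, precision))                  # ~0.745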
Run the eval.py script to evaluate the model on the VOC2007 test set; the results are as follows:
$ python eval.py
...
...
Evaluating: 100%|███████████████████████████████| 78/78 [00:57<00:00, 1.35it/s]
{
'aeroplane': 0.6086561679840088,
'bicycle': 0.7144593596458435,
'bird': 0.5847545862197876,
'boat': 0.44902321696281433,
'bottle': 0.2160634696483612,
'bus': 0.7212041616439819,
'car': 0.629608154296875,
'cat': 0.8124480843544006,
'chair': 0.3599272668361664,
'cow': 0.5980824828147888,
'diningtable': 0.6459739804267883,
'dog': 0.7577021718025208,
'horse': 0.7861635088920593,
'motorbike': 0.702280580997467,
'person': 0.5821948051452637,
'pottedplant': 0.2793791592121124,
'sheep': 0.5655995607376099,
'sofa': 0.708049476146698,
'train': 0.7575671672821045,
'tvmonitor': 0.5641061663627625}
Mean Average Precision (mAP): 0.602
As you can see, the model's mAP score is 60.2, slightly below the classic YOLO network's 63.4; still a respectable score~
We can also observe that a few classes, such as bottle and pottedplant, are detected very poorly, which shows that our model has clear problems with small and densely packed objects.
Running the same mAP evaluation on my own trained model gives the following results:
{
'aeroplane': 0.6288209557533264,
'bicycle': 0.722702145576477,
'bird': 0.5937066078186035,
'boat': 0.4830010235309601,
'bottle': 0.23142732679843903,
'bus': 0.748112678527832,
'car': 0.6528950929641724,
'cat': 0.8354431986808777,
'chair': 0.36506184935569763,
'cow': 0.5821775197982788,
'diningtable': 0.6546671986579895,
'dog': 0.7980186343193054,
'horse': 0.8001546263694763,
'motorbike': 0.7414405345916748,
'person': 0.6052170991897583,
'pottedplant': 0.3198774755001068,
'sheep': 0.5619245171546936,
'sofa': 0.6892690658569336,
'train': 0.7760568261146545,
'tvmonitor': 0.5648713707923889}
Mean Average Precision (mAP): 0.618
Details of the evaluate code. At this point most of the tasks are done, but one important piece remains: how to evaluate these models. One last step, so let's finish it in one go.
Read carefully, this code is actually not that easy either.
def evaluate(test_loader, model):
"""
Evaluate.
:param test_loader: DataLoader for test data
:param model: model
"""
# Make sure it's in eval mode
model.eval()
# Lists to store detected and true boxes, labels, scores
det_boxes = list()
det_labels = list()
det_scores = list()
true_boxes = list()
true_labels = list()
true_difficulties = list() # it is necessary to know which objects are 'difficult', see 'calculate_mAP' in utils.py
with torch.no_grad():
# Batches
for i, (images, boxes, labels, difficulties) in enumerate(tqdm(test_loader, desc='Evaluating')):
images = images.to(device) # (N, 3, 300, 300)
# Forward prop.
predicted_locs, predicted_scores = model(images)
# Post process, get the final detect objects from our tiny detector output
det_boxes_batch, det_labels_batch, det_scores_batch = model.detect_objects(predicted_locs,
predicted_scores, min_score=0.01, max_overlap=0.45, top_k=200)
# Evaluation MUST be at min_score=0.01, max_overlap=0.45, top_k=200 for fair comparision with other repos
# e.g. some classes are only detected at very low confidence thresholds
# e.g. comparing models A and B: for normal detection we might set min_score=0.2, but some object (say a bottle) is only detected by either model at min_score=0.1
# if A detects bottles better but we also evaluate at min_score=0.2, A and B get the same mAP, which is unfair; keeping min_score small makes the comparison fair
# Store this batch's predict results for mAP calculation
det_boxes.extend(det_boxes_batch)
det_labels.extend(det_labels_batch)
det_scores.extend(det_scores_batch)
# Store this batch's ground-truth results for mAP calculation
boxes = [b.to(device) for b in boxes]
labels = [l.to(device) for l in labels]
difficulties = [d.to(device) for d in difficulties]
true_boxes.extend(boxes)
true_labels.extend(labels)
true_difficulties.extend(difficulties)
# Calculate mAP
APs, mAP = calculate_mAP(det_boxes, det_labels, det_scores, true_boxes, true_labels, true_difficulties)
# Print AP for each class
pp.pprint(APs)
print('\nMean Average Precision (mAP): %.3f' % mAP)
Taken as a whole this part is fairly concise: get the data and labels from the test dataloader, run the data through the model, feed the model outputs and the labels into calculate_mAP to compute the mAP, and print the result.
Looking closely at the code, however, there are still quite a few details.
det_boxes_batch, det_labels_batch, det_scores_batch = model.detect_objects(predicted_locs, predicted_scores,
min_score=0.01, max_overlap=0.45, top_k=200)
At prediction time we set min_score=0.2; why does evaluation switch to min_score=0.01?
(The comment in the code only hints at it.) Combining the source of calculate_mAP, the mAP output and that brief explanation, here is my own take on why min_score becomes 0.01:
Some classes can only be detected at very low confidence.
For example, compare models A and B. For normal detection we might set min_score=0.2, but suppose some object (say a bottle) is only detected by either model when min_score=0.1 (and assume the two models behave identically on everything else).
Suppose A actually detects bottles better. If we also evaluate at min_score=0.2, A and B end up with the same mAP, which is unfair to A: its better bottle detection simply never enters the mAP. Keeping min_score small makes the evaluation fair for all objects.
A follow-up question may come to mind: with min_score=0.01, objects that were already well detected now pick up a few extra boxes; does that hurt their evaluation? (There are more irrelevant boxes.)
Every model gets those extra boxes, so it remains fair; the absolute mAP value may shift a bit, but comparing two models is unaffected.
Why not go all the way to min_score=0? That would also work, but with that many boxes the computation becomes very slow, and it isn't necessary.
So min_score=0.01 is a good compromise. Whether to use 0.02 or 0.03 instead is an empirical choice, trading evaluation accuracy against evaluation speed.
A few more details worth noting:
The boxes returned by model.detect_objects() here are in $(x_1, y_1, x_2, y_2)$ format and are normalized.
Inside calculate_mAP(), all boxes are likewise computed and handled in $(x_1, y_1, x_2, y_2)$ form, so everything operates at the same scale.
The loss function is different: it is computed in the encoded $(g_{cx}, g_{cy}, g_{w}, g_{h})$ space.
How calculate_mAP is implemented: let's look at the full code first.
def calculate_mAP(det_boxes, det_labels, det_scores, true_boxes, true_labels, true_difficulties):
"""
Calculate the Mean Average Precision (mAP) of detected objects.
The mAP metric here follows the VOC2007 standard. Specifically:
IoU > 0.5 is the uniform criterion for whether a detected box counts as a correct match.
AP is computed by averaging the precision at the recall thresholds 0, 0.1, ..., 1 (11-point interpolation).
See https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173 for an explanation
:param det_boxes: list of tensors, one tensor for each image containing detected objects' bounding boxes
:param det_labels: list of tensors, one tensor for each image containing detected objects' labels
:param det_scores: list of tensors, one tensor for each image containing detected objects' labels' scores
:param true_boxes: list of tensors, one tensor for each image containing actual objects' bounding boxes
:param true_labels: list of tensors, one tensor for each image containing actual objects' labels
:param true_difficulties: list of tensors, one tensor for each image containing actual objects' difficulty (0 or 1)
:return: list of average precisions for all classes, mean average precision (mAP)
"""
# make sure all lists of tensors of the same length, i.e. number of images
assert len(det_boxes) == len(det_labels) == len(det_scores) == \
len(true_boxes) == len(true_labels) == len(true_difficulties)
n_classes = len(label_map)
# Pre-processing before the actual computation: mAP does not need the notion of a batch, so torch.cat merges everything into flat tensors,
# which simplifies the later computation (otherwise an extra loop over batches would be needed)
# Store all (true) objects in a single continuous tensor while keeping track of the image it is from
true_images = list()
for i in range(len(true_labels)):
true_images.extend([i] * true_labels[i].size(0))
true_images = torch.LongTensor(true_images).to(device) # (n_objects), n_objects: total num of objects across all images
true_boxes = torch.cat(true_boxes, dim=0) # (n_objects, 4)
true_labels = torch.cat(true_labels, dim=0) # (n_objects)
true_difficulties = torch.cat(true_difficulties, dim=0) # (n_objects)
assert true_images.size(0) == true_boxes.size(0) == true_labels.size(0)
# Store all detections in a single continuous tensor while keeping track of the image it is from
det_images = list()
for i in range(len(det_labels)):
det_images.extend([i] * det_labels[i].size(0))
det_images = torch.LongTensor(det_images).to(device) # (n_detections)
det_boxes = torch.cat(det_boxes, dim=0) # (n_detections, 4)
det_labels = torch.cat(det_labels, dim=0) # (n_detections)
det_scores = torch.cat(det_scores, dim=0) # (n_detections)
assert det_images.size(0) == det_boxes.size(0) == det_labels.size(0) == det_scores.size(0)
# Calculate APs for each class (except background)
average_precisions = torch.zeros((n_classes - 1), dtype=torch.float) # (n_classes - 1)
for c in range(1, n_classes): # compute AP separately for each class
# Extract only objects with this class
true_class_images = true_images[true_labels == c] # (n_class_objects)
true_class_boxes = true_boxes[true_labels == c] # (n_class_objects, 4)
true_class_difficulties = true_difficulties[true_labels == c] # (n_class_objects)
n_easy_class_objects = (1 - true_class_difficulties).sum().item() # ignore difficult objects
# keep a 'detected' flag per ground-truth box, to prevent one true box from being matched by several predicted boxes
# Keep track of which true objects with this class have already been 'detected'
# So far, none
true_class_boxes_detected = torch.zeros((true_class_difficulties.size(0)), dtype=torch.uint8)
true_class_boxes_detected = true_class_boxes_detected.to(device) # (n_class_objects)
# Extract only detections with this class
det_class_images = det_images[det_labels == c] # (n_class_detections)
det_class_boxes = det_boxes[det_labels == c] # (n_class_detections, 4)
det_class_scores = det_scores[det_labels == c] # (n_class_detections)
n_class_detections = det_class_boxes.size(0)
if n_class_detections == 0:
continue
# if the model produced no detections at all for class c, move straight on to the next class
# Sort detections in decreasing order of confidence/scores
det_class_scores, sort_ind = torch.sort(det_class_scores, dim=0, descending=True) # (n_class_detections)
det_class_images = det_class_images[sort_ind] # (n_class_detections)
det_class_boxes = det_class_boxes[sort_ind] # (n_class_detections, 4)
# In the order of decreasing scores, check if true or false positive
# a 1 in true_positives marks that detection as a TP, a 1 in false_positives marks it as an FP; entries left at 0 are neither
true_positives = torch.zeros((n_class_detections), dtype=torch.float).to(device) # (n_class_detections)
false_positives = torch.zeros((n_class_detections), dtype=torch.float).to(device) # (n_class_detections)
for d in range(n_class_detections): # loop over every box the model predicted for class c; each iteration handles one predicted object and compares it with the true objects in its image
this_detection_box = det_class_boxes[d].unsqueeze(0) # (1, 4), the predicted box
this_image = det_class_images[d] # (), scalar
# Find objects in the same image with this class, their difficulties, and whether they have been detected before
object_boxes = true_class_boxes[true_class_images == this_image] # (n_class_objects_in_img, 4)
object_difficulties = true_class_difficulties[true_class_images == this_image] # (n_class_objects_in_img)
# If no such object in this image, then the detection is a false positive
if object_boxes.size(0) == 0:
false_positives[d] = 1
continue
# Find maximum overlap of this detection with objects in this image of this class
overlaps = find_jaccard_overlap(this_detection_box, object_boxes) # (1, n_class_objects_in_img)
max_overlap, ind = torch.max(overlaps.squeeze(0), dim=0) # (), () - scalars: max_overlap is the largest IoU and ind is the index of the best-matching true box
# 'ind' is the index of the object in these image-level tensors 'object_boxes', 'object_difficulties'
# In the original class-level tensors 'true_class_boxes', etc., 'ind' corresponds to object with index...
original_ind = torch.LongTensor(range(true_class_boxes.size(0)))[true_class_images == this_image][ind] # index, in the class-level tensors, of the true box matched to this prediction
# We need 'original_ind' to update 'true_class_boxes_detected'
# Every box the model outputs is, by construction, a box the model considers positive,
# so every predicted box counts as a positive; whether it is a TP or an FP is what the code below decides.
# A predicted box of class c whose best IoU with the ground truth is below 0.5 is declared a false positive (FP).
# Boxes matched to 'difficult' ground-truth objects are not counted.
# If IoU > 0.5 but the matched true box has already been claimed by another predicted box, this box is also an FP:
# one object may keep only one predicted box, the rest are redundant.
# A box with IoU > 0.5 whose true box has not been claimed before becomes a TP.
# When several predictions match the same true box, the first one (in descending score order) is the TP and the later ones are FPs.
# Why not pick the one with the highest IoU? Because once the IoU condition is met, exactly one prediction may claim the object
# and any of them would do; we are counting for the metric, not choosing the best box.
# If the maximum overlap is greater than the threshold of 0.5, it's a match
if max_overlap.item() > 0.5:
# If the object it matched with is 'difficult', ignore it
if object_difficulties[ind] == 0:
# If this object has already not been detected, it's a true positive
if true_class_boxes_detected[original_ind] == 0: # if this true box has not been claimed yet
true_positives[d] = 1 # detection d is a TP
true_class_boxes_detected[original_ind] = 1 # this object has now been detected/accounted for; mark it so it cannot be matched again
# Otherwise, it's a false positive (since this object is already accounted for)
else:
false_positives[d] = 1 # the true box was already claimed; this extra prediction is an FP
# Otherwise, the detection occurs in a different location than the actual object, and is a false positive
else:
false_positives[d] = 1
# (we are now back at the level of the per-class loop for class c)
# compute precision and recall
# Compute cumulative precision and recall at each detection in the order of decreasing scores
cumul_true_positives = torch.cumsum(true_positives, dim=0) # (n_class_detections)
cumul_false_positives = torch.cumsum(false_positives, dim=0) # (n_class_detections)
cumul_precision = cumul_true_positives / (
cumul_true_positives + cumul_false_positives + 1e-10) # (n_class_detections)
cumul_recall = cumul_true_positives / n_easy_class_objects # (n_class_detections)
# compute the AP of this single class c (recall thresholds are set, and the area under the P-R curve is approximated by rectangles)
# Find the mean of the maximum of the precisions corresponding to recalls above the threshold 't'
recall_thresholds = torch.arange(start=0, end=1.1, step=.1).tolist() # (11)
precisions = torch.zeros((len(recall_thresholds)), dtype=torch.float).to(device) # (11)
for i, t in enumerate(recall_thresholds):
recalls_above_t = cumul_recall >= t
if recalls_above_t.any():
precisions[i] = cumul_precision[recalls_above_t].max()
else:
precisions[i] = 0.
average_precisions[c - 1] = precisions.mean() # c is in [1, n_classes - 1]
# average the per-class APs to obtain the mAP
# Calculate Mean Average Precision (mAP)
mean_average_precision = average_precisions.mean().item()
# Keep class-wise average precisions in a dictionary
average_precisions = {
rev_label_map[c + 1]: v for c, v in enumerate(average_precisions.tolist())}
return average_precisions, mean_average_precision
Line-by-line explanations are given in the comments above; here is the overall idea once more.
The tensors true_positives and false_positives are what ultimately yield the TP and FP counts. To spell that matching step out in more detail:
true_positives and false_positives serve as per-detection markers. Their length equals the number of predicted boxes for the class, and the i-th predicted box corresponds to the i-th entry of true_positives / false_positives.
Below, the key segments are explained one by one.
First comes some pre-processing: the mAP computation does not need the notion of a batch, so torch.cat is used to merge everything into flat tensors, which simplifies the later computation (otherwise an extra loop over batches would be needed):
true_images = list()
for i in range(len(true_labels)):
true_images.extend([i] * true_labels[i].size(0))
true_images = torch.LongTensor(true_images).to(device) # (n_objects), n_objects: total num of objects across all images
true_boxes = torch.cat(true_boxes, dim=0) # (n_objects, 4)
true_labels = torch.cat(true_labels, dim=0) # (n_objects)
true_difficulties = torch.cat(true_difficulties, dim=0) # (n_objects)
assert true_images.size(0) == true_boxes.size(0) == true_labels.size(0)
# Store all detections in a single continuous tensor while keeping track of the image it is from
det_images = list()
for i in range(len(det_labels)):
det_images.extend([i] * det_labels[i].size(0))
det_images = torch.LongTensor(det_images).to(device) # (n_detections)
det_boxes = torch.cat(det_boxes, dim=0) # (n_detections, 4)
det_labels = torch.cat(det_labels, dim=0) # (n_detections)
det_scores = torch.cat(det_scores, dim=0) # (n_detections)
assert det_images.size(0) == det_boxes.size(0) == det_labels.size(0) == det_scores.size(0)
The APs are then computed for each class c in turn:
for c in range(1, n_classes):
true_positives and false_positives start out as all-zero tensors; entries that later get set to 1 mark that detection as a TP or an FP respectively, and entries left at 0 are simply neither.
n_class_detections is the number of boxes the model predicted for this class; the corresponding position is filled with 1 when a detection is judged to be that kind of positive, and stays 0 otherwise.
These two 1-D tensors are the bookkeeping used next to determine the TP and FP counts, making explicit what each predicted box is.
true_positives = torch.zeros((n_class_detections), dtype=torch.float).to(device) # (n_class_detections)
false_positives = torch.zeros((n_class_detections), dtype=torch.float).to(device) # (n_class_detections)
We then loop over every box the model predicted for class c and compare it against the true boxes one by one; this is what fills in the 1s in the two tensors above, i.e. decides whether each box is a TP or an FP. Boxes that are neither simply keep their default 0.
for d in range(n_class_detections):
original_ind is the index of the true box that matches this predicted box.
What is it for? It updates true_class_boxes_detected, i.e. marks that true box as claimed.
Why mark it? Because one true box may correspond to only one predicted box; every other prediction with high IoU to it is an FP. The model considers every box it outputs to be positive, but once the object has already been claimed, the later matches are false positives.
original_ind = torch.LongTensor(range(true_class_boxes.size(0)))[true_class_images == this_image][ind]
Remember: every box the model outputs is, in the model's eyes, positive, because when generating boxes we only keep what the model considers positive samples. Whether each one is a TP or an FP is what the following code decides.
A predicted box of class c whose IoU with the ground truth is below 0.5 is declared a false positive (FP).
Difficult samples are not counted.
If IoU > 0.5 but the matched true box has already been claimed by another predicted box, this box is also an FP: one object keeps only one box, the rest are redundant.
A box with IoU > 0.5 whose true box has not been claimed before becomes a TP.
if max_overlap.item() > 0.5:
# If the object it matched with is 'difficult', ignore it
if object_difficulties[ind] == 0:
# If this object has already not been detected, it's a true positive
if true_class_boxes_detected[original_ind] == 0: # if this true box has not been claimed yet
true_positives[d] = 1 # detection d is a TP
true_class_boxes_detected[original_ind] = 1 # this object has now been detected/accounted for; mark it so it cannot be matched again
# Otherwise, it's a false positive (since this object is already accounted for)
else:
false_positives[d] = 1 # the true box was already claimed; this extra prediction is an FP
# Otherwise, the detection occurs in a different location than the actual object, and is a false positive
else:
false_positives[d] = 1
When several of the model's predicted boxes match the same true box, the first match is counted as the TP and the later ones as FPs. Why not pick the one with the highest IoU?
Because among the boxes that all pass the IoU threshold, exactly one may claim the object, and for counting purposes any of them will do; we are computing a metric, not choosing which box is best. (Recall that we sorted the detections by confidence earlier, so in the code the claiming order is by descending confidence and has nothing to do with IoU; that same score order is also what the cumulative precision/recall below are computed in.)
Having filled true_positives and false_positives for this class, the code next takes their cumulative sums (torch.cumsum), giving at every position in the score-sorted list the number of TPs and FPs accumulated so far, from which the cumulative precision and recall are computed.
Then comes the following:
recall_thresholds = torch.arange(start=0, end=1.1, step=.1).tolist() # (11)
precisions = torch.zeros((len(recall_thresholds)), dtype=torch.float).to(device) # (11)
for i, t in enumerate(recall_thresholds):
recalls_above_t = cumul_recall >= t
if recalls_above_t.any():
precisions[i] = cumul_precision[recalls_above_t].max()
else:
precisions[i] = 0.
average_precisions[c - 1] = precisions.mean()
The above computes the AP of a single class c: recall is divided into thresholds, and the area under the P-R curve is approximated by taking, for each recall threshold, the maximum precision achieved at or beyond it (the 11-point interpolation).
With that, average_precisions (the AP) for a single class is done.
Next, the APs of all classes are averaged to obtain the mAP:
mean_average_precision = average_precisions.mean().item()
# Keep class-wise average precisions in a dictionary
average_precisions = {
rev_label_map[c + 1]: v for c, v in enumerate(average_precisions.tolist())}
return average_precisions, mean_average_precision
The first return value is a dictionary holding each class label and its AP; the second is the mAP.
At last, we have what we wanted.
When SSD maps the original image to feature maps, it obtains 6 feature maps through successive convolutions, and because of the convolutions these feature maps get progressively smaller. The prior-box scale assigned to the cells differs per feature map: large feature maps get small priors (small receptive field) and small feature maps get large priors (large receptive field). Within one feature map the scale is uniform, so only the aspect ratios of each cell's priors need adjusting. This gives a clear division of labor: large feature maps detect small objects and small feature maps detect large objects, which works better.
In this introductory project we used only a single feature map (easier to learn and understand); its cells carry priors of different scales and aspect ratios, so one feature map does the job of SSD's six. The chosen 7×7 feature map has a medium-to-large receptive field, so it detects medium and larger objects well, but its ability on small objects falls short, hence the unsatisfying results on 'bottle', 'pottedplant' and the like. SSD also uses a much larger feature map (38×38 in SSD300) whose cells have small receptive fields, which detects small objects better.
Supplement: receptive field
The receptive field is the region of the original image that a position on a feature map corresponds to. In the figure above, the original image passes through two convolutions, producing the green and then the orange feature map. The orange and the green feature map cover the same overall region of the input (blue), but the green one is divided into more cells and the orange one into fewer, so each cell of the green feature map has a small receptive field (good for detecting small objects) while each cell of the orange feature map has a large one (good for detecting large objects).
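As a rough sketch (assuming a plain stack of convolution layers), the receptive field of the last layer can be computed recursively from the kernel sizes and strides:

def receptive_field(kernels, strides):
    # r_l = r_{l-1} + (k_l - 1) * (product of the strides of all earlier layers)
    r, jump = 1, 1
    for k, s in zip(kernels, strides):
        r += (k - 1) * jump
        jump *= s
    return r

# e.g. two 3x3 convs with stride 2 each: every output cell sees a 7x7 patch of the input
print(receptive_field([3, 3], [2, 2]))   # 7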
1. Comparing with SSD shows that the scheme we studied here uses a single feature map to detect objects of every scale; a natural improvement is therefore to add more feature maps and detect at separate scales.
2. Adjust the loss design. In the image I obtained earlier, one dog was missed. After further tweaking the detection confidence threshold and the IoU threshold in NMS, I found that the box is actually there and correctly covers the dog, but its confidence seems a bit low.
So one could start from the loss function and increase the weight of the confidence loss (the original code defaults to alpha=1).
3. More data augmentation. The dog was clearly missed because dogs like it are underrepresented in the training data (hence the low confidence); better data preprocessing/augmentation would improve generalization.
4. Change the feature-extraction backbone. Pooling has drawbacks: it throws away a lot of image information, and it can be replaced by convolutions with a suitable stride. Switching to something like a ResNet would likely bring a decent improvement.
5. The parameters of the final detection stage can also be tuned to find the optimal settings.
These are the ideas I can think of for now; whether they actually help can only be judged by training and looking at the real results.
Finally, the road of optimization is a long one~