DETR paper notes and related code reading
Abstract: We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components such as non-maximum suppression (NMS) or anchor generation, which explicitly encode prior knowledge about the task. The main ingredients of the new framework, called DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. Unlike many other detectors, the new model is conceptually simple and does not require a specialized library. On the challenging COCO object detection dataset, DETR demonstrates accuracy and run-time performance on par with the well-established and highly optimized Faster R-CNN baseline. Moreover, DETR can easily be generalized to produce panoptic segmentation in a unified manner, where we show it significantly outperforms competitive baselines.
The goal of object detection is to predict a set of bounding boxes and category labels for each object of interest. Modern detectors address this set prediction task in an indirect way, by defining surrogate regression and classification problems on a large set of proposals [37,5], anchors [23], or window centers [53,46]. Their performance is significantly influenced by the postprocessing steps that collapse near-duplicate predictions, by the design of the anchor sets, and by the heuristics that assign target boxes to anchors [52]. To simplify these pipelines, we propose a direct set prediction approach that bypasses the surrogate tasks. This end-to-end philosophy has led to significant advances in complex structured prediction tasks such as machine translation or speech recognition, but not yet in object detection: previous attempts [43,16,4,39] either add other forms of prior knowledge, or have not proven competitive with strong baselines on challenging benchmarks. This paper aims to bridge that gap.
We streamline training by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on transformers [47], a popular architecture for sequence prediction. The self-attention mechanism of transformers explicitly models all pairwise interactions between elements in a sequence, which makes these architectures particularly suitable for the specific constraints of set prediction, such as removing duplicate predictions.
Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once, and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as anchors or non-maximum suppression (NMS). Unlike most existing detection methods, DETR does not require any customized layers, and can therefore be reproduced easily in any framework that contains standard CNN and transformer classes.
Compared with most previous work on direct set prediction, the main feature of DETR is the conjunction of the bipartite matching loss and transformers with (non-autoregressive) parallel decoding [29,12,10,8]. In contrast, previous work focused on autoregressive decoding with RNNs [43,41,30,36,42]. Our matching loss function uniquely assigns a prediction to each ground-truth object and is invariant to permutations of the predicted objects, so we can emit them in parallel.
We evaluate DETR on one of the most popular object detection datasets, COCO [24], against a very competitive Faster R-CNN baseline [37]. Faster R-CNN has undergone many design iterations and its performance has improved greatly since the original publication. Our experiments show that our new model achieves comparable performance. More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer. It obtains, however, lower performance on small objects. We expect future work to improve this aspect, in the same way the development of FPN [22] did for Faster R-CNN.
The design ethos of DETR easily extends to more complex tasks. In our experiments, we show that a simple segmentation head trained on top of a pre-trained DETR outperforms competitive baselines on panoptic segmentation [19], a challenging pixel-level recognition task that has recently gained popularity.
The model first passes the image through a CNN to extract features, feeds those features into a transformer, and finally converts the transformer outputs into class and box predictions, producing a fixed number of predictions. The resulting detections are then matched against the ground truth with bipartite matching to compute the loss.
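A schematic sketch of that pipeline (not the authors' code; all names and shapes below are illustrative placeholders) might look like this:
# Illustrative pipeline sketch: CNN features -> transformer -> N (class, box) predictions.
def detr_pipeline(image, backbone, transformer, class_head, box_head, object_queries):
    features = backbone(image)                  # e.g. (2048, H/32, W/32) feature map
    hs = transformer(features, object_queries)  # N decoded embeddings, one per object query
    logits = class_head(hs)                     # N class predictions (incl. a "no object" class)
    boxes = box_head(hs)                        # N normalized (cx, cy, w, h) boxes
    return logits, boxes                        # matched to GT via bipartite matching to compute the loss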
Our work builds on prior work in several domains: bipartite matching losses for set prediction, encoder-decoder architectures based on the transformer, parallel decoding, and object detection methods.
There is no canonical deep learning model to directly predict sets. The basic set prediction task is multilabel classification (see e.g. [40,33] for references in the context of computer vision), for which the baseline one-vs-rest approach does not apply to problems such as detection where there is an underlying structure between elements (i.e. near-identical boxes). The first difficulty in these tasks is to avoid near-duplicates. Most current detectors use postprocessing such as non-maximum suppression to address this issue, whereas direct set prediction is postprocessing-free. It requires a global inference scheme that models interactions between all predicted elements to avoid redundancy. For constant-size set prediction, dense fully connected networks [9] are sufficient but costly. A general approach is to use autoregressive sequence models such as recurrent neural networks [48]. In all cases, the loss function should be invariant to a permutation of the predictions. The usual solution is to design a loss based on the Hungarian algorithm [20], which finds a bipartite matching between ground truth and predictions. This enforces permutation invariance and guarantees that each target element has a unique match. We follow the bipartite matching loss approach. In contrast to most prior work, however, we step away from autoregressive models and use transformers with parallel decoding, which we describe below.
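To make the bipartite matching concrete, the sketch below builds a small illustrative cost matrix between predictions and ground-truth objects and solves the assignment with scipy.optimize.linear_sum_assignment, the same solver DETR's matcher uses later in this post; the numbers are made up for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: 4 predictions, columns: 2 ground-truth objects.
# Each entry is a matching cost (lower = better fit); values are illustrative only.
cost = np.array([
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.6],
    [0.5, 0.4],
])
pred_idx, gt_idx = linear_sum_assignment(cost)
# pred_idx -> gt_idx is the minimum-cost one-to-one assignment;
# the unmatched predictions are treated as "no object".
print(list(zip(pred_idx, gt_idx)))  # [(0, 1), (1, 0)]
In DETR the cost entries combine a classification term and box terms (L1 and generalized IoU), as the matcher code later shows.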
Vaswani et al. [47] introduced transformers as a new attention-based building block for machine translation. Attention mechanisms [2] are neural network layers that aggregate information from the entire input sequence. Transformers introduced self-attention layers, which, similarly to Non-Local Neural Networks [49], scan through each element of a sequence and update it by aggregating information from the whole sequence. One of the main advantages of attention-based models is their global computation and perfect memory, which makes them better suited than RNNs to long sequences. Transformers are now replacing RNNs in many problems in natural language processing, speech processing, and computer vision [8,27,45,34,31].
Transformers were first used in autoregressive models, following early sequence-to-sequence models [44], generating output tokens one by one. However, the prohibitive inference cost (proportional to the output length and hard to batch) led to the development of parallel sequence generation, in the domains of audio [29], machine translation [12,10], word representation learning [8], and more recently speech recognition [6]. We also combine transformers and parallel decoding for their suitable trade-off between computational cost and the ability to perform the global computations required for set prediction.
Set-based loss. Several object detectors [9,25,35] used the bipartite matching loss. However, in these early deep learning models, the relations between different predictions were modeled with convolutional or fully connected layers only, and a hand-designed NMS postprocessing could improve their performance. More recent detectors [37,23,53] use non-unique assignment rules between ground truth and predictions, together with an NMS.
Learnable NMS methods [16,4] and relation networks [17] explicitly model relations between different predictions. Using direct set losses, they do not require any postprocessing step. However, these methods employ additional hand-crafted context features, such as proposal box coordinates, to model relations between detections efficiently, whereas we look for solutions that reduce the prior knowledge encoded in the model.
Recurrent detectors. Closest to our approach are end-to-end set prediction for object detection [43] and instance segmentation [41,30,36,42]. Similarly to us, they use bipartite matching losses with encoder-decoder architectures based on CNN activations to directly produce a set of bounding boxes. These approaches, however, were only evaluated on small datasets and not against modern baselines. In particular, they are based on autoregressive models (more precisely RNNs), so they do not leverage the recent transformers with parallel decoding.
Two ingredients are essential for direct set prediction in detection: (1) a set prediction loss that forces unique matching between predicted and ground-truth boxes; (2) an architecture that predicts (in a single pass) a set of objects and models their relations. We describe our architecture in detail in Figure 2.
DETR infers a fixed-size set of N predictions in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score the predicted objects (class, position, size) with respect to the ground truth. Our loss produces an optimal bipartite matching between predicted and ground-truth objects, and then optimizes object-specific (bounding box) losses.
In practice, the set of GT boxes is first padded with ∅ (no object) to length N so that it can be matched against the network's N outputs; among the possible permutations, the one with the lowest matching cost against the N predictions is chosen and used for optimization. This yields a one-to-one assignment, so no NMS postprocessing is needed.
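For reference, the matching and the training loss from the paper can be written as follows: the optimal permutation $\hat{\sigma}$ minimizes the pairwise matching cost, and the Hungarian loss is then computed over the matched pairs,
$$\hat{\sigma} = \operatorname*{arg\,min}_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big), \qquad \mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \Big[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) \Big],$$
where $c_i$ and $b_i$ are the class and box of ground-truth object $i$ (possibly ∅), and $\mathcal{L}_{\mathrm{box}}$ combines the L1 and generalized IoU box losses.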
It contains three main components, which we describe below: a CNN backbone that extracts a compact feature representation, an encoder-decoder transformer, and a simple feed-forward network (FFN) that makes the final detection predictions.
The backbone is a conventional CNN used to extract a 2D representation of the image. The network first runs the image through the backbone (e.g. ResNet) to obtain features, which are then reduced in channel dimension and flattened to d×HW; the backbone maps [3, H, W] to [2048, H/32, W/32]. The prediction heads turn the decoder outputs into class and box predictions.
Transformer encoder. First, a 1x1 convolution reduces the channel dimension of the high-level activation map f from C to a smaller dimension d, creating a new feature map z0 ∈ R^{d×H×W}. The encoder expects a sequence as input, so we collapse the spatial dimensions of z0 into one dimension, resulting in a d×HW feature map. Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed-forward network (FFN). Since the transformer architecture is permutation-invariant, we supplement it with fixed positional encodings [31,3] that are added to the input of each attention layer. We defer the detailed definition of the architecture, which follows the one described in [47], to the supplementary material.
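As a minimal illustration of this input shaping (a sketch with made-up sizes, not the repository's code), the feature map can be projected and flattened like this:
import torch
from torch import nn

C, d, H, W = 2048, 256, 25, 34          # example sizes; H, W correspond to H0/32, W0/32
f = torch.randn(1, C, H, W)             # backbone activation map f
proj = nn.Conv2d(C, d, kernel_size=1)   # 1x1 convolution: C -> d channels
z0 = proj(f)                            # new feature map z0 of shape (1, d, H, W)
seq = z0.flatten(2).permute(2, 0, 1)    # collapse the spatial dims: (H*W, 1, d) token sequence for the encoder
print(seq.shape)                        # torch.Size([850, 1, 256])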
Transformer decoder. The decoder follows the standard architecture of the transformer, transforming N embeddings of size d using multi-headed self- and encoder-decoder attention mechanisms. The difference with the original transformer is that our model decodes the N objects in parallel at each decoder layer, while Vaswani et al. [47] use an autoregressive model that predicts the output sequence one element at a time. We refer readers unfamiliar with these concepts to the supplementary material. Since the decoder is also permutation-invariant, the N input embeddings must be different to produce different results. These input embeddings are learned positional encodings that we refer to as object queries, and similarly to the encoder, we add them to the input of each attention layer. The N object queries are transformed by the decoder into output embeddings. They are then independently decoded into box coordinates and class labels by a feed-forward network (described in the next subsection), resulting in N final predictions. Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together via pairwise relations between them, while being able to use the whole image as context.
The PyTorch inference code for DETR is only about 50 lines. Given a fixed set of learned object queries, DETR reasons about the relations between objects and the global image context to output the final set of predictions directly and in parallel. Because of this parallel nature, DETR is fast and efficient.
The transformer demo in PyTorch
This is a demo of DETR (Detection Transformer), with slight differences from the baseline model in the paper. We show how to define the model, load pretrained weights, and visualize bounding box and class predictions. The simplified DETR is built from pretrained weights and tested on a single image.
Thanks to the representational power of the transformer, the DETR architecture is very simple. There are two main components:
Backbone: ResNet-50
Transformer: the default PyTorch nn.Transformer
The model code is as follows:
'''
Simple DETR demo: object detection inference on a single image.
'''
from PIL import Image
import requests
import matplotlib.pyplot as plt
import torch
from torch import nn
from torchvision.models import resnet50
import torchvision.transforms as T
torch.set_grad_enabled(False) # inference only, no gradient backpropagation needed
# a simplified DETR implementation
class DETRdemo(nn.Module):
"""
Demo DETR implementation.
Demo implementation of DETR in minimal number of lines, with the
following differences wrt DETR in the paper:
* learned positional encoding (instead of sine)
* positional encoding is passed at input (instead of attention)
* fc bbox predictor (instead of MLP)
The model achieves ~40 AP on COCO val5k and runs at ~28 FPS on Tesla V100.
Only batch size 1 supported.
"""
def __init__(self, num_classes, hidden_dim=256, nheads=8, num_encoder_layers=6, num_decoder_layers=6):
# 6 encoder / 6 decoder layers; nheads is the number of attention heads
super().__init__()
# create ResNet-50 backbone
self.backbone = resnet50()
del self.backbone.fc
# create conversion layer (reduce the number of feature channels before the transformer)
self.conv = nn.Conv2d(2048, hidden_dim, 1) # 1x1 convolution: 2048 channels -> hidden_dim
# create a default PyTorch transformer
self.transformer = nn.Transformer(
hidden_dim, nheads, num_encoder_layers, num_decoder_layers)
# PyTorch splits the transformer into several modules: nn.TransformerEncoderLayer, nn.TransformerDecoderLayer, nn.LayerNorm, etc.
'''
Relevant PyTorch classes include:
1. torch.nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048,
dropout=0.1, activation='relu', custom_encoder=None, custom_decoder=None)
The transformer model, based on the paper Attention Is All You Need.
d_model - size of the encoder/decoder inputs (default 512)
nhead - number of heads in the multi-head attention models (default 8)
num_encoder_layers - number of sub-encoder layers in the encoder (default 6)
num_decoder_layers - number of sub-decoder layers in the decoder (default 6)
dim_feedforward - dimension of the feedforward network model (default 2048)
activation - activation function of the encoder/decoder intermediate layer, relu or gelu (default relu)
custom_encoder - custom encoder (default None)
custom_decoder - custom decoder (default None)
2. torch.nn.TransformerEncoder(encoder_layer, num_layers, norm=None)
TransformerEncoder is a stack of N encoder layers.
encoder_layer - an instance of TransformerEncoderLayer() (required)
num_layers - number of sub-encoder layers in the encoder (required)
3. torch.nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu')
TransformerEncoderLayer is made up of self-attention and a feedforward network.
d_model - expected feature size of the encoder/decoder inputs
nhead - number of heads in the multi-head attention models
dim_feedforward - dimension of the feedforward network model (default 2048)
activation - activation function of the intermediate layer, relu or gelu (default relu)
'''
# prediction heads, one extra class for predicting non-empty slots
# note that in baseline DETR linear_bbox layer is 3-layer MLP
# each bbox consists of the box center coordinates plus width and height, i.e. 4 values (x, y, w, h)
self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
self.linear_bbox = nn.Linear(hidden_dim, 4)
# output positional encodings (object queries)
# at most 100 objects are detected per image
self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
# spatial positional encodings
# note that in baseline DETR we use sine positional encodings
# positional encodings for the encoder: rows and columns are encoded separately and concatenated later, so each gets half the dimension
# the feature map is assumed to be at most 50x50 (at most 50 rows and 50 columns)
self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2)) # rows
self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2)) # columns
# nn.Parameter makes these tensors learnable parameters
def forward(self, inputs):
# propagate inputs through ResNet-50 up to avg-pool layer
# inputs: [3, H, W]
x = self.backbone.conv1(inputs) # ResNet-50 conv1: 7x7 kernel, stride 2, padding 3; 3 input channels -> 64 output channels, so the feature map becomes [64, H/2, W/2]
x = self.backbone.bn1(x)
x = self.backbone.relu(x)
x = self.backbone.maxpool(x) # the feature map is now 1/4 of the input size: [64, H/4, W/4]
x = self.backbone.layer1(x) # [256,H/4,W/4]
x = self.backbone.layer2(x) # [512,H/8,W/8]
x = self.backbone.layer3(x) # [1024,H/16,W/16]
x = self.backbone.layer4(x) # [2048,H/32,W/32]
# convert from 2048 to 256 feature planes for the transformer
h = self.conv(x) # [hidden_dim, H/32, W/32], the [d, H0/32, W0/32] of the paper
# construct positional encodings
H, W = h.shape[-2:] # spatial size of the feature map; tensors are laid out as (N, C, H, W)
pos = torch.cat([
self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1), # unsqueeze(0) adds a leading dim; repeat(H, 1, 1) tiles it -> [H, W, hidden_dim//2]
self.row_embed[:H].unsqueeze(1).repeat(1, W, 1), # [H, W, hidden_dim//2]
], dim=-1).flatten(0, 1).unsqueeze(1) # concatenate along the channel dim and flatten to (H*W, 1, hidden_dim), matching the d×HW map in the paper
# torch.cat concatenates tensors along the given dim; flatten(0, 1) merges dims 0 and 1; the 1 is the batch size
# propagate through the transformer
h = self.transformer(pos + 0.1 * h.flatten(2).permute(2, 0, 1),
self.query_pos.unsqueeze(1)).transpose(0, 1)
# query_pos is (100, 1, hidden_dim); the resulting h is (batch, 100, hidden_dim)
# finally project transformer outputs to class labels and bounding boxes
return {
'pred_logits': self.linear_class(h),
'pred_boxes': self.linear_bbox(h).sigmoid()}
Let's construct the model with the 80 COCO output classes + 1 "no object" class, and load the pretrained weights. The weights are saved in half precision to save bandwidth without hurting model accuracy.
Instantiate a model and load the pretrained weights:
detr = DETRdemo(num_classes=91)
state_dict = torch.hub.load_state_dict_from_url(
url='https://dl.fbaipublicfiles.com/detr/detr_demo-da2a99e9.pth',
map_location='cpu', check_hash=True) # 加载预训练模型
detr.load_state_dict(state_dict) # 模型加载到网络中
detr.eval() # 测试模型
The pretrained DETR model we just loaded was trained on the 80 COCO classes, with class indices ranging from 1 to 90 (which is why we allowed for 91 classes in the model construction). In the following cell, we define the mapping from class indices to names.
# COCO classes
CLASSES = [
'N/A', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack',
'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass',
'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich',
'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake',
'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table', 'N/A',
'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A',
'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier',
'toothbrush'
]
# colors for visualization
COLORS = [[0.000, 0.447, 0.741], [0.850, 0.325, 0.098], [0.929, 0.694, 0.125],
[0.494, 0.184, 0.556], [0.466, 0.674, 0.188], [0.301, 0.745, 0.933]]
DETR uses standard ImageNet normalization, and outputs boxes in relative image coordinates in [xcenter, ycenter, w, h] format, where [xcenter, ycenter] is the predicted center of the bounding box and w, h are its width and height. Because the coordinates are relative to the image dimensions and lie in [0, 1], we convert the predictions to absolute image coordinates and the [x0, y0, x1, y1] format for visualization purposes.
# standard PyTorch mean-std input image normalization 标准PyTorch均值输入图像归一化
transform = T.Compose([
T.Resize(800),
T.ToTensor(),
T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) # 前面的[0.485, 0.456, 0.406]是RGB三个通道上的均值mean, 后面[0.229, 0.224, 0.225]是三个通道的标准差std
]) # Compose的主要作用是将多个变换组合在一起。
# for output bounding box post-processing
# 将bbox的中心x、中心y坐标,高度宽度转换为左上角右下角坐标
def box_cxcywh_to_xyxy(x):
x_c, y_c, w, h = x.unbind(1) # torch.unbind(input, dim=0) → seq,此方法就是将我们的input从dim进行切片,并返回切片的结果,返回的结果里面没有dim这个维度。
b = [(x_c - 0.5 * w), (y_c - 0.5 * h),
(x_c + 0.5 * w), (y_c + 0.5 * h)]
return torch.stack(b, dim=1) # shape为[batch,4]
def rescale_bboxes(out_bbox, size):
img_w, img_h = size
b = box_cxcywh_to_xyxy(out_bbox)
b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32) # bbox归一化坐标转为基于图像尺寸的绝对坐标
return b
Wrap the whole inference pipeline, from input image to predictions, in a single function:
# Let's put everything together in a detect function:
def detect(im, model, transform):
# mean-std normalize the input image (batch-size: 1)
img = transform(im).unsqueeze(0)
# demo model only support by default images with aspect ratio between 0.5 and 2
# if you want to use images with an aspect ratio outside this range
# rescale your image so that the maximum size is at most 1333 for best results
assert img.shape[-2] <= 1600 and img.shape[-1] <= 1600, 'demo model only supports images up to 1600 pixels on each side'
# propagate through the model
outputs = model(img)
# keep only predictions with 0.7+ confidence
probas = outputs['pred_logits'].softmax(-1)[0, :, :-1]
keep = probas.max(-1).values > 0.7 # 只取置信度大于0.7的
# convert boxes from [0; 1] to image scales
bboxes_scaled = rescale_bboxes(outputs['pred_boxes'][0, keep], im.size)
return probas[keep], bboxes_scaled
Test on a single image:
# To try the DETRdemo model on your own image, just change the URL below.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
im = Image.open(requests.get(url, stream=True).raw)
scores, boxes = detect(im, detr, transform)
Visualize the results:
# Let's now visualize the model predictions
def plot_results(pil_img, prob, boxes):
plt.figure(figsize=(16, 10))
plt.imshow(pil_img)
ax = plt.gca()
for p, (xmin, ymin, xmax, ymax), c in zip(prob, boxes.tolist(), COLORS * 100):
ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
fill=False, color=c, linewidth=3))
cl = p.argmax()
text = f'{CLASSES[cl]}: {p[cl]:0.2f}'
ax.text(xmin, ymin, text, fontsize=15,
bbox=dict(facecolor='yellow', alpha=0.5))
plt.axis('off')
plt.show()
plot_results(im, scores, boxes)
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Backbone modules.
"""
from collections import OrderedDict
import torch
import torch.nn.functional as F
import torchvision
from torch import nn
from torchvision.models._utils import IntermediateLayerGetter
from typing import Dict, List
from util.misc import NestedTensor, is_main_process
from .position_encoding import build_position_encoding
class FrozenBatchNorm2d(torch.nn.Module):
"""
BatchNorm2d where the batch statistics and the affine parameters are fixed.
Copy-paste from torchvision.misc.ops with added eps before rqsrt,
without which any other models than torchvision.models.resnet[18,34,50,101]
produce nans.
"""
def __init__(self, n):
super(FrozenBatchNorm2d, self).__init__()
self.register_buffer("weight", torch.ones(n))
self.register_buffer("bias", torch.zeros(n))
self.register_buffer("running_mean", torch.zeros(n))
self.register_buffer("running_var", torch.ones(n))
# Register these four tensors as buffers so that backprop never updates them. Buffers are meant for state that should not be treated as model parameters (e.g. running mean/variance statistics), yet they are still written to and read from the state_dict when saving and loading.
def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
missing_keys, unexpected_keys, error_msgs):
num_batches_tracked_key = prefix + 'num_batches_tracked'
if num_batches_tracked_key in state_dict:
del state_dict[num_batches_tracked_key]
super(FrozenBatchNorm2d, self)._load_from_state_dict(
state_dict, prefix, local_metadata, strict,
missing_keys, unexpected_keys, error_msgs)
def forward(self, x):
# move reshapes to the beginning
# to make it fuser-friendly
w = self.weight.reshape(1, -1, 1, 1)
b = self.bias.reshape(1, -1, 1, 1)
rv = self.running_var.reshape(1, -1, 1, 1)
rm = self.running_mean.reshape(1, -1, 1, 1)
eps = 1e-5
scale = w * (rv + eps).rsqrt()
bias = b - rm * scale
return x * scale + bias
class BackboneBase(nn.Module):
def __init__(self, backbone: nn.Module, train_backbone: bool, num_channels: int, return_interm_layers: bool):
super().__init__()
for name, parameter in backbone.named_parameters():
if not train_backbone or 'layer2' not in name and 'layer3' not in name and 'layer4' not in name:
parameter.requires_grad_(False)
if return_interm_layers: # True的时候,记录每一层(Resnet的layer)的输出
return_layers = {
"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
else:
return_layers = {
'layer4': "0"}
self.body = IntermediateLayerGetter(backbone, return_layers=return_layers)
# IntermediateLayerGetter inherits from nn.ModuleDict and takes an nn.Module plus a dict: the dict keys name submodules of the nn.Module, and the values are user-defined names for the corresponding outputs
# its output is a dict containing the output of every layer listed in return_layers
self.num_channels = num_channels
def forward(self, tensor_list: NestedTensor):
xs = self.body(tensor_list.tensors)
out: Dict[str, NestedTensor] = {
}
for name, x in xs.items():
m = tensor_list.mask
assert m is not None
# interpolate the mask down to the spatial size of this feature map
mask = F.interpolate(m[None].float(), size=x.shape[-2:]).to(torch.bool)[0]
out[name] = NestedTensor(x, mask) # bundle the feature map and its mask together
return out
class Backbone(BackboneBase):
"""ResNet backbone with frozen BatchNorm."""
def __init__(self, name: str, # backbone采用的模型名字如resnet50
train_backbone: bool,
return_interm_layers: bool,
dilation: bool):
backbone = getattr(torchvision.models, name)(
replace_stride_with_dilation=[False, False, dilation],
pretrained=is_main_process(), norm_layer=FrozenBatchNorm2d)
num_channels = 512 if name in ('resnet18', 'resnet34') else 2048
super().__init__(backbone, train_backbone, num_channels, return_interm_layers)
class Joiner(nn.Sequential): # 把backbone和position_embedding集合到一起
def __init__(self, backbone, position_embedding):
super().__init__(backbone, position_embedding)
# self[0]是backbone,self[1]是position_embedding
def forward(self, tensor_list: NestedTensor):
xs = self[0](tensor_list) # backbone的输出
out: List[NestedTensor] = []
pos = []
for name, x in xs.items():
out.append(x)
# position encoding
pos.append(self[1](x).to(x.tensors.dtype))
return out, pos
def build_backbone(args):
position_embedding = build_position_encoding(args)
train_backbone = args.lr_backbone > 0
return_interm_layers = args.masks
backbone = Backbone(args.backbone, train_backbone, return_interm_layers, args.dilation)
model = Joiner(backbone, position_embedding)
model.num_channels = backbone.num_channels
return model
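As a quick illustration of how torchvision's IntermediateLayerGetter behaves (a sketch, not part of the repository; it only assumes torchvision's resnet50 is available), the snippet below collects the intermediate ResNet outputs under custom names, exactly as BackboneBase does above:
import torch
import torchvision
from torchvision.models._utils import IntermediateLayerGetter

# Wrap a ResNet-50 and collect the outputs of layer1..layer4 under custom names.
resnet = torchvision.models.resnet50()
body = IntermediateLayerGetter(resnet, return_layers={"layer1": "0", "layer2": "1",
                                                      "layer3": "2", "layer4": "3"})
feats = body(torch.randn(1, 3, 224, 224))   # OrderedDict of name -> feature map
for name, f in feats.items():
    print(name, tuple(f.shape))
# expected shapes: (1, 256, 56, 56), (1, 512, 28, 28), (1, 1024, 14, 14), (1, 2048, 7, 7)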
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Various positional encodings for the transformer.
"""
import math
import torch
from torch import nn
from util.misc import NestedTensor
class PositionEmbeddingSine(nn.Module):
"""
This is a more standard version of the position embedding, very similar to the one
used by the Attention is all you need paper, generalized to work on images.
"""
def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
super().__init__()
self.num_pos_feats = num_pos_feats
self.temperature = temperature
self.normalize = normalize
if scale is not None and normalize is False:
raise ValueError("normalize should be True if scale is passed")
# 角度范围在0-2π
if scale is None:
scale = 2 * math.pi
self.scale = scale
def forward(self, tensor_list: NestedTensor):
x = tensor_list.tensors # (b, c, h, w)
mask = tensor_list.mask # (b, h, w); indicates which positions come from padding: the region covered by the (resized) original image, anchored at the top-left corner, is False, everything else is True
assert mask is not None
not_mask = ~mask # True where the image is real (not padding)
y_embed = not_mask.cumsum(1, dtype=torch.float32) # cumulative sum along dim 1 (down the columns, the y direction)
x_embed = not_mask.cumsum(2, dtype=torch.float32) # cumulative sum along dim 2 (across the rows, the x direction)
if self.normalize:
eps = 1e-6
y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale
dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)
dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)
pos_x = x_embed[:, :, :, None] / dim_t
pos_y = y_embed[:, :, :, None] / dim_t
pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3) # 对应论文中的位置编码公式
pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
return pos
class PositionEmbeddingLearned(nn.Module):
"""
Absolute pos embedding, learned.
"""
def __init__(self, num_pos_feats=256):
super().__init__()
self.row_embed = nn.Embedding(50, num_pos_feats)
self.col_embed = nn.Embedding(50, num_pos_feats)
self.reset_parameters()
def reset_parameters(self):
nn.init.uniform_(self.row_embed.weight)
nn.init.uniform_(self.col_embed.weight)
def forward(self, tensor_list: NestedTensor):
x = tensor_list.tensors
h, w = x.shape[-2:]
i = torch.arange(w, device=x.device)
j = torch.arange(h, device=x.device)
x_emb = self.col_embed(i)
y_emb = self.row_embed(j)
pos = torch.cat([
x_emb.unsqueeze(0).repeat(h, 1, 1),
y_emb.unsqueeze(1).repeat(1, w, 1),
], dim=-1).permute(2, 0, 1).unsqueeze(0).repeat(x.shape[0], 1, 1, 1)
return pos
def build_position_encoding(args):
N_steps = args.hidden_dim // 2
if args.position_embedding in ('v2', 'sine'):
# TODO find a better way of exposing other arguments
position_embedding = PositionEmbeddingSine(N_steps, normalize=True)
elif args.position_embedding in ('v3', 'learned'):
position_embedding = PositionEmbeddingLearned(N_steps)
else:
raise ValueError(f"not supported {args.position_embedding}")
return position_embedding
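For reference, PositionEmbeddingSine generalizes the 1D sine encoding of [47] to the two image axes. The 1D formula is
$$PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d_{\text{model}}}\big), \qquad PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d_{\text{model}}}\big),$$
and the code above applies it separately to the (optionally normalized) cumulative x and y positions, then concatenates the two halves along the channel dimension to obtain the full d-dimensional encoding.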
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DETR Transformer class.
Copy-paste from torch.nn.Transformer with modifications: 从torch.nn.Transformer复制粘贴并进行以下修改:
* positional encodings are passed in MHattention
* extra LN at the end of encoder is removed
* decoder returns a stack of activations from all decoding layers
"""
import copy
from typing import Optional, List
import torch
import torch.nn.functional as F
from torch import nn, Tensor
class Transformer(nn.Module):
def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
activation="relu", normalize_before=False,
return_intermediate_dec=False):
super().__init__()
# a single encoder layer
# with post-normalization, each encoder layer already normalizes its output, so the final encoder output needs no extra normalization and encoder_norm is set to None
encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
dropout, activation, normalize_before)
encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
# the full encoder is a stack of num_encoder_layers (6) encoder layers
self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
# a single decoder layer
decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
dropout, activation, normalize_before)
decoder_norm = nn.LayerNorm(d_model)
# the full decoder is a stack of num_decoder_layers (6) decoder layers
self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm,
return_intermediate=return_intermediate_dec)
self._reset_parameters()
self.d_model = d_model
self.nhead = nhead
def _reset_parameters(self):
for p in self.parameters():
if p.dim() > 1:
nn.init.xavier_uniform_(p)
def forward(self, src, mask, query_embed, pos_embed):
# flatten NxCxHxW to HWxNxC
bs, c, h, w = src.shape # batch size, channels, height, width of the feature map fed into the transformer after the backbone
src = src.flatten(2).permute(2, 0, 1) # reshape to [H*W, N, C]
pos_embed = pos_embed.flatten(2).permute(2, 0, 1) # pos_embed also becomes [H*W, N, C]
query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)
mask = mask.flatten(1) # mask becomes [N, H*W]
tgt = torch.zeros_like(query_embed) # decoder input for the object queries, initialized to zero
memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
pos=pos_embed, query_pos=query_embed)
return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)
class TransformerEncoder(nn.Module):
def __init__(self, encoder_layer, num_layers, norm=None):
super().__init__()
self.layers = _get_clones(encoder_layer, num_layers) # _get_clones对相同结构的模块进行复制,返回一个nn.ModuleList实例
self.num_layers = num_layers
self.norm = norm
def forward(self, src,
mask: Optional[Tensor] = None,
src_key_padding_mask: Optional[Tensor] = None,
pos: Optional[Tensor] = None):
# src is the backbone feature map projected to hidden_dim, shape [H*W, b, hidden_dim]
# pos is the positional encoding of the last backbone feature map, shape [H*W, b, c]
# src_key_padding_mask is the mask of the last backbone feature map, shape [b, H*W]
output = src
for layer in self.layers:
output = layer(output, src_mask=mask,
src_key_padding_mask=src_key_padding_mask, pos=pos)
if self.norm is not None:
output = self.norm(output)
return output
class TransformerDecoder(nn.Module):
def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):
super().__init__()
self.layers = _get_clones(decoder_layer, num_layers)
self.num_layers = num_layers
self.norm = norm
# 是否需要记录中间层的结果
self.return_intermediate = return_intermediate
def forward(self, tgt, memory,
tgt_mask: Optional[Tensor] = None,
memory_mask: Optional[Tensor] = None,
tgt_key_padding_mask: Optional[Tensor] = None,
memory_key_padding_mask: Optional[Tensor] = None,
pos: Optional[Tensor] = None,
query_pos: Optional[Tensor] = None):
# tgt is the query embedding, shape (num_queries, b, hidden_dim); num_queries is the maximum number of objects per image, 100 by default
# query_pos is the positional encoding of tgt
# memory is the encoder output, shape (h*w, b, hidden_dim)
# memory_key_padding_mask corresponds to the encoder's src_key_padding_mask
# pos is the positional encoding fed to the encoder, i.e. the encoding of memory
output = tgt
intermediate = []
for layer in self.layers:
output = layer(output, memory, tgt_mask=tgt_mask,
memory_mask=memory_mask,
tgt_key_padding_mask=tgt_key_padding_mask,
memory_key_padding_mask=memory_key_padding_mask,
pos=pos, query_pos=query_pos)
if self.return_intermediate:
intermediate.append(self.norm(output))
if self.norm is not None:
output = self.norm(output)
if self.return_intermediate:
intermediate.pop()
intermediate.append(output)
if self.return_intermediate:
return torch.stack(intermediate)
return output.unsqueeze(0)
class TransformerEncoderLayer(nn.Module):
# 由多头注意力机制、前向反馈FFN构成,包含归一化层、激活层、Dropout层、残差连接层
def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
activation="relu", normalize_before=False):
super().__init__()
# 多头自注意力层
self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout) # d_model每一个单词本来的词向量长度
# Implementation of Feedforward model
# 前向反馈层FFN
self.linear1 = nn.Linear(d_model, dim_feedforward)
self.dropout = nn.Dropout(dropout)
self.linear2 = nn.Linear(dim_feedforward, d_model)
# 分别用于多头自注意力层和前向反馈层
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(dropout)
self.activation = _get_activation_fn(activation)
# _get_activation_fn根据输入参数指定的激活方式返回激活层
# 是否在输入多头自注意力层/前向反馈层前归一化
self.normalize_before = normalize_before
def with_pos_embed(self, tensor, pos: Optional[Tensor]):
return tensor if pos is None else tensor + pos
def forward_post(self,
src,
src_mask: Optional[Tensor] = None,
src_key_padding_mask: Optional[Tensor] = None,
pos: Optional[Tensor] = None):
# 后进行归一化
q = k = self.with_pos_embed(src, pos)
src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
key_padding_mask=src_key_padding_mask)[0]
src = src + self.dropout1(src2)
src = self.norm1(src)
src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
src = src + self.dropout2(src2)
src = self.norm2(src)
return src
def forward_pre(self, src,
src_mask: Optional[Tensor] = None, # attention mask that hides later positions so no attention weight is computed for them (not used in DETR's encoder)
src_key_padding_mask: Optional[Tensor] = None, # mask of the last backbone feature map; True marks padded pixels of the original image, which are filled with -inf when computing attention
pos: Optional[Tensor] = None):
src2 = self.norm1(src) # 归一化
q = k = self.with_pos_embed(src2, pos) # 位置编码
# 多头自注意力层
src2 = self.self_attn(q, k, value=src2, attn_mask=src_mask,
key_padding_mask=src_key_padding_mask)[0] # 索引[0]为自注意力层的输出,[1]为自注意力权重
src = src + self.dropout1(src2)
src2 = self.norm2(src)
# FCN
src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))
src = src + self.dropout2(src2)
return src
def forward(self, src,
src_mask: Optional[Tensor] = None,
src_key_padding_mask: Optional[Tensor] = None,
pos: Optional[Tensor] = None):
if self.normalize_before: # 在输入多头自注意力层和前向反馈层前进行归一化
return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
return self.forward_post(src, src_mask, src_key_padding_mask, pos) # 在输入多头自注意力层和前向反馈层之后再进行归一化
class TransformerDecoderLayer(nn.Module):
def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
activation="relu", normalize_before=False):
super().__init__()
# 多头自注意力层
self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
# Encoder-Decoder层
self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
# Implementation of Feedforward model
# FFN
self.linear1 = nn.Linear(d_model, dim_feedforward)
self.dropout = nn.Dropout(dropout)
self.linear2 = nn.Linear(dim_feedforward, d_model)
# 分别用于以上3层
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.norm3 = nn.LayerNorm(d_model)
self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(dropout)
self.dropout3 = nn.Dropout(dropout)
self.activation = _get_activation_fn(activation)
self.normalize_before = normalize_before
def with_pos_embed(self, tensor, pos: Optional[Tensor]):
return tensor if pos is None else tensor + pos
def forward_post(self, tgt, memory,
tgt_mask: Optional[Tensor] = None,
memory_mask: Optional[Tensor] = None,
tgt_key_padding_mask: Optional[Tensor] = None,
memory_key_padding_mask: Optional[Tensor] = None,
pos: Optional[Tensor] = None,
query_pos: Optional[Tensor] = None):
# 位置嵌入
q = k = self.with_pos_embed(tgt, query_pos)
# 多头自注意力层,输入与encoder无关
tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,
key_padding_mask=tgt_key_padding_mask)[0]
tgt = tgt + self.dropout1(tgt2)
tgt = self.norm1(tgt)
# Encoder-Decoder层,key和value来自encoder,query来自上一层
tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),
key=self.with_pos_embed(memory, pos),
value=memory, attn_mask=memory_mask,
key_padding_mask=memory_key_padding_mask)[0]
tgt = tgt + self.dropout2(tgt2)
tgt = self.norm2(tgt)
# FFN
tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
tgt = tgt + self.dropout3(tgt2)
tgt = self.norm3(tgt)
return tgt
def forward_pre(self, tgt, memory,
tgt_mask: Optional[Tensor] = None,
memory_mask: Optional[Tensor] = None,
tgt_key_padding_mask: Optional[Tensor] = None,
memory_key_padding_mask: Optional[Tensor] = None,
pos: Optional[Tensor] = None,
query_pos: Optional[Tensor] = None):
# 位置嵌入
tgt2 = self.norm1(tgt)
q = k = self.with_pos_embed(tgt2, query_pos)
# 多头自注意力层,输入与encoder无关
tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,
key_padding_mask=tgt_key_padding_mask)[0]
tgt = tgt + self.dropout1(tgt2)
# Encoder-Decoder层,key和value来自encoder,query来自上一层
tgt2 = self.norm2(tgt)
tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),
key=self.with_pos_embed(memory, pos),
value=memory, attn_mask=memory_mask,
key_padding_mask=memory_key_padding_mask)[0]
tgt = tgt + self.dropout2(tgt2)
# FFN
tgt2 = self.norm3(tgt)
tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
tgt = tgt + self.dropout3(tgt2)
return tgt
def forward(self, tgt, memory,
tgt_mask: Optional[Tensor] = None,
memory_mask: Optional[Tensor] = None,
tgt_key_padding_mask: Optional[Tensor] = None,
memory_key_padding_mask: Optional[Tensor] = None,
pos: Optional[Tensor] = None,
query_pos: Optional[Tensor] = None):
if self.normalize_before:
return self.forward_pre(tgt, memory, tgt_mask, memory_mask,
tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
return self.forward_post(tgt, memory, tgt_mask, memory_mask,
tgt_key_padding_mask, memory_key_padding_mask, pos, query_pos)
def _get_clones(module, N):
return nn.ModuleList([copy.deepcopy(module) for i in range(N)])
def build_transformer(args):
return Transformer(
d_model=args.hidden_dim,
dropout=args.dropout,
nhead=args.nheads,
dim_feedforward=args.dim_feedforward,
num_encoder_layers=args.enc_layers,
num_decoder_layers=args.dec_layers,
normalize_before=args.pre_norm,
return_intermediate_dec=True,
)
def _get_activation_fn(activation):
"""Return an activation function given a string"""
if activation == "relu":
return F.relu
if activation == "gelu":
return F.gelu
if activation == "glu":
return F.glu
raise RuntimeError(F"activation should be relu/gelu, not {activation}.")
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Modules to compute the matching cost and solve the corresponding LSAP.
将预测结果和GT进行匹配,匹配采用匈牙利算法,参见https://zhuanlan.zhihu.com/p/96229700
"""
import torch
from scipy.optimize import linear_sum_assignment
from torch import nn
from util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou
class HungarianMatcher(nn.Module):
"""This class computes an assignment between the targets and the predictions of the network
For efficiency reasons, the targets don't include the no_object. Because of this, in general,
there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
while the others are un-matched (and thus treated as non-objects).
"""
def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
"""Creates the matcher
Params:
cost_class: This is the relative weight of the classification error in the matching cost
cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
"""
super().__init__()
self.cost_class = cost_class # 分类损失的权重
self.cost_bbox = cost_bbox # bbox损失的权重
self.cost_giou = cost_giou
assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs cant be 0"
@torch.no_grad()
def forward(self, outputs, targets):
""" Performs the matching
Params:
outputs: This is a dict that contains at least these entries:
"pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits 预测的分类
"pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates 预测的bbox的x,y,h,w
targets: GT
This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
"labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
objects in the target) containing the class labels
"boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates
Returns:
A list of size batch_size, containing tuples of (index_i, index_j) where:
- index_i is the indices of the selected predictions (in order)
- index_j is the indices of the corresponding selected targets (in order)
For each batch element, it holds:
len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
"""
bs, num_queries = outputs["pred_logits"].shape[:2]
# We flatten to compute the cost matrices in a batch
out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1) # [batch_size * num_queries, num_classes]
out_bbox = outputs["pred_boxes"].flatten(0, 1) # [batch_size * num_queries, 4]
# Also concat the target labels and boxes
tgt_ids = torch.cat([v["labels"] for v in targets])
tgt_bbox = torch.cat([v["boxes"] for v in targets])
# Compute the classification cost. Contrary to the loss, we don't use the NLL,
# but approximate it in 1 - proba[target class].
# The 1 is a constant that doesn't change the matching, it can be ommitted.
cost_class = -out_prob[:, tgt_ids]
# Compute the L1 cost between boxes
cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)
# Compute the giou cost betwen boxes
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
# Final cost matrix
C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
C = C.view(bs, num_queries, -1).cpu()
sizes = [len(v["boxes"]) for v in targets]
indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))] # linear_sum_assignment匹配方法
return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]
def build_matcher(args):
return HungarianMatcher(cost_class=args.set_cost_class, cost_bbox=args.set_cost_bbox, cost_giou=args.set_cost_giou)
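A toy illustration of the matcher follows (a sketch; it assumes this module's HungarianMatcher and the util.box_ops helpers it imports are available; all tensors are random, and the cost weights simply mirror typical settings):
import torch

matcher = HungarianMatcher(cost_class=1, cost_bbox=5, cost_giou=2)
outputs = {
    "pred_logits": torch.randn(2, 100, 92),   # (batch, num_queries, num_classes + 1)
    "pred_boxes": torch.rand(2, 100, 4),      # normalized (cx, cy, w, h)
}
targets = [
    {"labels": torch.tensor([1, 7]), "boxes": torch.rand(2, 4)},  # image 0: 2 GT boxes
    {"labels": torch.tensor([3]),    "boxes": torch.rand(1, 4)},  # image 1: 1 GT box
]
indices = matcher(outputs, targets)
# One (pred_idx, gt_idx) pair of index tensors per image;
# len(pred_idx) equals the number of GT boxes in that image.
print(indices)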
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
DETR model and criterion classes.
"""
import torch
import torch.nn.functional as F
from torch import nn
from util import box_ops
from util.misc import (NestedTensor, nested_tensor_from_tensor_list,
accuracy, get_world_size, interpolate,
is_dist_avail_and_initialized)
from .backbone import build_backbone
from .matcher import build_matcher
from .segmentation import (DETRsegm, PostProcessPanoptic, PostProcessSegm,
dice_loss, sigmoid_focal_loss)
from .transformer import build_transformer
class DETR(nn.Module):
""" This is the DETR module that performs object detection """
def __init__(self, backbone, transformer, num_classes, num_queries, aux_loss=False):
""" Initializes the model.
Parameters:
backbone: torch module of the backbone to be used. See backbone.py
transformer: torch module of the transformer architecture. See transformer.py
num_classes: number of object classes (not counting the extra background / no-object class)
num_queries: number of object queries, ie detection slot. This is the maximal number of objects
DETR can detect in a single image. For COCO, we recommend 100 queries.
aux_loss: True if auxiliary decoding losses (a loss at each decoder layer) are to be used.
"""
super().__init__()
self.num_queries = num_queries
self.transformer = transformer
hidden_dim = transformer.d_model
self.class_embed = nn.Linear(hidden_dim, num_classes + 1) # 生成分类的预测结果,1是背景
# 生成回归的预测结果。MLP多层感知机由多层nn.Linear()组成,中间层维度映射到hidden_dim,第3层映射到4,一共3层
self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
self.query_embed = nn.Embedding(num_queries, hidden_dim) # Transformer中初始化query并对其编码生成嵌入
self.input_proj = nn.Conv2d(backbone.num_channels, hidden_dim, kernel_size=1) # CNN提取的特征维度映射到Transformer隐藏层的维度,转化为序列
self.backbone = backbone
self.aux_loss = aux_loss
def forward(self, samples: NestedTensor):
""" The forward expects a NestedTensor, which consists of:将图像张量和对应的mask封装到一起
- samples.tensor: batched images, of shape [batch_size x 3 x H x W]
- samples.mask: a binary mask of shape [batch_size x H x W], containing 1 on padded pixels
每个batch大小填充到一样大,填充的位置mask为1,后面不算这些位置的自注意力
It returns a dict with the following elements:
- "pred_logits": the classification logits (including no-object) for all queries.分类结果
Shape= [batch_size x num_queries x (num_classes + 1)]
- "pred_boxes": The normalized boxes coordinates for all queries, represented as
(center_x, center_y, height, width). These values are normalized in [0, 1],
relative to the size of each individual image (disregarding possible padding).
See PostProcess for information on how to retrieve the unnormalized bounding box.
- "aux_outputs": Optional, only returned when auxilary losses are activated. It is a list of
dictionnaries containing the two above keys for each decoder layer.
"""
# 将样本转为NestedTensor。isinstance() 函数来判断一个对象是否是一个已知的类型
if isinstance(samples, (list, torch.Tensor)):
samples = nested_tensor_from_tensor_list(samples)
# run the CNN backbone to extract features; pos is the positional encoding
features, pos = self.backbone(samples)
# take the last-layer feature map and its mask
src, mask = features[-1].decompose()
assert mask is not None
# the transformer returns a tuple (decoder output, encoder output); take the first element, the decoder output
hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
outputs_class = self.class_embed(hs) # classification
outputs_coord = self.bbox_embed(hs).sigmoid() # box regression
# hs contains the output of every decoder layer; index -1 is the last layer
out = {
'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
if self.aux_loss:
out['aux_outputs'] = self._set_aux_loss(outputs_class, outputs_coord)
return out
@torch.jit.unused
def _set_aux_loss(self, outputs_class, outputs_coord):
# this is a workaround to make torchscript happy, as torchscript
# doesn't support dictionary with non-homogeneous values, such
# as a dict having both a Tensor and a list.
return [{
'pred_logits': a, 'pred_boxes': b}
for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]
class SetCriterion(nn.Module):
""" This class computes the loss for DETR.计算损失
The process happens in two steps:
1) we compute hungarian assignment between ground truth boxes and the outputs of the model匈牙利匹配
2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
"""
def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses):
""" Create the criterion.
Parameters:
num_classes: number of object categories, omitting the special no-object category
matcher: module able to compute a matching between targets and proposals
weight_dict: dict containing as key the names of the losses and as values their relative weight
eos_coef: relative classification weight applied to the no-object category
losses: list of all the losses to be applied. See get_loss for list of available losses.
"""
super().__init__()
self.num_classes = num_classes
self.matcher = matcher
self.weight_dict = weight_dict # 各种损失的权重
self.eos_coef = eos_coef # 针对背景分类的损失权重
self.losses = losses
empty_weight = torch.ones(self.num_classes + 1)
empty_weight[-1] = self.eos_coef
self.register_buffer('empty_weight', empty_weight) # registered as a buffer: saved in the state_dict but never updated by gradients
def loss_labels(self, outputs, targets, indices, num_boxes, log=True):
"""Classification loss (NLL)
targets dicts must contain the key "labels" containing a tensor of dim [nb_target_boxes]
"""
assert 'pred_logits' in outputs
# (b,num_queries = 100,num_classes+1)
src_logits = outputs['pred_logits']
# indices holds the matched prediction (query) indices and the matched GT indices
# _get_src_permutation_idx returns a tuple of the batch index (which image in the batch) and the query index (which query in that image) of every matched prediction
idx = self._get_src_permutation_idx(indices)
# labels of all matched GT objects
target_classes_o = torch.cat([t["labels"][J] for t, (_, J) in zip(targets, indices)])
# initialize every slot to the background class
target_classes = torch.full(src_logits.shape[:2], self.num_classes,
dtype=torch.int64, device=src_logits.device)
# overwrite the matched prediction slots with the matched GT labels
target_classes[idx] = target_classes_o
loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)
losses = {
'loss_ce': loss_ce}
if log:
# TODO this should probably be a separate loss, not hacked in this one here
losses['class_error'] = 100 - accuracy(src_logits[idx], target_classes_o)[0]
return losses
@torch.no_grad()
def loss_cardinality(self, outputs, targets, indices, num_boxes):
""" Compute the cardinality error, ie the absolute error in the number of predicted non-empty boxes
This is not really a loss, it is intended for logging purposes only. It doesn't propagate gradients
"""
pred_logits = outputs['pred_logits']
device = pred_logits.device
tgt_lengths = torch.as_tensor([len(v["labels"]) for v in targets], device=device)
# Count the number of predictions that are NOT "no-object" (which is the last class)
card_pred = (pred_logits.argmax(-1) != pred_logits.shape[-1] - 1).sum(1)
card_err = F.l1_loss(card_pred.float(), tgt_lengths.float())
losses = {
'cardinality_error': card_err}
return losses
def loss_boxes(self, outputs, targets, indices, num_boxes):
"""Compute the losses related to the bounding boxes, the L1 regression loss and the GIoU loss
targets dicts must contain the key "boxes" containing a tensor of dim [nb_target_boxes, 4]
The target boxes are expected in format (center_x, center_y, w, h), normalized by the image size.
回归loss
"""
assert 'pred_boxes' in outputs
idx = self._get_src_permutation_idx(indices)
src_boxes = outputs['pred_boxes'][idx]
target_boxes = torch.cat([t['boxes'][i] for t, (_, i) in zip(targets, indices)], dim=0)
loss_bbox = F.l1_loss(src_boxes, target_boxes, reduction='none')
losses = {
}
losses['loss_bbox'] = loss_bbox.sum() / num_boxes
loss_giou = 1 - torch.diag(box_ops.generalized_box_iou(
box_ops.box_cxcywh_to_xyxy(src_boxes),
box_ops.box_cxcywh_to_xyxy(target_boxes)))
losses['loss_giou'] = loss_giou.sum() / num_boxes
return losses
def loss_masks(self, outputs, targets, indices, num_boxes):
"""Compute the losses related to the masks: the focal loss and the dice loss.
targets dicts must contain the key "masks" containing a tensor of dim [nb_target_boxes, h, w]
"""
assert "pred_masks" in outputs
src_idx = self._get_src_permutation_idx(indices)
tgt_idx = self._get_tgt_permutation_idx(indices)
src_masks = outputs["pred_masks"]
src_masks = src_masks[src_idx]
masks = [t["masks"] for t in targets]
# TODO use valid to mask invalid areas due to padding in loss
target_masks, valid = nested_tensor_from_tensor_list(masks).decompose()
target_masks = target_masks.to(src_masks)
target_masks = target_masks[tgt_idx]
# upsample predictions to the target size
src_masks = interpolate(src_masks[:, None], size=target_masks.shape[-2:],
mode="bilinear", align_corners=False)
src_masks = src_masks[:, 0].flatten(1)
target_masks = target_masks.flatten(1)
target_masks = target_masks.view(src_masks.shape)
losses = {
"loss_mask": sigmoid_focal_loss(src_masks, target_masks, num_boxes),
"loss_dice": dice_loss(src_masks, target_masks, num_boxes),
}
return losses
def _get_src_permutation_idx(self, indices):
# permute predictions following indices
batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])
src_idx = torch.cat([src for (src, _) in indices])
return batch_idx, src_idx
def _get_tgt_permutation_idx(self, indices):
# permute targets following indices
batch_idx = torch.cat([torch.full_like(tgt, i) for i, (_, tgt) in enumerate(indices)])
tgt_idx = torch.cat([tgt for (_, tgt) in indices])
return batch_idx, tgt_idx
def get_loss(self, loss, outputs, targets, indices, num_boxes, **kwargs):
loss_map = {
'labels': self.loss_labels,
'cardinality': self.loss_cardinality,
'boxes': self.loss_boxes,
'masks': self.loss_masks
}
assert loss in loss_map, f'do you really want to compute {loss} loss?'
return loss_map[loss](outputs, targets, indices, num_boxes, **kwargs)
def forward(self, outputs, targets):
""" This performs the loss computation.
Parameters:
outputs: dict of tensors, see the output specification of the model for the format
targets: list of dicts, such that len(targets) == batch_size.
The expected keys in each dict depends on the losses applied, see each loss' doc
"""
# outputs is the DETR model output, a dict of the form:
# {'pred_logits': (b, num_queries=100, num_classes),
#  'pred_boxes': (b, num_queries=100, 4),
#  'aux_outputs': [...]}
outputs_without_aux = {
k: v for k, v in outputs.items() if k != 'aux_outputs'} # drop the intermediate-layer outputs, keeping only the last layer's predictions
# Retrieve the matching between the outputs of the last layer and the targets
# match the model's predictions against the GT
indices = self.matcher(outputs_without_aux, targets)
# a list of tuples, one per batch element; each tuple (index_i, index_j) gives the prediction indices and the corresponding GT indices
# Compute the average number of target boxes accross all nodes, for normalization purposes
num_boxes = sum(len(t["labels"]) for t in targets)
num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
if is_dist_avail_and_initialized():
torch.distributed.all_reduce(num_boxes)
num_boxes = torch.clamp(num_boxes / get_world_size(), min=1).item()
# Compute all the requested losses
losses = {
}
for loss in self.losses:
losses.update(self.get_loss(loss, outputs, targets, indices, num_boxes))
# In case of auxiliary losses, we repeat this process with the output of each intermediate layer.
if 'aux_outputs' in outputs:
for i, aux_outputs in enumerate(outputs['aux_outputs']):
indices = self.matcher(aux_outputs, targets)
for loss in self.losses:
if loss == 'masks':
# Intermediate masks losses are too costly to compute, we ignore them.
continue
kwargs = {
}
if loss == 'labels':
# Logging is enabled only for the last layer
kwargs = {
'log': False}
l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_boxes, **kwargs)
l_dict = {
k + f'_{i}': v for k, v in l_dict.items()}
losses.update(l_dict)
return losses
class PostProcess(nn.Module):
""" This module converts the model's output into the format expected by the coco api"""
# 此模块将模型的输出转换为coco api期望的格式
@torch.no_grad()
def forward(self, outputs, target_sizes):
""" Perform the computation
Parameters:
outputs: raw outputs of the model
target_sizes: tensor of dimension [batch_size x 2] containing the size of each images of the batch
For evaluation, this must be the original image size (before any data augmentation)
For visualization, this should be the image size after data augment, but before padding
"""
out_logits, out_bbox = outputs['pred_logits'], outputs['pred_boxes']
assert len(out_logits) == len(target_sizes)
assert target_sizes.shape[1] == 2
# (b,num_queries = 100,num_classes +1)
prob = F.softmax(out_logits, -1)
# (b, num_queries = 100),(b, num_queries = 100) # coco api 测评不包括背景类所以prob[..., :-1]是到-1
scores, labels = prob[..., :-1].max(-1)
# convert to [x0, y0, x1, y1] format
boxes = box_ops.box_cxcywh_to_xyxy(out_bbox)
# and from relative [0, 1] to absolute [0, height] coordinates
img_h, img_w = target_sizes.unbind(1)
scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
# (b,num_queries = 100,4)*(b,1,4)
boxes = boxes * scale_fct[:, None, :]
results = [{
'scores': s, 'labels': l, 'boxes': b} for s, l, b in zip(scores, labels, boxes)]
return results
class MLP(nn.Module):
""" Very simple multi-layer perceptron (also called FFN)"""
def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
super().__init__()
self.num_layers = num_layers
h = [hidden_dim] * (num_layers - 1)
self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))
def forward(self, x):
for i, layer in enumerate(self.layers):
# 最后一层不使用relu
x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
return x
def build(args):
# the `num_classes` naming here is somewhat misleading.
# it indeed corresponds to `max_obj_id + 1`, where max_obj_id
# is the maximum id for a class in your dataset. For example,
# COCO has a max_obj_id of 90, so we pass `num_classes` to be 91.
# As another example, for a dataset that has a single class with id 1,
# you should pass `num_classes` to be 2 (max_obj_id + 1).
# For more details on this, check the following discussion
# https://github.com/facebookresearch/detr/issues/108#issuecomment-650269223
num_classes = 20 if args.dataset_file != 'coco' else 91
if args.dataset_file == "coco_panoptic":
# for panoptic全景, we just add a num_classes that is large enough to hold
# max_obj_id + 1, but the exact value doesn't really matter
num_classes = 250
device = torch.device(args.device)
backbone = build_backbone(args)
transformer = build_transformer(args)
model = DETR(
backbone,
transformer,
num_classes=num_classes,
num_queries=args.num_queries,
aux_loss=args.aux_loss,
)
if args.masks: # 设置了mask,代表分割任务
model = DETRsegm(model, freeze_detr=(args.frozen_weights is not None))
matcher = build_matcher(args)
weight_dict = {
'loss_ce': 1, 'loss_bbox': args.bbox_loss_coef}
weight_dict['loss_giou'] = args.giou_loss_coef
if args.masks:
weight_dict["loss_mask"] = args.mask_loss_coef
weight_dict["loss_dice"] = args.dice_loss_coef
# TODO this is a hack
if args.aux_loss:
aux_weight_dict = {
}
for i in range(args.dec_layers - 1):
# 为中间层的loss也加上权重
aux_weight_dict.update({
k + f'_{i}': v for k, v in weight_dict.items()})
weight_dict.update(aux_weight_dict)
losses = ['labels', 'boxes', 'cardinality']
if args.masks:
losses += ["masks"]
criterion = SetCriterion(num_classes, matcher=matcher, weight_dict=weight_dict,
eos_coef=args.eos_coef, losses=losses)
criterion.to(device)
postprocessors = {
'bbox': PostProcess()}
if args.masks:
postprocessors['segm'] = PostProcessSegm()
if args.dataset_file == "coco_panoptic":
is_thing_map = {
i: i <= 90 for i in range(201)}
postprocessors["panoptic"] = PostProcessPanoptic(is_thing_map, threshold=0.85)
return model, criterion, postprocessors
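At training time, the criterion's output dict is reduced to a single scalar using weight_dict. A sketch of that step (mirroring what the repository's training loop does; the function name and arguments here are illustrative):
def compute_total_loss(model, criterion, samples, targets):
    """Sketch of the training-step loss reduction, following the modules built above."""
    outputs = model(samples)
    loss_dict = criterion(outputs, targets)
    weight_dict = criterion.weight_dict
    # Weighted sum over every loss term that has a weight (this includes the per-decoder-layer aux copies).
    return sum(loss_dict[k] * weight_dict[k] for k in loss_dict if k in weight_dict)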
After preprocessing, different batches may have different sizes, but all images within one batch must share the same size. When the images in a batch differ in size, they are made uniform by zero-padding each image up to the largest size in the batch; every image gets a mask matrix in which the padded positions are set to 1 (True).
class NestedTensor(object):
def __init__(self, tensors, mask: Optional[Tensor]):
self.tensors = tensors
self.mask = mask
def to(self, device):
# type: (Device) -> NestedTensor # noqa
cast_tensor = self.tensors.to(device)
mask = self.mask
if mask is not None:
assert mask is not None
cast_mask = mask.to(device)
else:
cast_mask = None
return NestedTensor(cast_tensor, cast_mask)
def decompose(self):
return self.tensors, self.mask
def __repr__(self):
return str(self.tensors)
def nested_tensor_from_tensor_list(tensor_list: List[Tensor]):
# TODO make this more general
if tensor_list[0].ndim == 3:
if torchvision._is_tracing():
# nested_tensor_from_tensor_list() does not export well to ONNX
# call _onnx_nested_tensor_from_tensor_list() instead
return _onnx_nested_tensor_from_tensor_list(tensor_list)
# TODO make it support different-sized images
max_size = _max_by_axis([list(img.shape) for img in tensor_list])
# min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))
batch_shape = [len(tensor_list)] + max_size
b, c, h, w = batch_shape
dtype = tensor_list[0].dtype
device = tensor_list[0].device
tensor = torch.zeros(batch_shape, dtype=dtype, device=device)
mask = torch.ones((b, h, w), dtype=torch.bool, device=device)
for img, pad_img, m in zip(tensor_list, tensor, mask):
pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)
m[: img.shape[1], :img.shape[2]] = False
else:
raise ValueError('not supported')
return NestedTensor(tensor, mask)
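A small usage illustration (a sketch; it assumes the two definitions above and the helpers they call, such as _max_by_axis, are importable): two images of different sizes are padded to a common shape, and the mask marks the padded pixels.
import torch

imgs = [torch.randn(3, 200, 300), torch.randn(3, 180, 320)]
batch = nested_tensor_from_tensor_list(imgs)
print(batch.tensors.shape)  # torch.Size([2, 3, 200, 320]) -- max H and max W in the batch
print(batch.mask.shape)     # torch.Size([2, 200, 320]); True where a pixel is padding
print(batch.mask[0, :, 300:].all().item())  # True: columns 300..319 of image 0 are padding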
(The remaining utility code is omitted.)