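I have been reading a lot of papers lately looking for a breakthrough, so rather than writing one blog post per paper I am recording my notes here (the more interesting papers will get their own posts).
Contents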
《Scale-Transferrable Object Detection》
《Weakly Supervised Phrase Localization with Multi-Scale Anchored Transformer Network》
《Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation》
《Quantization Mimic: Towards Very Tiny CNN for Object Detection》
《End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation》
《Multi-Scale Weighted Nuclear Norm Image Restoration》
《Learning Cross-Modal Deep Representations for Robust Pedestrian Detection》
《Multi-scale Location-aware Kernel Representation for Object Detection》
《CRRN: Multi-Scale Guided Concurrent Reflection Removal Network》
《High-speed Tracking with Multi-kernel Correlation Filters》
《Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition》
《Chained Cascade Network for Object Detection》
The feature chaining structure
《ViP-CNN: Visual Phrase Guided Convolutional Neural Network》
Triplet Proposal with NMS
《Distributable Consistent Multi-Object Matching》
《Learning with Side Information through Modality Hallucination》
《Deep Continuous Conditional Random Fields with Asymmetric Inter-object Constraints for Online Multi-object Tracking》
《Crowd Counting using Deep Recurrent Spatial-Aware Network》
《Hybrid Task Cascade for Instance Segmentation》
《WIDER Face and Pedestrian Challenge 2018 Methods and Results》
《SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction》
《Multi-person Articulated Tracking with Spatial and Temporal Embeddings》
《Libra R-CNN: Towards Balanced Learning for Object Detection》
Balanced Feature Pyramid
《Contextualized Spatial-Temporal Network for Taxi Origin-Destination Demand Prediction》
《Improving Action Localization by Progressive Cross-stream Cooperation》
《Fast Full-Search Equivalent Pattern Matching Using Asymmetric Haar Wavelet Packets》
《Intrinsic Image Transformation via Scale Space Decomposition》
《Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation》
《Deep Kalman Filtering Network for Video Compression Artifact Reduction》
《Neural Network Encapsulation》
《Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection》
《Perceptual Image Enhancement by Relativistic Discriminant Learning With Cross-Scale Aggregated Representation》
《Cross Modal Distillation for Supervision Transfer》
《Zoom Out-and-In Network with Recursive Training for Object Proposal》
《Learning Deep Representations for Scene Labeling with Semantic Context Guided Supervision》
《Learnable Histogram: Statistical Context Features for Deep Neural Networks》
《FEATURE INTERTWINER FOR OBJECT DETECTION》
《DeepID-Net: Object Detection with Deformable Part Based Convolutional Neural Networks》
《Factorizable Net: An Efficient Subgraph-based Framework for Scene Graph Generation》
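Papers to read next:
Scale transferability.
This paper proposes the Scale-Transferrable Detection Network (STDN) for detecting multi-scale objects. Super-resolution layers are embedded in the network to explicitly explore the inter-scale consistency across multiple detection scales.
The goal is to obtain high-level semantic multi-scale feature maps.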
The role of DenseNet is to integrate low-level and high-level features within a CNN to get more powerful features
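STM consists of pooling and scale-transfer layers. The pooling layer is used to obtain small-scale feature maps, and the scale-transfer layer is used to obtain large-scale feature maps.
The network architecture is shown in the figure below.
The so-called Scale Transfer Module is really just the one from ESPCN.
So here, too, multi-scale information is used to get better results.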
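A minimal NumPy sketch of the two STM operations described above. It assumes that the scale-transfer layer is the ESPCN-style periodic shuffling that moves groups of channels into spatial positions, and that the pooling layer is mean pooling; both are assumptions on my part, and the names `scale_transfer` and `mean_pool` are made up for illustration.

```python
import numpy as np

def scale_transfer(x, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r).

    Assumed ESPCN-style shuffle: out[c, h*r+i, w*r+j] = x[c*r*r + i*r + j, h, w].
    """
    c_in, h, w = x.shape
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

def mean_pool(x, k):
    """Non-overlapping k x k mean pooling on a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

feat = np.arange(16 * 4 * 4, dtype=np.float32).reshape(16, 4, 4)
large = scale_transfer(feat, 2)   # fewer channels, larger spatial size
small = mean_pool(feat, 2)        # same channels, smaller spatial size
```

The shuffle has no parameters: it only trades channel depth for resolution, which is why it is cheap to embed in a detector.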
Multi-scale Anchored Transformer Network (MATN), to accurately localize free-form textual phrases with only image-level supervision.
From this second paper you can already sense that more and more recent work is about multi-scale.
takes region proposals as localization anchors, and learns a multi-scale correspondence network to continuously search for phrase regions referring to the anchors.
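Generate a scene graph to describe the interactions between objects in an image. However, most of the previous methods use complicated structures with slow inference speed or rely on external data.
This paper uses a very small network for object detection. The authors adopt a method called Quantization Mimic. (This paper should be useful for topics such as small networks and model compression.)
Human pose estimation.
It is difficult to incorporate domain prior knowledge such as geometric relationships among body parts into DCNNs. The geometric relationships of the body are what ensure that the joints are consistent, as shown in the figure below.
Without the geometric relationships of the body, the prediction is easily disturbed and prone to false detections; once they are taken into account, this is avoided. Could this, to some extent, be seen as giving the network more priors? It underlines the value of domain knowledge.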
In this paper, we propose to incorporate the DCNN and the expressive mixture of parts model into an end-to-end framework. This enables us to predict the body part locations with the consideration of global pose configurations during the training stage, hence our framework is able to predict heat-maps with less false positives,
Jointly learning the DCNN with the deformable model makes the feature learning more effective in handling the negative samples that are difficult when taking the full body pose into account. In addition, we explicitly incorporate human pose priors, including body part mixture types and standard quadratic deformation constraints, into our model.
contribution:
1、We design a novel message passing layer, which is flexible to build tree-structured models or loopy models with appearance mixtures.
2、An end-to-end deep CNN framework for human pose estimation is proposed. By jointly learning DCNNs with deformable mixture of parts models, global pose consistency is considered. Hence our framework is able to reduce the ambiguity and mine hard negatives effectively when learning features and part deformation.
3、Domain knowledge is integrated. The quadratic deformation constraints reduce the parameter space when modeling the spatial and appearance mixture relationships between parts.
A prominent property of natural images is that groups of similar patches within them tend to lie on low-dimensional subspaces.
capable of handling arbitrary degradations (e.g. blur, missing pixels, etc.).
The paper proposes a novel regularization term (regularization?).
inpainting (filling in missing image regions)
contribution:
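1、Exploits the property that groups of similar patches tend to lie on low-dimensional subspaces, adopting the weighted nuclear norm minimization (WNNM) framework.
2、Small patches tend to recur not only within the same scale but also across different scales in natural images.
3、Proposes a regularization term that takes all patch groups into account and uses the expected patch log-likelihood (EPLL) approach, so that the regularizer is decoupled from the data and can therefore serve different restoration tasks.
This paper addresses pedestrian detection under adverse illumination conditions.
A novel cross-modality learning framework, based on two main phases. First, given a multimodal dataset, a CNN is used to learn a non-linear mapping between RGB and thermal data. The learned feature representations are then passed to a second CNN, which outputs the detection results. This keeps pedestrian detection working when illumination is poor.
Learning deep representations from cross-modal data is greatly beneficial for detection and recognition tasks. However, most methods require large labeled datasets.
The authors propose an approach for learning cross-modal representations for pedestrian detection which does not require pedestrian bounding box annotations. Using information from multispectral data, a CNN learns a non-linear mapping between RGB and thermal images without human supervision. This cross-modal mapping is then exploited by integrating the learned representations into a second deep architecture, operating on RGB data and effectively modeling multi-scale information. (In effect, the task needs both the thermal image and the input; since the thermal image is estimated by a network, the input alone is enough to complete the task.) The method is shown in the figure below.
As the figure below shows, exploiting multispectral information improves detection accuracy.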
contributions
1、a novel approach for learning and transferring cross-modal feature representations for pedestrian detection.
There are two fundamental advantages in our strategy. First, multispectral data are not employed at the test phase. This is crucial when deploying robotics and surveillance systems, as only traditional cameras are needed, significantly decreasing costs. Second, no pedestrian annotations are required in the thermal domain. This greatly reduces human labeling efforts and permits to exploit large data collections of RGB-thermal image pairs.
2、Achieves pedestrian detection under low illumination, through unsupervised cross-modal feature learning and effective transfer of the learned representations.
Learning and transferring cross-modal deep representations
reconstruct thermal data from RGB input and to transfer the learned crossmodal representations for the purpose of robust pedestrian detection.
The first network is the Region Reconstruction Network (RRN), a fully convolutional network trained on pedestrian proposals collected from RGB-thermal image pairs in an unsupervised manner. It is used to obtain thermal maps from RGB.
The second network is the Multi-Scale Detection Network (MSDN), which takes only RGB data and embeds the parameters transferred from RRN. MSDN takes a whole RGB image and a number of pedestrian proposals as input and outputs the detected bounding boxes with associated scores. In the test phase, detection is performed with MSDN and only RGB inputs are needed.
Region Reconstruction Network
The design of RRN is based on two considerations.
First, in order to avoid human annotation efforts, thermal information should be recovered with an unsupervised approach. While our approach uses the thermal image as deep supervision for the reconstruction task, it essentially requires only very weak supervision information.
Second, as multispectral data are expected to be especially useful for hard positive and negative samples, instead of attempting to reconstruct the entire thermal images, it is more appropriate to specifically focus on bounding boxes which are likely to contain pedestrians.
Therefore, in this paper we propose to exploit a pretrained generic pedestrian detector to extract a set of pedestrian proposals (containing true positives and false positives) from RGB data and design a deep model which reconstructs the associated thermal information.
The structure of RRN is shown in the figure below.
The input of RRN is a three-channel RGB image and a set of associated pedestrian proposals. RRN consists of a front-end convolutional subnetwork (VGG-13 network structure) and a back-end reconstruction subnetwork.
For ROI pooling, see https://blog.csdn.net/auto1993/article/details/78514071
https://www.cnblogs.com/wangyong/p/8523814.html
The feature maps after ROI pooling are upsampled by deconvolution.
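As a reminder of what ROI pooling does before the upsampling step, here is a minimal NumPy sketch of ROI max pooling for a single proposal. It is a generic illustration, not the RRN code: the function name `roi_max_pool`, the box convention (feature-map coordinates, end-exclusive) and the bin rounding are my own assumptions.

```python
import numpy as np

def roi_max_pool(feat, box, out_size=7):
    """Max-pool the region `box` of a (C, H, W) map into (C, out_size, out_size).

    box = (x1, y1, x2, y2) in feature-map coordinates, end-exclusive (assumed).
    """
    x1, y1, x2, y2 = box
    region = feat[:, y1:y2, x1:x2]
    c, h, w = region.shape
    ys = np.linspace(0, h, out_size + 1)   # bin edges along height
    xs = np.linspace(0, w, out_size + 1)   # bin edges along width
    out = np.empty((c, out_size, out_size), dtype=feat.dtype)
    for i in range(out_size):
        y_lo = int(np.floor(ys[i]))
        y_hi = max(int(np.ceil(ys[i + 1])), y_lo + 1)   # at least one row per bin
        for j in range(out_size):
            x_lo = int(np.floor(xs[j]))
            x_hi = max(int(np.ceil(xs[j + 1])), x_lo + 1)
            out[:, i, j] = region[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out

feat = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
pooled = roi_max_pool(feat, (0, 0, 4, 4), out_size=2)
```

Whatever the proposal size, the output has a fixed spatial size, which is what lets the later layers treat every proposal the same way.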
Multi-Scale Detection Network
MSDN is specifically designed to perform pedestrian detection from RGB images by exploiting the cross-modal representations learned with RRN.
A detection network which fuses multiple feature maps derived from ROI pooling layers. The structure is shown in the figure below.
An RoI (Region of Interest) pooling layer is applied to the last two convolutional blocks to extract feature maps of size 512 × 7 × 7 for each pedestrian proposal.
The feature maps derived from the RoI pooling layers of the two sub-networks are then combined with a concatenation layer
The pretrained VGG-16 model is also utilized to initialize Sub-Net A. The convolutional layers of Sub-Net B are initialized with the corresponding parameters of RRN. Then, fine-tuning is performed using the RGB data of the target domain. The whole MSDN optimization is based on back-propagation with Stochastic Gradient Descent (SGD).
Faster R-CNN only explores simple first-order representations for object detection, whereas this paper explores high-order statistics to produce more discriminative features and improve performance.
a novel Multi-scale Location-aware Kernel Representation (MLKP) to capture high-order statistics of deep features in proposals.
The multi-scale location-aware kernel representation can be regarded as a new form of representation for obtaining high-order features, built on a low-dimensional polynomial kernel approximation of the multi-scale feature map.
High-order statistics representations can capture more discriminative information than first-order ones. (How are the high-order representations obtained? Do high-order representations make better use of the features?)
The figure below shows the framework of the proposed network. It takes features from several convolutional blocks and concatenates them into a single feature map, then computes high-order features of that feature map. The authors adopt polynomial kernel approximation based high-order methods, which can efficiently generate low-dimensional high-order representations. The kernel representation can be reformulated as a 1 × 1 convolution operation followed by an element-wise product.
First, a modified multi-scale feature map is used to effectively utilize multi-resolution information.
Then, a low-dimensional high-order representation is obtained by polynomial kernel function approximation.
Furthermore, a trainable location-weight structure is incorporated into the polynomial kernel function approximation, resulting in a location-aware kernel representation.
Finally, object detection is performed (this step can be ignored here).
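To make "1 × 1 convolution followed by element-wise product" concrete, here is a rough NumPy sketch of a polynomial-kernel style representation. It assumes that the order-r term is the element-wise product of r independent 1 × 1 convolutions of the same feature map, that the orders are concatenated along channels, and that the location weight is a single (H, W) map shared by all orders. These are my simplifications, not the exact MLKP formulation, and all names are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (C, H, W), w is (D, C), result is (D, H, W)."""
    return np.einsum('dc,chw->dhw', w, x)

def kernel_representation(x, weights, loc_weight=None):
    """weights[r-1] is a list of r matrices of shape (D, C) for the order-r term."""
    outs = []
    for order_ws in weights:
        z = conv1x1(x, order_ws[0])
        for w in order_ws[1:]:
            z = z * conv1x1(x, w)      # element-wise product raises the order
        outs.append(z)
    out = np.concatenate(outs, axis=0)  # stack all orders along channels
    if loc_weight is not None:
        out = out * loc_weight[None]    # assumed: one weight per spatial location
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))
weights = [
    [rng.standard_normal((3, 4))],                                # order 1
    [rng.standard_normal((3, 4)), rng.standard_normal((3, 4))],   # order 2
]
z = kernel_representation(x, weights, loc_weight=np.ones((5, 5)))
```

Each order costs only a few 1 × 1 convolutions, so the output dimension stays at R × D instead of growing like C^r.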
Multi-scale Feature Map
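Earlier convolutional layers have higher resolution, which helps with detecting small objects. For super-resolution, might the earlier feature maps likewise help recover details?
Combining the feature maps of different convolutional layers can therefore improve performance. Existing multi-scale object detection networks all use the feature maps of the last layer. This paper differs, as shown in Figure 3 below: it first merges the feature maps of the different conv layers within the same conv block, and then adds the feature maps of the different blocks together (somewhat like what FishNet does).
Location-aware Kernel Representation
Removing undesired reflections from images taken through glass has wide application in various computer vision tasks.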
Concurrent Reflection Removal Network (CRRN)
Motion representation. This paper may be quite helpful for people working on video super-resolution; here I only look at the network structure.
Optical Flow guided Feature (OFF), which enables the network to distill temporal information through a fast and robust approach.
Extract features from multiple layers on a specific level with the same resolution by concatenating them together and feed them into one OFF unit.
Cascade is a widely used approach that rejects obvious negative samples at early stages for learning better classifier and faster inference.
Chained cascade network. Before this work, cascading was done at shallow layers.
The early cascade stage and contextual cascade stage are used for learning more and more discriminative features. By rejecting easy samples at shallow layers in the network, the features and classifiers learned at deeper layers or extra branches focus on harder samples.
This paper adopts the Fast R-CNN framework for illustrating the CC-Net for object detection. The figure below shows the structure of the proposed network.
Preparation of features with diversity. (Ensuring feature richness is one way to use features effectively: 1. through multi-scale processing; 2. by preserving features, so that earlier features are also used in the final task; 3. by using sub-networks to produce features that can be exploited effectively. After that comes feature fusion.)
Multi-region, multi-context features were found to be effective
In order to obtain features with diversity, we apply roi-pooling from image features using different contextual regions and resolutions
The roi-pooled features have the same number of channels but have different sizes at different stages (same channels, different sizes, so feature diversity is ensured through multiple scales). These sizes are heuristically selected to have features with different contexts.
The figure below shows the contextual regions for features at different stages. These features are arranged with increasing contextual regions.
Visual phrase guidance.
This paper studies visual relationship detection.
Phrase-guided Message Passing Structure (PMPS)
Deep networks typically pass information feedforward. In contrast, PMPS extracts useful information from other components in the phrase to refine the features before going to the next level.
A new gather-broadcast message passing flow mechanism is proposed, which can be applied to various layers across deep models.
An overview of the model is shown in the figure below. VGG-Net serves as the basic building block. The model has two main parts, triplet proposal and phrase recognition, which share most of the convolutional layers so that inference is faster.
Taking the output of Conv4_3 as input, three convolutional layers are used for extracting CNN features. The features are then used for proposing class-free regions of interest (ROIs) using the approach of RPN. By grouping these ROIs, the triplet proposal is obtained. The predicate ROI is the box that tightly covers both the subject and the object. Due to the sparsity of relationship annotations, triplet non-maximum suppression (triplet NMS) is proposed to reduce the redundancy.
First, an RPN (trained using both subject and object bounding boxes) is used to generate object proposals.
Phrase-guided Message Passing Structure
An ordinary CNN only considers visual dependencies.
Under this setting, connections among subject, predicate and object are omitted. To make use of the complementary information from the other components, the authors propose the PMPS structure, which extracts useful information from the source branch to refine features of the destination branch. Passing messages horizontally in a deep model helps the branches improve each other.
Predicate captures the general information about the phrase while subject and object focus on the details.
To reflect the importance of the predicate within the triplet, we place the predicate at the dominant position and specifically design the gather-and-broadcast message passing flow.
In the message passing flow, the predicate first gathers the messages from the subject and object as follows
At the next layer, the predicate broadcasts message to the subject and the object as follows
The gather flow collects the information from subject and object to refine the visual features of the predicate. Then, in the broadcast flow, the global visual information of interaction is broadcast back to the subject and object as context.
The gather-and-broadcast flow has sequential and parallel implementations, as shown in the figure below:
divide the input object collection into overlapping sub-collections and enforce map consistency among each sub-collection
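(This paper achieves the information exchange by designing the loss. Interesting!)
This paper seems very similar to Prof. Ouyang's paper introduced earlier, 《Learning Cross-Modal Deep Representations for Robust Pedestrian Detection》.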
Our hallucination network learns a new and complementary RGB image representation which is trained to mimic depth mid-level features. This new representation is combined with the RGB image representation learned through standard fine-tuning
The hallucination network takes as input an RGB image and a set of regions of interest and produces detection scores for each category and for each region.
To cause the depth modality to share information with the RGB modality through this hallucination network, the authors add a regression loss between the hallucination and depth layers. Essentially, this loss guides the hallucination network to extract features from an RGB image which mimic the responses extracted from the corresponding depth image.
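A tiny sketch of such a mimicking loss, assuming a plain squared L2 distance between the hallucination features and the depth features at the paired layer. The exact form used in the paper (for example any activation applied before the difference) is not given in these notes, so treat the function `hallucination_loss` as a hypothetical stand-in.

```python
import numpy as np

def hallucination_loss(hall_feat, depth_feat):
    """Squared L2 distance between hallucinated and real depth features (assumed form)."""
    diff = np.asarray(hall_feat, dtype=np.float64) - np.asarray(depth_feat, dtype=np.float64)
    return float(np.sum(diff ** 2))

depth_feat = np.array([0.0, 0.0])
hall_feat = np.array([1.0, 2.0])
loss = hallucination_loss(hall_feat, depth_feat)
```

During training the depth features act as the regression target, so the loss is zero only when the RGB-only hallucination branch reproduces the depth responses exactly.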
Online Multi-Object Tracking (MOT)
Asymmetric Inter-object Constraints
Deep Continuous Conditional Random Field (DCCRF)
The DCCRF consists of unary and pairwise terms.
The unary terms estimate tracked objects' displacements across time based on visual appearance information. They are modeled as deep Convolutional Neural Networks, which are able to learn discriminative visual features for tracklet association.
The asymmetric pairwise terms model inter-object relations in an asymmetric way, which encourages high-confidence tracklets to help correct errors of low-confidence tracklets while not being affected much by the low-confidence ones.
For crowd counting in images, traditional methods address these challenges with fixed multi-scale architectures, which usually cannot cover a wide range of scales and ignore rotation variations.
The proposed architecture has a learnable spatial transform module with a region-wise refinement process.
Recurrent Spatial-Aware Refinement (RSAR) module iteratively conducting two components:
i) a Spatial Transformer Network that dynamically locates an attentional region from the crowd density map and transforms it to the suitable scale and rotation for optimal crowd estimation;
ii) a Local Refinement Network that refines the density map of the attended region with residual learning.
Global Feature Embedding (GFE) module
takes the whole image as input for global feature extraction
transform the input image into high-dimensional feature maps,
Recurrent Spatial-Aware Refinement (RSAR) module
iteratively locate image regions with a spatial transformer-based attention mechanism and refine the attended density map region with residual learning.
Cascade is a classic yet powerful architecture that has boosted performance on various tasks
Hybrid Task Cascade (HTC)
We find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation.
1、First, instead of performing cascaded refinement on these two tasks separately, it interweaves them for joint multi-stage processing.
2、A fully convolutional branch to provide spatial context, which can help distinguish hard foreground from cluttered background.
a unified framework for multi-person pose estimation and tracking
Our framework consists of two main components, i.e. SpatialNet and TemporalNet
Compared with model architectures, the training process, which is also crucial to the success of detectors, has received relatively less attention in object detection.
In this work, we carefully revisit the standard training practice of detectors, and find that the detection performance is often
limited by the imbalance during the training process, which generally consists in three levels – sample level, feature level, and objective level. This paper studies the training process.
IoU-balanced sampling, balanced feature pyramid, and balanced L1 loss, respectively for reducing the imbalance at sample, feature, and objective level.
Sample level imbalance
hard samples are particularly valuable as they are more effective to improve the detection performance. However, the random sampling scheme usually results in the selected samples dominated by easy ones.
Feature level imbalance
Deep high-level features in backbones have more semantic meaning, while the shallow low-level features are more content descriptive.
These methods inspire us that the low level and high-level information are complementary for object detection.
How they are used to integrate the pyramidal representations determines the detection performance.
Our study reveals that the integrated features should possess balanced information from each resolution. So what exactly counts as balanced information? Up to this point, the paper has not said.
But the sequential manner in the aforementioned methods makes the integrated features focus more on adjacent resolutions and less on others. The semantic information contained in non-adjacent levels is diluted once per fusion.
Objective level imbalance
A detector needs to carry out two tasks, i.e. classification and localization. Thus two different goals are incorporated in the training objective. If they are not properly balanced, one goal may be compromised, leading to suboptimal performance overall
The proposed network addresses each of the three issues above. Since this survey of mine only concerns feature maps, I focus on the second point only.
Balanced feature pyramid, which strengthens the multi-level features using the same deeply integrated balanced semantic features.
our approach relies on integrated balanced semantic features to strengthen original features. In this manner, each resolution in the pyramid obtains equal information from others, thus balancing the information flow and leading the features more discriminative.
Different from former approaches that integrate multi-level features using lateral connections, we strengthen the multi-level features using the same deeply integrated balanced semantic features, as shown in Figure 4 below.
Obtaining balanced semantic features. Features at resolution level $l$ are denoted as $C_l$. The number of multi-level features is denoted as $L$. The indexes of the lowest and highest levels involved are denoted as $l_{min}$ and $l_{max}$. In Figure 4, $C_2$ has the highest resolution. To integrate multi-level features and preserve their semantic hierarchy at the same time, we first resize the multi-level features $\{C_2, C_3, C_4, C_5\}$ to an intermediate size, such as the same size as $C_4$, with interpolation (or up-sampling) and max-pooling respectively. Once the features are rescaled, the balanced semantic features are obtained by simple averaging as
$$C = \frac{1}{L} \sum_{l=l_{min}}^{l_{max}} C_l$$
The obtained features are then rescaled using the same but reverse procedure to strengthen the original features. Each resolution obtains equal information from others in this procedure. Note that this procedure does not contain any parameter. We observe improvement with this nonparametric method, proving the effectiveness of the information flow.
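The resize, average and redistribute steps can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions: square feature maps whose sizes differ by integer factors, nearest-neighbor up-sampling in place of interpolation, and "strengthen" implemented as adding the rescaled balanced feature back to each level. The function names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, f):
    """Nearest-neighbor up-sampling of a (C, H, W) map by an integer factor f."""
    return x.repeat(f, axis=1).repeat(f, axis=2)

def max_pool(x, f):
    """Non-overlapping f x f max pooling of a (C, H, W) map."""
    c, h, w = x.shape
    return x.reshape(c, h // f, f, w // f, f).max(axis=(2, 4))

def resize_to(x, size):
    """Bring a square (C, H, H) map to spatial size `size`."""
    h = x.shape[1]
    if h == size:
        return x
    return upsample_nearest(x, size // h) if h < size else max_pool(x, h // size)

def balanced_feature_pyramid(feats, ref_idx=2):
    """feats: list of (C, H_l, H_l) maps, e.g. [C2, C3, C4, C5]; ref_idx picks C4."""
    size = feats[ref_idx].shape[1]
    # Parameter-free integration: rescale every level, then average.
    balanced = np.mean([resize_to(f, size) for f in feats], axis=0)
    # Reverse procedure: rescale the balanced map to each level and add it back.
    return [f + resize_to(balanced, f.shape[1]) for f in feats]

# Constant-valued levels C2..C5 make the averaging easy to check by hand.
feats = [np.full((1, s, s), v) for s, v in [(16, 1.0), (8, 2.0), (4, 3.0), (2, 4.0)]]
out = balanced_feature_pyramid(feats)
```

With these toy inputs the balanced map is 2.5 everywhere, so every level receives the same amount of information from all the others, which is the point of the design.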
Refining balanced semantic features. The balanced semantic features can be further refined to be more discriminative. Both refinement with plain convolutions and refinement with the non-local module [32] work well, but the non-local module is more stable. Therefore, the embedded Gaussian non-local attention is used as the default in this paper. The refining step enhances the integrated features and further improves the performance.
address this problem with a novel Contextualized Spatial-Temporal Network (CSTN), which consists of three components for the modeling of local spatial context (LSC), temporal evolution context (TEC) and global correlation context (GCC) respectively.
Firstly, a LSC module utilizes two convolution neural networks to learn the local spatial dependencies of taxi demand respectively from the origin view and the destination view. The output of the two networks would be combined to generate the final local spatial feature, which involves the hybrid information of taxi demand patterns from different views. Secondly, a TEC module incorporates both the local spatial features of taxi demand and the meteorological information to a CNN-LSTM network (convolutional long short-term memory network) for the analysis of taxi demand evolution. Thirdly, to capture the correlation between the far-apart regions, the GCC module computes the similarity between any two regions and generates the global correlation feature of each region by summing the features of all regions with the similarity weights. In this way, each region contains the information of all regions and it is mainly relevant to the regions that have high similarities with it. Finally, we integrate the local spatial-temporal feature generated by TEC module and the global correlation feature generated by GCC module to predict the future taxi origin-destination demand.
Inspired by the recent work [32], we capture the global correlation between all regions with a global feature fusion operation. Specifically, we generate the global correlation feature of each region as a weighted sum of all regional features, with the weights being calculated as the similarity between the corresponding region pairs. In this way, each region contains the information of all regions and it is mainly relevant to the regions of high similarities with it.
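The global correlation feature described above is essentially a similarity-weighted sum over regions. A minimal NumPy sketch, assuming dot-product similarity normalized with a softmax (the notes do not say how the similarity is computed or normalized, so both are my assumptions, as is the name `global_correlation`):

```python
import numpy as np

def global_correlation(feats):
    """feats: (N, D), one feature vector per region. Returns (N, D).

    Each output row is a weighted sum of all region features, with weights
    given by an assumed softmax over dot-product similarities.
    """
    sim = feats @ feats.T                          # (N, N) pairwise similarity
    sim = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ feats

regions = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
gcc = global_correlation(regions)
```

Because every row of the weight matrix is a full distribution over regions, far-apart regions can influence each other directly, which local convolutions cannot do.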
contributions
1、a new Progressive Cross-stream Cooperation (PCSC) framework to use both region proposals and features from one stream (i.e. Flow/RGB) to help another stream (i.e. RGB/Flow) to iteratively improve action localization results and generate better bounding boxes in an iterative fashion. To exploit the information from both streams at the region proposal level, we propose to combine the latest region proposals from both streams in order to collect a larger set of training samples.
2、(At the feature level)a new message passing approach to pass information from one stream to another stream in order to learn better representations, which also leads to better action detection models
a Progressive Cross-stream Cooperation (PCSC) model for action detection at the frame level. In this model, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both features level and region proposal level in order to achieve better localization results.
As shown in Fig. 1(a), our PCSC is composed of a set of “stages”. Each stage refers to one round of cross-stream cooperation(一轮的跨流交互), in which the features and region proposals from one stream will help improve the performance for another stream. Specifically, each stage comprises of two cooperation modules and a detection head module(which consists of several layers for region classification and regression).
For region-proposal-level cooperation,
We employ the two-stream Faster R-CNN method for frame-level action localization. Each stream has its own Region Proposal Network (RPN) to generate candidate action regions, and these candidates are then used as training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by either stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our model, we use the region proposals from one stream to help another stream. Here, the bounding boxes from RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.
The region proposals from the RPNs of the two streams are first used to train their own detection head separately in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage. The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes in another stream (e.g., flow) are used for the current stream (e.g., RGB).
As can be seen, with our approach, the diversity of region proposals from one stream will be enhanced by using the complementary boxes from another stream. This could help reduce the missing bounding boxes.
the detection results from one stream (e.g, the RGB stream) are used as the additional region proposals, which are combined with the region proposals from the other stream (e.g, the flow stream) to refine the region proposals and improve the action localization results.
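The proposal-level cooperation can be sketched as "NMS each subset, then merge", with a lower NMS threshold for the boxes borrowed from the other stream. The threshold values 0.7 and 0.5 and the names `nms` and `merge_proposals` below are placeholders I chose for illustration; the notes do not give the actual values.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (M, 4) array of boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, thresh):
    """Greedy NMS; returns indices of the kept boxes, highest score first."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep

def merge_proposals(own_boxes, own_scores, other_boxes, other_scores,
                    own_thresh=0.7, other_thresh=0.5):
    """Combine own-stream boxes with other-stream boxes (stricter NMS on the latter)."""
    own_keep = nms(own_boxes, own_scores, own_thresh)
    other_keep = nms(other_boxes, other_scores, other_thresh)
    return np.concatenate([own_boxes[own_keep], other_boxes[other_keep]], axis=0)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
merged = merge_proposals(boxes, scores, boxes, scores)
```

In this toy case the first two boxes overlap with IoU 81/119, so they both survive the looser threshold but only one survives the stricter one.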
Cross stream Feature Cooperation
Based on the refined region proposals, we also perform feature-level cooperation by first extracting RGB/flow features from these ROIs and refine these RGB/flow features via a message-passing module shown in Fig. 1(b), and Fig. 1(c),
The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage.
Most methods only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features from the final classifiers, which is insufficient for the two streams to exchange information with each other and benefit from that exchange. Based on this observation, we develop a message passing module to bridge these two streams, so that they help each other for feature refinement.
Denote $l$ as the index for the set of feature maps. The message-passing module improves the features $C^l_{RGB}$ with the help of the features $C^l_{Flow}$ as follows:
$$\hat{C}^l_{RGB} = C^l_{RGB} \oplus f_\theta(C^l_{Flow})$$
where $\oplus$ denotes the element-wise addition of the feature maps and $f_\theta$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta$ nonlinearly extracts the message from the feature $C^l_{Flow}$ and then uses the extracted message for improving the features $C^l_{RGB}$. The output of $f_\theta$ has to produce feature maps with the same number of channels and resolution as $C^l_{Flow}$ and $C^l_{RGB}$. To this end, we design our message-passing module by stacking two 1×1 convolutional layers with ReLU as the activation function. The first 1×1 convolutional layer reduces the channel dimension and the second restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform.
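A minimal NumPy sketch of such a message-passing module: two 1×1 convolutions (reduce, then restore the channel count) applied to the flow features, with the result added element-wise to the RGB features. Placing the ReLU only after the first convolution, the reduction to 2 channels and all names are my assumptions, since the notes only say that ReLU is the activation function.

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution: x is (C, H, W), w is (D, C), b is (D,)."""
    return np.einsum('dc,chw->dhw', w, x) + b[:, None, None]

def message_passing(c_rgb, c_flow, w1, b1, w2, b2):
    """Refine RGB features with a message extracted from the flow features."""
    hidden = np.maximum(conv1x1(c_flow, w1, b1), 0.0)  # reduce channels, then ReLU
    message = conv1x1(hidden, w2, b2)                  # restore original channel count
    return c_rgb + message                             # element-wise addition

rng = np.random.default_rng(0)
c_rgb = rng.standard_normal((8, 6, 6))
c_flow = rng.standard_normal((8, 6, 6))
w1, b1 = rng.standard_normal((2, 8)), np.zeros(2)   # 8 -> 2 channels
w2, b2 = rng.standard_normal((8, 2)), np.zeros(8)   # 2 -> 8 channels
refined = message_passing(c_rgb, c_flow, w1, b1, w2, b2)
```

With 8 channels reduced to 2, the two layers hold 2·8 + 8·2 = 32 weights instead of the 64 a single full 1×1 convolution would need, which is the parameter saving the bottleneck is meant to provide.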
propose a context contrasted local (CCL) model to obtain multi scale and multi-level context contrasted local features
Context Contrasted Local Feature
contributions:
A unified deep model for jointly learning feature extraction, a part deformation model, an occlusion model and classification.
operation in deep models by incorporating the deformation layer into the convolutional neural networks (CNN). With this layer, various deformation handling approaches can be applied to our deep model.
propose a novel multi-level generator for image enhancement, which fully utilizes the features from all resolutions and has different abstraction levels in the expansive path.
The generator inherits the structure of U-Net, which consists of a contracting path (encoder) and an expansive path (decoder). U-Net concatenates the same-scale features between the encoder and decoder, but this strategy tends to limit the image enhancement ability of the network because it does not consider the richness of the features. Therefore, a cross-scale feature aggregation (CFA) layer is used to fully exploit cross-scale feature learning for perceptual image enhancement, aggregating the information flow of different scales from the contracting path into the expansive path.
Specifically, our generator has N scales in both encoder and decoder, respectively. In the encoder, we gradually down-sample feature maps as:
where y_enc^(n-1) and y_enc^(n) are the feature maps of the (n-1)-th and n-th encoder unit (scale), respectively, C denotes the convolutional feature extraction, and the remaining operator is the down-sampling operation. Each unit of the encoder uses 3×3 convolution filters for feature extraction and a 3×3 convolution with stride 2 for down-sampling.
In the decoder, we gradually up-sample feature maps as:
CFA_n is the cross-scale feature aggregation operation in the proposed CFA layer of the n-th scale. The up-sampling operator is a deconvolution layer with 5×5 kernels. Next, we introduce the cross-scale features of the encoder into the decoder in the CFA layer as:
where Concat and R are the feature concatenation and refinement, respectively.
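The encoder and decoder described above can be sanity-checked by tracking only the spatial size of the feature maps. This is a sketch under assumptions: the padding of 1 for the 3×3 convolutions, and the stride 2, padding 2 and output padding 1 for the 5×5 deconvolution, are not given in the notes and are chosen so that each scale halves or doubles the resolution.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size: int, kernel: int, stride: int, padding: int,
               output_padding: int) -> int:
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def encoder_sizes(size: int, num_scales: int) -> list:
    """Spatial size at each encoder scale, starting from the input size."""
    sizes = [size]
    for _ in range(num_scales - 1):
        size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 feature extraction
        size = conv_out(size, kernel=3, stride=2, padding=1)  # 3x3 stride-2 down-sampling
        sizes.append(size)
    return sizes


def decoder_sizes(size: int, num_scales: int) -> list:
    """Spatial size at each decoder scale, starting from the coarsest scale."""
    sizes = [size]
    for _ in range(num_scales - 1):
        # 5x5 deconvolution; stride, padding and output_padding are assumed
        size = deconv_out(size, kernel=5, stride=2, padding=2, output_padding=1)
        sizes.append(size)
    return sizes


enc = encoder_sizes(256, num_scales=4)
dec = decoder_sizes(enc[-1], num_scales=4)
print(enc)  # [256, 128, 64, 32]
print(dec)  # [32, 64, 128, 256]
```

With these settings every decoder scale has a same-sized counterpart in the encoder; encoder features from other scales would need to be resized before they can be concatenated in a CFA layer.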
In this work we propose a technique that transfers supervision between images from different modalities.
We use learned representations from a large labeled modality as a supervisory signal for training representations for a new unlabeled paired modality.
This is somewhat similar to the earlier paper on learning depth maps.
We utilize different resolutions of feature maps in the network to detect object instances of various sizes.
Two types of semantic context, scene names of images and label map statistics of image patches, are exploited to create label hierarchies between the original classes and newly created subclasses, which serve as the learning supervision. Such subclasses show lower intra-class variation, and help the CNN detect more meaningful visual patterns and learn more effective deep features.
We demonstrate the effectiveness of learning deep feature representations in scene labeling by creating label hierarchies from semantic context as strong supervision.
Statistical features, such as histograms, Bag-of-Words (BoW) and Fisher Vectors, were commonly used with hand-crafted features in conventional classification methods, but have attracted less attention since deep learning methods became popular.
propose a learnable histogram layer, which learns histogram features within deep neural networks in end-to-end training. Such a layer is able to back-propagate (BP) errors, learn optimal bin centers and bin widths, and be jointly optimized with other layers in deep networks during training. [1]
Global semantic context has been shown to be highly effective in various classification problems, including semantic segmentation, object detection and pose estimation. The histogram is one of the most commonly used conventional statistical features for describing context. However, such statistical features have been little investigated by existing deep learning methods.
A learnable histogram layer calculates histogram features for a likelihood map or vector.
It is applied to the initial class likelihood map (stage-1 likelihood map) produced by the FCN-VGG to obtain the histogram features of the likelihood map of the whole image.
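One way to see why a histogram can be trained end to end is to replace hard binning with a piecewise-linear (triangular) vote, so that each bin response depends continuously on its center and width. The sketch below is an illustration under that assumption and is not necessarily the exact formulation of the paper; the function names, sample likelihoods and bin settings are made up.

```python
def soft_bin(x: float, center: float, width: float) -> float:
    """Triangular vote of value x for one bin: 1 at the center, 0 beyond one width."""
    return max(0.0, 1.0 - abs(x - center) / width)


def soft_histogram(values, centers, widths):
    """Average vote per bin over all values, e.g. all pixels of a likelihood map."""
    n = len(values)
    return [sum(soft_bin(v, c, w) for v in values) / n
            for c, w in zip(centers, widths)]


# Hypothetical likelihoods of one class at five pixel positions.
likelihoods = [0.05, 0.10, 0.50, 0.90, 0.95]
centers = [0.0, 0.5, 1.0]  # in a real layer these would be learned
widths = [0.5, 0.5, 0.5]   # in a real layer these would be learned

hist = soft_histogram(likelihoods, centers, widths)
print([round(h, 2) for h in hist])  # [0.34, 0.32, 0.34]
```

Because `soft_bin` is piecewise linear in `center` and `width`, gradients can flow to both, which is what allows bin centers and widths to be optimized jointly with the other layers.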
For each category, it is assumed that there are two feature sets: one carrying reliable information and the other coming from a less reliable source. We argue that the reliable set can guide the feature learning of the less reliable set during training, in the spirit of a student mimicking a teacher's behavior, thus pushing the features towards a more compact class centroid in the feature space.
It is well known that an object of lower resolution inevitably loses detailed information during the forward pass through the network, and that network performance drops significantly as object resolution decreases. More generally, visual samples may be less reliable due to low resolution, occlusion, adverse lighting, noise, blur, etc.
We thus regard objects of high resolution as the reliable set and objects of low resolution as the less reliable set. The learned features for samples from the reliable set are easier to classify than those from the less reliable one.
Specifically, an intertwiner is designed to minimize the distribution divergence between the two sets. The choice of how to generate an effective feature representation for the reliable set is further investigated, where we introduce optimal transport (OT) theory into the framework. The OT metric maps the comparison of two distributions in a high-dimensional feature space onto a lower-dimensional space, so that measuring the similarity between the two distributions becomes more sensible. For the feature intertwiner, OT is capable of forcing the less reliable set to be better aligned with the reliable set.
A network is divided into several levels based on the spatial size of the feature maps. For each level l, we split the set of region proposals into two categories: the large-region set, whose size is larger than the output size of the RoI-pooling layer, and the small-region set, whose size is smaller. These two sets correspond to the reliable and less reliable sets, respectively.
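The split can be sketched as follows. How "size" is measured is an assumption here: the proposal is projected onto the feature map of the level by dividing by the level's stride, and it counts as large only if both sides exceed the RoI-pooling output size. The stride of 8 and the 7×7 output size are illustrative values, not taken from the notes.

```python
def split_by_roi_size(proposals, stride: int, roi_output_size: int):
    """Split (x1, y1, x2, y2) image-space proposals into large and small sets."""
    large, small = [], []
    for x1, y1, x2, y2 in proposals:
        # Size of the proposal on the feature map of this level.
        w = (x2 - x1) / stride
        h = (y2 - y1) / stride
        if min(w, h) > roi_output_size:
            large.append((x1, y1, x2, y2))  # reliable set
        else:
            small.append((x1, y1, x2, y2))  # less reliable set
    return large, small


proposals = [(0, 0, 64, 64), (10, 10, 40, 50)]
large, small = split_by_roi_size(proposals, stride=8, roi_output_size=7)
print(large)  # [(0, 0, 64, 64)]
print(small)  # [(10, 10, 40, 50)]
```

A small proposal covers fewer feature-map cells than the pooled output has, so RoI pooling has to up-sample it, which is where detail is lost.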
Feature map Pl at level l is fed into the RoI layer and then passed on to a make-up layer. This layer is designed to restore the information lost during RoI pooling and to compensate the necessary details for instances of small resolution. The refined high-level semantics after this layer are robust to factors such as pose, lighting and appearance, despite sample variations. The make-up unit is learned and optimized via the intertwiner unit, with the aid of features from the large-object set, which is shown in the upstream (green) of Fig. 2.
The feature intertwiner is essentially a data distribution measurement that evaluates the divergence between two sets. For the reliable set, the input is directly the outcome of the RoI layer on the large-object feature maps Pm|l, which correspond to samples of higher level/resolution. For the less reliable set, the input is the output of the make-up layer. Both inputs are fed into a critic module, which extracts a further representation of the two sets and provides evidence for the intertwiner. The critic consists of two convolutions that transfer the features to a larger channel size and reduce the spatial size to one, leaving spatial information out of consideration. A simple l2 loss can be used to compare the difference between the two sets. The final loss is a combination of the standard detection losses and the intertwiner loss across all levels.
The goal of the feature intertwiner is to bring samples from the less reliable set close to samples of the same category from the reliable set. In one mini-batch, however, it often happens that samples from the less reliable set exist while samples of the same category from the reliable set do not (or vice versa). To address this problem, we use a buffer B to store the representative (prototype) for each category. Basically, the representative is the mean feature representation from large instances.
This ensures that the semantic features within one category are as similar as possible despite the visual appearance variation caused by resolution change.
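A minimal sketch of the buffer and the l2 comparison, assuming the prototype is a cumulative mean over all large-instance features seen so far (the notes do not say how the mean is maintained). `PrototypeBuffer` and `intertwiner_loss` are hypothetical names, and features are plain lists standing in for critic outputs.

```python
class PrototypeBuffer:
    """Stores, per category, the mean feature of large (reliable) instances."""

    def __init__(self):
        self._sum = {}
        self._count = {}

    def update(self, category, feature):
        if category not in self._sum:
            self._sum[category] = [0.0] * len(feature)
            self._count[category] = 0
        self._sum[category] = [s + f for s, f in zip(self._sum[category], feature)]
        self._count[category] += 1

    def get(self, category):
        """Return the prototype, or None if no large instance was seen yet."""
        if category not in self._count:
            return None
        n = self._count[category]
        return [s / n for s in self._sum[category]]


def l2_loss(feature, prototype):
    """Squared l2 distance between a feature and its class prototype."""
    return sum((f - p) ** 2 for f, p in zip(feature, prototype))


def intertwiner_loss(small_samples, buffer):
    """Mean l2 loss of small-instance features against their class prototypes."""
    losses = []
    for category, feature in small_samples:
        prototype = buffer.get(category)
        if prototype is not None:  # skip categories without a prototype
            losses.append(l2_loss(feature, prototype))
    return sum(losses) / len(losses) if losses else 0.0


buffer = PrototypeBuffer()
buffer.update("car", [1.0, 2.0])
buffer.update("car", [3.0, 4.0])
small = [("car", [2.0, 2.0]), ("bus", [0.0, 0.0])]
print(buffer.get("car"))                # [2.0, 3.0]
print(intertwiner_loss(small, buffer))  # 1.0
```

Because the prototype lives in the buffer, a small "car" instance can still be pulled towards the large-instance mean even when the current mini-batch contains no large "car".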
Deformable deep convolutional neural networks
contributions:
A new deep learning framework for object detection. It effectively integrates feature representation learning, part deformation learning, context modeling, model averaging, and bounding box location refinement into the detection system.
A new scheme for pretraining the deep CNN model.
A new deformation constrained pooling (def-pooling) layer, which enriches the deep model by learning the deformation of object parts at any information abstraction level.
Generating a scene graph to describe the object interactions inside an image has gained increasing interest in recent years.
Thus, a natural idea is to construct a shared representation for the phrase features of similar regions in the early stage. Then the shared representation is refined to learn a general representation of the area by passing messages from the connected objects. In the final stage, we can extract the required information from this shared representation to predict object relations by combining with different
propose a subgraph-based scene graph generation approach, where the object pairs referring to similar interacting regions are clustered into a subgraph and share the phrase representation (termed subgraph features).
In this pipeline, all the feature refining processes are done on the shared subgraph features. This design significantly reduces the number of phrase features in the intermediate stage and speeds up the model in both training and inference.
A bottom-up clustering method is proposed to factorize the image into subgraphs. By sharing the region representations within a subgraph, our method can significantly reduce redundant computation and accelerate inference. In addition, having fewer representations allows us to use 2-D feature maps to maintain the spatial information of the subgraph regions.
Framework
The entire process can be summarized as the following steps:
(1) generate object region proposals with Region Proposal Network (RPN);
A Region Proposal Network is adopted to generate object proposals. It shares the base convolution layers with our proposed F-Net. An auxiliary convolution layer is added after the shared layers. The anchors are generated by clustering the scales and ratios of ground truth bounding boxes in the training set [35].
(2) group the object proposals into pairs and establish the fully-connected graph, where every two objects have two directed edges to indicate their relations;
As every two objects possibly have two relationships in opposite directions, we connect them with two directed edges (termed phrases). A fully-connected graph is established, where every edge corresponds to a potential relationship (or background). Thus, N object proposals will have N(N − 1) candidate relations (yellow circles in Fig. 2 (2)). Empirically, more object proposals bring higher recall, making it more likely to detect the objects within the image and to generate a more complete scene graph. However, a large number of candidate relations may slow down model inference. Therefore, we design an effective representation of all these relationships in the intermediate stage so that more object proposals can be used.
(3) cluster the fully-connected graph into several subgraphs and share the subgraph features for object pairs within the subgraph, then a factorized connection graph is obtained by treating each subgraph as a node;
By observing that many relations refer to overlapping regions (Fig. 1), we share the representations of the phrase region to reduce the number of intermediate phrase representations as well as the computation cost. Each candidate relation corresponds to the union box of its two objects (the minimum box containing the two boxes). We define its confidence score as the product of the scores of the two object proposals. With confidence scores and bounding box locations, non-maximum suppression can be applied to reduce the number of similar boxes, keeping the bounding box with the highest score as the representative. The merged parts then compose a subgraph and share a unified representation that describes their interactions. Consequently, we get a subgraph-based representation of the fully-connected graph: every subgraph contains several objects; every object belongs to several subgraphs; every candidate relation refers to one subgraph and two objects.
(4) ROI pools the object and subgraph features and transforms them into feature vectors and 2-D feature maps respectively;
After the clustering, we have two sets of proposals: objects and subgraphs. Then ROI-pooling is used to generate the corresponding features. Different from prior methods, which use feature vectors to represent the phrase features, we adopt 2-D feature maps to maintain the spatial information within the subgraph regions. (In conventional approaches the features are converted into feature vectors, whereas here 2-D feature maps are used, which preserves the spatial information.) As the subgraph feature is shared by several predicate inferences, a 2-D feature map can learn a more general representation of the region, and its inherent spatial structure can help to identify the subject/object and their relations, especially the spatial relations. We continue to use feature vectors to represent the objects. Thus, after the pooling, 2-D convolution layers and fully-connected layers are used to transform the subgraph features and object features, respectively.
(5) jointly refine the object and subgraph features by passing message along the subgraph-based connection graph for better representations;
A spatial-weighted message passing (SMP) structure is proposed to pass messages between object feature vectors and subgraph feature maps, and to exploit the spatial correspondence between the objects and the subgraph region.
Feature Refining with Spatial-weighted Message Passing
As object and subgraph features involve different semantic levels, where objects concentrate on the details and subgraphs focus on their interactions, passing messages between them can help to learn better representations by leveraging their complementary information. Thus, we design a spatial-weighted message passing (SMP) structure to pass messages between object feature vectors and subgraph feature maps (left part of Fig. 3). Message passing from objects to subgraphs and from subgraphs to objects are two parallel processes.
(6) recognize the object categories with object features and their relations (predicates) by fusing the subgraph features and object feature pairs.
The Spatial-sensitive Relation Inference (SRI) module is designed to use the features from the subject, object and subgraph representations to recognize the relationship between objects. It fuses object feature pairs and subgraph features for the final relationship inference.
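Steps (2) and (3) above can be sketched in a few lines: enumerate the N(N − 1) directed pairs, give each a union box and a score equal to the product of the two object scores, then run a greedy NMS in which every suppressed relation joins the subgraph of the box that suppressed it. The IoU threshold of 0.5 and the function names are assumptions for illustration, not values from the paper.

```python
from itertools import permutations


def union_box(a, b):
    """Minimum (x1, y1, x2, y2) box containing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_subgraphs(boxes, scores, iou_threshold=0.5):
    # Step (2): N(N - 1) directed candidate relations (subject, object).
    relations = [
        {"pair": (s, o),
         "box": union_box(boxes[s], boxes[o]),
         "score": scores[s] * scores[o]}
        for s, o in permutations(range(len(boxes)), 2)
    ]
    # Step (3): greedy NMS on the union boxes, highest score first.
    relations.sort(key=lambda r: r["score"], reverse=True)
    subgraphs = []
    for rel in relations:
        for sg in subgraphs:
            if iou(rel["box"], sg["box"]) >= iou_threshold:
                sg["pairs"].append(rel["pair"])  # share this subgraph's features
                break
        else:
            # No kept box overlaps enough: this relation starts a new subgraph.
            subgraphs.append({"box": rel["box"], "pairs": [rel["pair"]]})
    return relations, subgraphs


boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 110, 110)]
scores = [0.9, 0.8, 0.7]
relations, subgraphs = build_subgraphs(boxes, scores)
print(len(relations), len(subgraphs))  # 6 2
```

In this toy example six candidate relations collapse into two subgraphs, so only two phrase features have to be pooled and refined instead of six.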
B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
S. Cai, W. Zuo, and L. Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In ICCV, 2017.
Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In CVPR, 2017.
T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.
Q. Wang, P. Li, and L. Zhang. G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition. In CVPR, 2017.
E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.