Most contemporary state-of-the-art object detectors follow a two-stage pipeline: they first obtain bottom-up region proposals, e.g., with an RPN, and then classify each proposal and refine its localization.
Several active, sequential search methods for object detection have emerged. The characteristics of this work include: (1) the proposed active search process is image- and category-dependent, controlling where to look next and when to stop; (2) it aggregates context information; (3) the deep RL-based optimization links the detector network and the search policy network so that they can be jointly estimated; (4) it allows exploration-accuracy trade-offs; (5) the learning process is only weakly supervised.
This work combines an RL-based top-down search strategy with a sequential region proposal network, implemented as a Conv-GRU on top of a two-stage bottom-up object detector.
The inputs to the policy module include the image feature map and the history volumes described below (e.g., $\mathbf{V}_{4}^{t}$ and $R^{t}$).
The output of the policy module is a two-channel action volume $\mathbf{A}_{t}\in \Re^{25\times 25\times 2}$, whose two channels correspond to the fixate action $a_{t}^{f}$ and the done action $a_{t}^{d}$, respectively.
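A minimal sketch of how such a two-channel action head could look, assuming a PyTorch-style convolutional head on top of the policy module's hidden state (the hidden-channel count and layer names are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Maps the policy module's hidden state to a 25x25x2 action volume.

    Channel 0 is the fixate layer a_t^f, channel 1 the done layer a_t^d.
    """
    def __init__(self, hidden_channels=256):  # hidden size is an assumption
        super().__init__()
        self.head = nn.Conv2d(hidden_channels, 2, kernel_size=1)

    def forward(self, hidden_state):              # hidden_state: (B, C, 25, 25)
        action_volume = self.head(hidden_state)   # (B, 2, 25, 25)
        fixate_layer = action_volume[:, 0]        # a_t^f: (B, 25, 25)
        done_layer = action_volume[:, 1]          # a_t^d: (B, 25, 25)
        return fixate_layer, done_layer
```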
Details of the procedure:
At a chosen fixation point, the corresponding RoIs are fed into the class-specific predictor, and local class-specific NMS is applied to keep the most salient object regions, yielding the currently predicted bounding boxes. The spatial locations of these boxes are mapped into $V_{4}^{0}$; $V_{4}^{t}$ accumulates these values over time and averages them, characterizing the regions that the policy module has already observed.
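A small numpy sketch of this history map, assuming predicted boxes have already been mapped to the $25\times 25$ grid of $V_{4}$; treating the accumulation as a running average is an assumption consistent with the text:

```python
import numpy as np

class ObservedRegionHistory:
    """Running-average map V_4^t of regions already covered by predicted boxes."""
    def __init__(self, size=25):
        self.v4 = np.zeros((size, size), dtype=np.float32)  # V_4^0
        self.t = 0

    def update(self, boxes_on_grid):
        """boxes_on_grid: list of (x1, y1, x2, y2) in grid coordinates."""
        current = np.zeros_like(self.v4)
        for x1, y1, x2, y2 in boxes_on_grid:
            current[y1:y2 + 1, x1:x2 + 1] = 1.0   # mark the observed region
        self.t += 1
        self.v4 += (current - self.v4) / self.t   # running average over steps
        return self.v4
```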
From $\mathbf{A}_{t}$, the done layer $\mathbf{a}_{t}^{d}\in \Re^{25\times 25}$ is flattened into a vector $\mathbf{d}_{t}\in \Re^{625}$, and a linear classifier on $\mathbf{d}_{t}$ determines the probability of terminating the search.
The probability of fixating at location $z_{t}=(i,j)$ is computed from the fixate layer $\mathbf{a}_{t}^{f}$ (a plausible form is sketched below).
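A hedged reconstruction of the two action distributions, assuming a sigmoid on the linear classifier output for termination and a standard spatial softmax over the fixate layer (the exact parameterization in the paper may differ):

$$
p_{t}^{d} = \sigma\left(\mathbf{w}^{\top}\mathbf{d}_{t} + b\right),
\qquad
\pi\left(z_{t}=(i,j)\right)=\frac{\exp\left(a_{t}^{f}(i,j)\right)}{\sum_{i'=1}^{25}\sum_{j'=1}^{25}\exp\left(a_{t}^{f}(i',j')\right)}
$$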
drl-RPN maintains an "RoI observation volume" $R^{t}\in \{0,1\}^{h\times w\times k}$, initialized to all zeros. If and only if the proposal at a given position is selected and passed to the RoI pooling and classification stages, all positions covered by that box are set to 1, indicating that the RoI has already been chosen.
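A numpy sketch of this update, assuming each of the $k$ channels tracks one anchor shape and that selected proposals are given as boxes on the $h\times w$ feature grid (the per-anchor indexing is an assumption):

```python
import numpy as np

def init_roi_observation_volume(h, w, k):
    """R^0: all-zero binary volume, one channel per anchor shape."""
    return np.zeros((h, w, k), dtype=np.uint8)

def mark_selected_roi(R, box_on_grid, anchor_idx):
    """Set every position covered by a selected proposal to 1.

    box_on_grid: (x1, y1, x2, y2) of the selected proposal on the feature grid.
    anchor_idx:  index of the anchor shape that generated the proposal.
    """
    x1, y1, x2, y2 = box_on_grid
    R[y1:y2 + 1, x1:x2 + 1, anchor_idx] = 1   # this RoI has now been selected
    return R
```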
Separating the rewards
drl-RPN can be viewed as containing two agents, agent_d (done) and agent_f (fixate). The paper points out that agent_d's reward does not have to depend on agent_f's performance. In particular, early in training an action may yield a negative reward for both agents; in that case only agent_f should be penalized, not agent_d, since agent_d made the correct decision to keep searching (see the sketch below).
For the fixate action, this work defines the reward computation as follows:
For the done action, two reward computations are defined, and the larger of the two values is used.
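As a hedged illustration of the reward-separation idea above (the function, the flags, and the base reward values are hypothetical, not the paper's formulas):

```python
def split_rewards(fixation_improved_detection, continue_was_correct,
                  penalty=-1.0, bonus=1.0):
    """Illustrative separation of rewards between agent_f (fixate) and agent_d (done).

    agent_f is judged by the outcome of the fixation it picked, while agent_d
    is judged only on whether continuing (or stopping) was the right decision,
    independently of how good agent_f's fixation turned out to be.
    """
    reward_f = bonus if fixation_improved_detection else penalty
    reward_d = bonus if continue_was_correct else penalty
    return reward_f, reward_d
```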
Possible directions for reward design: (1) for intermediate exploration steps, one could require $IoU_{t}^{i}>IoU_{i}\geq \tau$, i.e., impose an accuracy requirement on intermediate steps as well; (2) in the initial stage both the actor and the critic networks are inaccurate, so how can the training process be made more stable?
This work formulates automatic image cropping as a sequential decision-making process and proposes an aesthetics-aware reinforcement learning model for the weakly supervised image cropping problem.
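A sketch of what such a sequential cropping loop could look like, under the assumption that actions adjust the crop window and the per-step reward is the change in an aesthetics score (the actual action set and reward in the paper may differ):

```python
# Hypothetical action set: each action shifts or scales the crop window.
ACTIONS = ["left", "right", "up", "down", "shrink", "enlarge", "terminate"]

def adjust_window(window, action, step=0.05):
    """Apply one crop-window adjustment; window = (x, y, w, h) in relative coordinates."""
    x, y, w, h = window
    if action == "left":      x -= step
    elif action == "right":   x += step
    elif action == "up":      y -= step
    elif action == "down":    y += step
    elif action == "shrink":  w, h = w * (1 - step), h * (1 - step)
    elif action == "enlarge": w, h = w * (1 + step), h * (1 + step)
    w, h = min(w, 1.0), min(h, 1.0)          # keep the window inside the image
    x = min(max(x, 0.0), 1.0 - w)
    y = min(max(y, 0.0), 1.0 - h)
    return (x, y, w, h)

def crop_episode(policy, aesthetics_score, image, max_steps=20):
    """Sequential cropping: the per-step reward is the aesthetics improvement."""
    window = (0.0, 0.0, 1.0, 1.0)            # start from the full image
    prev_score = aesthetics_score(image, window)
    rewards = []
    for _ in range(max_steps):
        action = policy(image, window)        # sampled from the learned policy
        if action == "terminate":
            break
        window = adjust_window(window, action)
        score = aesthetics_score(image, window)
        rewards.append(score - prev_score)    # weak, aesthetics-aware supervision
        prev_score = score
    return window, rewards
```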
A basic issue in computer vision is extracting effective descriptors, i.e., descriptors with strong discriminative power and low computational cost. This work aims to learn deep binary descriptors in a directed acyclic graph. The proposed GraphBit method represents bitwise interactions as edges between the nodes (bits).
The interaction information is exploited to handle uncertain bits (nodes), with an RL agent adding or removing connections, as sketched below.
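A rough sketch of this idea, assuming bits form a DAG under a fixed ordering, "uncertain" means an activation probability near 0.5, and RL actions add or remove directed edges (the uncertainty measure and action parameterization are illustrative):

```python
import numpy as np

class BitGraph:
    """Directed acyclic graph over descriptor bits; edges model bitwise interactions."""
    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.adj = np.zeros((num_bits, num_bits), dtype=bool)  # adj[i, j]: edge i -> j

    def uncertain_bits(self, bit_probs, margin=0.1):
        """Bits whose activation probability is close to 0.5 are treated as uncertain."""
        return np.where(np.abs(bit_probs - 0.5) < margin)[0]

    def apply_action(self, src, dst, add):
        """RL action: add (True) or remove (False) the directed edge src -> dst."""
        if src != dst and (not add or src < dst):   # src < dst keeps the graph acyclic
            self.adj[src, dst] = add
        return self.adj
```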
To tackle image restoration, this work prepares a toolbox consisting of small-scale convolutional networks of different complexities, each specialized for a different task, e.g., deblurring, denoising, or de-JPEG. It adopts deep Q-learning to learn a policy for selecting appropriate tools from the toolbox, so as to progressively improve the quality of a corrupted image.
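A minimal PyTorch-style sketch of greedy tool selection with a learned Q-network, assuming the toolbox is a list of small restoration CNNs and that an extra action index means "stop" (the network interfaces and stopping convention are assumptions):

```python
import torch

def restore_with_toolbox(image, toolbox, q_net, max_steps=3):
    """Progressively restore `image` by greedily picking tools with a Q-network.

    image:   tensor of shape (1, 3, H, W)
    toolbox: list of small CNNs (e.g., deblur, denoise, de-JPEG), each image -> image
    q_net:   maps the current image to Q-values over len(toolbox) + 1 actions,
             where the last action means "stop".
    """
    stop_action = len(toolbox)
    with torch.no_grad():
        for _ in range(max_steps):
            q_values = q_net(image)             # shape: (1, len(toolbox) + 1)
            action = int(q_values.argmax(dim=1))
            if action == stop_action:
                break
            image = toolbox[action](image)      # apply the selected tool
    return image
```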
With the increasing demand for photo retouching, automatic image color enhancement has become a valuable research problem. The task is non-trivial: first, translating pixel values to their optimal states is a complex function of each pixel's value and of global/local color and contextual information; second, as a perceptual process, different individuals may prefer different optimal strategies.
This work proposes deep reinforcement learning for color enhancement. The problem is cast as a Markov decision process in which each step's action is a global color adjustment operation, e.g., a brightness, contrast, or white-balance change. In this way, the approach explicitly models the iterative, step-by-step human retouching process.
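A sketch of the kind of global adjustment actions such an MDP could use, with simple numpy image operations standing in for the paper's retouching operations (the concrete action set and step sizes are assumptions):

```python
import numpy as np

def adjust_brightness(img, delta):       # img: float array in [0, 1], shape (H, W, 3)
    return np.clip(img + delta, 0.0, 1.0)

def adjust_contrast(img, factor):
    return np.clip((img - 0.5) * factor + 0.5, 0.0, 1.0)

def adjust_white_balance(img, gains):    # per-channel gains, e.g., (1.05, 1.0, 0.95)
    return np.clip(img * np.asarray(gains), 0.0, 1.0)

# Discrete global actions, applied one per MDP step.
ACTIONS = [
    lambda img: adjust_brightness(img,  0.05),
    lambda img: adjust_brightness(img, -0.05),
    lambda img: adjust_contrast(img, 1.1),
    lambda img: adjust_contrast(img, 0.9),
    lambda img: adjust_white_balance(img, (1.05, 1.0, 0.95)),
    lambda img: adjust_white_balance(img, (0.95, 1.0, 1.05)),
]

def enhance(img, policy, max_steps=10):
    """Iteratively apply globally-acting retouching operations chosen by the policy."""
    for _ in range(max_steps):
        action_idx = policy(img)          # learned policy; returns None to stop
        if action_idx is None:
            break
        img = ACTIONS[action_idx](img)
    return img
```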
Discarding expensive paired datasets, this work proposes a 'distort-and-recover' scheme to ease training: high-quality reference images are randomly distorted, and the agent is then required to recover them. This 'distort-and-recover' scheme is well suited to the DRL approach.
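A short sketch of the distort-and-recover idea, assuming the distortions are drawn from a pool of global adjustment operations (such as those sketched above) and the per-step reward is the gain in similarity to the reference; the distortion choices and the MSE-based similarity are assumptions:

```python
import random
import numpy as np

def random_distort(reference, operations, num_ops=3):
    """Create a training input by randomly distorting a high-quality reference image.

    operations: list of image -> image functions (e.g., global color adjustments).
    """
    img = reference.copy()
    for _ in range(num_ops):
        img = random.choice(operations)(img)
    return img

def recovery_reward(img, next_img, reference):
    """Per-step reward: how much closer this step brings the image to the reference."""
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    return mse(img, reference) - mse(next_img, reference)
```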