Abstract. Image segmentation needs both local boundary position information and global object context information. The performance of the recent state-of-the-art approach, fully convolutional networks, has reached a bottleneck because such networks must balance the two types of information simultaneously in an end-to-end training regime. To overcome this problem, we divide semantic image segmentation into temporal subtasks: first, find a possible pixel position on some object boundary; then trace the boundary in steps of limited length until the whole object is outlined. We present the first deep reinforcement learning approach to semantic image segmentation, called DeepOutline, which outperforms other algorithms on the COCO detection leaderboard in the middle and large size person category on the COCO val2017 dataset. It also offers insight into applying a divide-and-conquer strategy, via reinforcement learning, to computer vision problems.
Keywords: Image Segmentation, Deep Reinforcement Learning, DeltaNet, DeepOutline
Fig. 1. Example outputs of the deep reinforcement learning agent, DeepOutline, which sequentially generates the polygon vertices outlining each recognizable object in the image. The green polylines show the final results of our outline agent for the person category, while the red lines show errors. The numbers in the image are the indices of the vertices in generation order, starting from 1.
1 Introduction
Semantic image segmentation is a crucial and fundamental capability for human beings and higher animals, such as mammals, birds, and reptiles, to survive [1]. It plays an irreplaceable role in the visual system, localizing objects and distinguishing boundaries at the pixel level. We differentiate objects from each other constantly in everyday life and consider it an easy and intuitive task. For computer vision, however, semantic image segmentation remains a challenging and not fully solved problem.
With the breakthrough of deep learning [2] in image recognition [3], many researchers began to apply it in their own studies [4,5,6], bringing significant progress to the entire computer vision and machine learning community. However, in semantic image segmentation and other dense prediction tasks, the improvement has not matched that seen in image classification or object tracking. Dense prediction needs both a global abstraction of the object and local boundary information, whereas the Deep Convolutional Neural Network (DCNN) is successful at extracting features of different classes of objects but loses the local spatial information of where the border between an object and its background lies.
Much work has been done with end-to-end dense predictors that directly predict pixel labels through a score map for each class. But it is challenging to combine the two diverse goals in one end-to-end deep neural network: local spatial information is lost in the layers near the output label layer, especially in extreme cases such as different objects with similar texture, color, and brightness. Even when humans segment such objects in an image, they need to keep the object classes in mind and examine the adjoining parts between objects to find and trace the boundary gradually. Some researchers turn to traditional computer vision methods, such as Conditional Random Fields (CRFs) or hand-crafted features, to overcome this problem, but these require more sophisticated design than deep learning.
This dilemma stems from the need to combine spatial information and global context over time. To address it, we develop an end-to-end deep reinforcement learning network, called DeepOutline, that divides the segmentation task into subtasks (Figure 1 shows some successful results of DeepOutline). Some of the questions we try to answer in this work are: Is it possible to imitate how humans segment an image by outlining the semantic objects one by one? If we divide the segmentation task into steps, given that finding a start point and tracing the boundary are quite different operations, can they share the same network? It is easy to represent discrete actions, but how can we represent continuous outline positions in the image? Along with answering these questions, the contributions of this paper are threefold:
Admittedly, applying CRF techniques after or within the actions could improve the final result. But we are interested in how reinforcement learning can be applied to segmentation-like tasks, so in this paper we prefer actions that are as simple as possible over pursuing a leaderboard position.
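On the last question above, one natural option is to represent a continuous position as a distribution over pixels, which keeps the action space discrete while still covering every image location. The following is a minimal PyTorch sketch of that idea only; the names PositionHead and pick_position are ours and hypothetical, not part of DeepOutline.

```python
import torch
import torch.nn as nn

class PositionHead(nn.Module):
    """Hypothetical sketch: score every pixel, then normalize with a
    2-D softmax so an outline position can be read off as a coordinate."""
    def __init__(self, in_channels):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features):            # features: (B, C, H, W)
        logits = self.score(features)       # (B, 1, H, W)
        b, _, h, w = logits.shape
        probs = torch.softmax(logits.view(b, -1), dim=1)
        return probs.view(b, h, w)          # per-image map summing to 1

def pick_position(prob_map):
    """Greedy action: (row, col) of the most probable pixel in one map."""
    _, w = prob_map.shape
    idx = int(prob_map.reshape(-1).argmax())
    return divmod(idx, w)
```

In such a scheme the same output map can serve both the "find a start point" and the "trace the boundary" operations, which is one way the two kinds of steps could share a network.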
2 Related work
2.1 FCN related work
First, much of the current literature pays particular attention to fully convolutional DCNNs [7]. When an image is input to a neural network, both local boundary and global context information flow through the network. Because local information becomes vague under downsampling, it is intuitive to use layers with little or no downsampling to pass local information directly to the final dense prediction output. Many studies take this approach, from U-Net [8], which keeps the more precise position information available before each downsampling pooling layer and combines it with the more global context information passed back from higher layers by upsampling, to the recent fully convolutional DenseNets [9], which use skip connections between dense blocks [10]. Deconvolution layers [7] and unpooling [11] try to retain local position information by, respectively, transposed convolution and upsampling with the stored pooling indices to restore the pooling context. Hypercolumns [12] were proposed to represent a pixel by the deep features whose receptive fields include it. Other approaches to improving image segmentation performance have also been explored, such as Convolutional Feature Masking [13], Multi-task Network Cascades [14], Polygon-RNN [15], and SDS [16].
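To make the shared idea behind these designs concrete, here is a minimal PyTorch sketch of a U-Net-style skip connection. It illustrates the general pattern only, not any of the cited architectures: an encoder map saved before pooling is concatenated with the upsampled decoder map, so local boundary detail rejoins global context.

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """Generic skip-connection block: upsample the decoder features by 2x
    with a transposed convolution, concatenate the same-resolution encoder
    features saved before pooling, and fuse with a 3x3 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, decoder_feat, encoder_feat):
        x = self.up(decoder_feat)                # restore spatial resolution
        x = torch.cat([x, encoder_feat], dim=1)  # skip connection
        return torch.relu(self.fuse(x))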
To improve boundary accuracy, another line of work applies post-processing after DCNN labeling, including CRFs or even more sophisticated hand-crafted features. CRFs have mainly been used in semantic segmentation to combine multi-way classifier scores with the low-level information captured by the local interactions of pixels and edges [17,18] or superpixels [19]. Chen et al. [20] combined the responses at the final DCNN layer with a fully connected CRF [21] to further improve localization performance.
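As an illustration of this kind of post-processing, the sketch below uses the third-party pydensecrf package; the kernel and compatibility parameters are illustrative placeholders, not values taken from the cited papers.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(softmax_probs, image, iters=5):
    """Refine per-pixel class scores with a fully connected CRF.
    softmax_probs: (n_classes, H, W) float array of DCNN softmax outputs.
    image: (H, W, 3) uint8 RGB image. Returns an (H, W) label map.
    """
    n_classes, h, w = softmax_probs.shape
    d = dcrf.DenseCRF2D(w, h, n_classes)
    d.setUnaryEnergy(unary_from_softmax(softmax_probs))
    # Smoothness kernel (positions only) and appearance kernel
    # (positions + color); bandwidth and compat values are illustrative.
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(iters)
    return np.argmax(np.array(q).reshape(n_classes, h, w), axis=0)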
2.2 Deep reinforcement learning
Not long after DeepMind's first success in using Deep Reinforcement Learning (DRL) to reach human-level skill at playing video games [5] in 2015, DRL made further major advances, such as defeating the best human Go players [22], and spread quickly to other fields, including visual navigation [23], video face recognition [24], visual tracking [25], active object localization [26], image captioning [27], joint active object localization [28], semantic parsing of large-scale 3D point clouds [29], understanding game policies at runtime (visually grounded dialog) [30], and attention-aware face hallucination [31]. Existing DRL work related to image segmentation pays little attention to understanding the actions and the temporal correlations between them. In this work, we instead explore a divide-and-conquer strategy, in order to observe the characteristics of each atomic action humans perform when segmenting an image by eye, and the temporal correlations among those actions.
Fig. 2. A Delta-net structure of 2 layers. It can easily be stacked with more layers. Note that the first input layer (bottom left) of a stacked Delta-net does not have pooling, and the last layer ignores the lower-resolution input (top right).
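The caption alone does not pin down the Delta-net's internals, so the following PyTorch sketch is a speculative reading of Fig. 2 and nothing more: each layer fuses a same-resolution stream with a lower-resolution stream, and the pool_input and use_low_res flags encode the two caption notes (no pooling in the first input layer; no lower-resolution input to the last layer). All names, and the fusion operator, are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaLayer(nn.Module):
    """Speculative sketch of one Delta-net layer, inferred from the Fig. 2
    caption only: fuse a same-resolution stream with an upsampled
    lower-resolution stream, then convolve."""
    def __init__(self, channels, pool_input=True, use_low_res=True):
        super().__init__()
        self.pool_input = pool_input    # False for the first input layer
        self.use_low_res = use_low_res  # False for the last layer
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, low_res=None):
        if self.pool_input:
            x = F.max_pool2d(x, kernel_size=2)
        if self.use_low_res and low_res is not None:
            # Fusion by element-wise addition is our assumption.
            x = x + F.interpolate(low_res, size=x.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return torch.relu(self.conv(x))
```

Under this reading, a stack would construct its bottom-left input layer with pool_input=False and its top layer with use_low_res=False, matching the two exceptions noted in the caption.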
Fig. 3. DeepOutline network structure and data flow chart.
Conclusions
In this paper, we presented the first deep reinforcement learning method for image segmentation. We designed DeepOutline with simple actions that outline the semantic objects one by one, and it outperforms other algorithms on the COCO detection leaderboard in the middle and large size person category on the COCO val2017 dataset. We discussed how the network behaves under different state maps and the causes of the three types of errors. We also argued that the divide-and-conquer paradigm is a promising approach to dense prediction problems.
Fig. 4. Evolution of the position map during training. (a) is the original image and (b) is the first state map of the pen-up state, generated from the training segmentation data; the second state map is all zeros and is not shown here. (c) to (f) are maps of the first pen-down position of the second polygon at training rounds 20k, 40k, 60k, and 100k, respectively.