2016 IEEE SPL paper; currently ranked second on the Middlebury benchmark.
To address the unreliability of window-based matching in disparity-discontinuity regions: one approach is to make the window-based method adapt to its input patterns [10], [11], [12] — the shape of the matching template is made adaptive so that it can discard information from pixels irrelevant to the target pixel.
However, knowing which pixels are background before the actual matching is difficult.
Existing methods are based on AlexNet or VGG, architectures designed for recognition rather than matching. The difficulty with such CNNs lies in enlarging the patch size.
The effective patch size is directly tied to the spatial extent of the receptive field, which can be enlarged by:
1) including a few strided pooling/convolution layers;
2) using larger convolution kernels in each layer;
3) stacking more layers.
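The three options can be compared with simple receptive-field arithmetic. A minimal sketch — the layer configurations below are illustrative assumptions, not the paper's actual architectures:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output.
    Returns the 1-D receptive-field size of one output unit."""
    rf, jump = 1, 1  # jump = spacing between adjacent units, in input pixels
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 3))                       # baseline, three 3x3 convs -> 7
print(receptive_field([(3, 1), (2, 2), (3, 1), (3, 1)]))   # option 1: insert a stride-2 pool -> 12
print(receptive_field([(5, 1)] * 3))                       # option 2: larger 5x5 kernels -> 13
print(receptive_field([(3, 1)] * 6))                       # option 3: more layers -> 13
```

All three enlarge the receptive field, but only option 1 changes the output resolution — which is exactly the downsampling problem discussed next.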
However, strided pooling or convolution layers downsample the result and discard detail. Although the resolution can be recovered by fractional-strided convolution [17], small or thin structures are hard to reconstruct once they are lost to downsampling.
Learning the matching cost: [13], [14], [22].
[13] MC-CNN: an 11×11 window with no pooling; the resulting cost is rather noisy, so cross-based cost aggregation + SGM are applied afterwards.
[14] "Learning to compare image patches": uses multiple pooling layers and spatial pyramid pooling (SPP) [24] to process larger patches.
This, however, introduces a fattening effect, caused by the information lost in pooling.
Contribution of this paper: a new pooling scheme that covers a larger receptive field without losing detail.
Similar attempts already exist in semantic segmentation [25], [26], [27]: these methods combine high-level and low-level information so that object-level cues can be localized at pixel level.
They achieve good results on large objects but fail on small ones.
FlowNet [28] upsamples low-resolution flow back to the original size.
The closest work to this paper is [24] (Kaiming He's SPP).
In SPP, the pooling layers between convolutions are removed; pooling is instead applied to the concatenated outputs of several convolution layers, so that high-level and mid-level information is used to compute a highly non-linear feature map.
Although [14] also uses SPP, it keeps pooling layers sandwiched between convolutions, so information is still lost.
Input: two patches
Output: matching cost
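The interface can be sketched as follows. Note this is only an illustration of the input/output contract: `extract_features` is a hypothetical stand-in for the CNN feature extractor, and cosine distance stands in for the learned decision layers:

```python
import numpy as np

def extract_features(patch):
    # Hypothetical stand-in for the CNN feature extractor:
    # an L2-normalized flattened patch.
    v = patch.ravel().astype(np.float64)
    return v / (np.linalg.norm(v) + 1e-8)

def matching_cost(left_patch, right_patch):
    """Scalar cost for a patch pair; lower = better match.
    Here: 1 - cosine similarity (an illustrative choice, not the paper's)."""
    return 1.0 - float(extract_features(left_patch) @ extract_features(right_patch))

left = np.random.rand(37, 37)     # 37x37 matches the patch size used below
print(matching_cost(left, left))  # near 0 for identical patches
```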
Pooling's role is well known: it shrinks the map size exponentially; the drawback is that detail is lost while the receptive field grows.
Instead, a large pooling window replaces a small strided one, reaching the same receptive field.
Pooling is performed with several different window sizes, and the outputs are concatenated into new feature maps.
Note that this multi-scale pooling operation is applied at every pixel, i.e. with stride = 1!
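A minimal NumPy sketch of this multi-scale, stride-1 pooling; the window sizes and the edge-padding scheme are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def pool_stride1(x, k):
    """Max pooling with a k x k window and stride 1; output has the same
    spatial size as the input. x: (H, W, C) feature map."""
    p = k // 2
    xp = np.pad(x, ((p, k - 1 - p), (p, k - 1 - p), (0, 0)), mode="edge")
    H, W, _ = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = xp[i:i + k, j:j + k].max(axis=(0, 1))
    return out

def multi_scale_pool(x, sizes=(1, 3, 5)):
    """Pool at several window sizes (all stride 1) and concatenate along channels."""
    return np.concatenate([pool_stride1(x, k) for k in sizes], axis=2)

x = np.random.rand(8, 8, 4).astype(np.float32)
y = multi_scale_pool(x)
print(y.shape)  # (8, 8, 12): spatial size preserved, channels tripled
```

The key design point: no spatial resolution is ever thrown away, yet the largest window sees as far as a stack of strided pools would.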
The framework follows MC-CNN, with these differences:
1) patch size: 37×37;
2) only the last three 1×1 convolution layers are fine-tuned, which works better than random initialization;
3) learning rate: 0.003 → 0.0003;
4) post-processing identical to MC-CNN.
Open question (to be organized): how to keep judgments accurate at disparity discontinuities while using a large window.
[10] K. Wang, “Adaptive stereo matching algorithm based on edge detection,” in ICIP, vol. 2. IEEE, 2004, pp. 1345–1348.
[11] K.-J. Yoon and I. S. Kweon, “Adaptive support-weight approach for correspondence search,” PAMI, vol. 28, no. 4, pp. 650–656, 2006.
[12] F. Tombari, S. Mattoccia, L. D. Stefano, and E. Addimanda, “Classification and evaluation of cost aggregation methods for stereo correspondence,” in CVPR. IEEE, 2008, pp. 1–8.
[13] J. Žbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2287–2318, 2016.
[14] S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in CVPR, June 2015, pp. 4353–4361.
[17] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[18] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene cnns,” arXiv preprint arXiv:1412.6856, 2014.
[22] L. Ladický, C. Häne, and M. Pollefeys, “Learning the matching function,” arXiv preprint arXiv:1502.00652, 2015.
[24] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV. Springer, 2014, pp. 346–361.
[25] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015, pp. 3431–3440.
[26] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in CVPR, June 2015, pp. 447–456.
[27] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” arXiv preprint arXiv:1505.04366, 2015.
[28] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in ICCV, 2015, pp. 2758–2766.