Detection of small objects in large swaths of imagery is one of the primary problems in satellite imagery analytics.
While object detection in ground-based imagery has benefited from research into new deep learning approaches, transitioning such technology to overhead imagery is nontrivial.
Faster R-CNN: typically ingests 1000 × 600 pixel images
SSD: 300 × 300 or 512 × 512
YOLO: 416 × 416 or 544 × 544
None can come remotely close to ingesting the ~16,000 × 16,000 input sizes typical of satellite imagery.
Due to the speed, accuracy, and flexibility of YOLO, the authors build their framework on YOLO.
Excluding implementation details, algorithms must adjust for:
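Small spatial extent: objects of interest in satellite imagery are small and often densely clustered, unlike the large, prominent subjects typical of ImageNet. Object resolution is set mainly by the ground sample distance (GSD), the physical length that one pixel covers. Satellites typically operate at an altitude of about 350 km; the sharpest commercial imagery reaches 30 cm GSD (30 cm per pixel), while ordinary digital satellite imagery reaches only 3-4 m. Small objects such as cars and boats may therefore span only a dozen or so pixels;
Complete rotation invariance: objects in satellite imagery can face any direction, whereas objects in ImageNet are mostly upright, so the detector needs rotation invariance;
Training example frequency: training data is scarce. High-quality labeled satellite imagery is lacking; SpaceNet has done a good deal of useful work, but further improvement is needed;
Ultra high resolution: unlike the small images that detectors usually ingest, satellite images routinely run to hundreds of millions of pixels, so simply downsampling them is not a workable approach.
The paper's contribution is that it addresses each of these issues separately.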
Ground sample distance (GSD)
The real-world size represented by one pixel of a satellite image. For example, 30 cm GSD means that one pixel corresponds to 30 cm on the ground.
Commercially available imagery varies from 30 cm GSD for the sharpest DigitalGlobe imagery to 3-4 meter GSD for Planet imagery.
That is to say, each car will be only ~15 pixels in extent even at the highest resolution.
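The ~15 pixel figure follows from dividing an object's physical length by the GSD. A minimal sketch of that arithmetic, assuming a car length of about 4.5 m (this value and the function name `pixel_extent` are illustrative and not taken from the paper):

```python
def pixel_extent(object_length_m: float, gsd_m: float) -> float:
    """Approximate extent of an object in pixels at a given GSD (meters per pixel)."""
    if gsd_m <= 0:
        raise ValueError("GSD must be positive")
    return object_length_m / gsd_m


CAR_LENGTH_M = 4.5  # assumed typical car length, not a value from the source

# 30 cm GSD (sharpest imagery) versus 3 m GSD (coarser imagery)
for gsd in (0.30, 3.0):
    print(f"GSD {gsd:.2f} m -> about {pixel_extent(CAR_LENGTH_M, gsd):.1f} px")
```

Under this assumption a car spans roughly 15 pixels at 30 cm GSD and under 2 pixels at 3 m GSD.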
The proposed approach can rapidly detect objects of vastly different scales with relatively little training data over multiple sensors.
Left: Model applied to a large 4000 × 4000 pixel test image downsampled to a size of 416 × 416 (the small objects vanish); none of the 1142 cars in this image are detected.
Right: Model applied to a small 416 × 416 pixel cutout; the excessive false negative rate is due to the high density of cars that cannot be differentiated by the 13 × 13 grid.
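Both failure modes in this figure can be checked with rough arithmetic. The sketch below assumes a car of about 15 pixels at native resolution (the figure quoted earlier in these notes); the function names are illustrative only.

```python
def downsampled_extent(object_px: float, image_px: int, network_px: int = 416) -> float:
    """Object extent after resizing an image_px-wide image to network_px."""
    return object_px * network_px / image_px


def grid_cell_px(network_px: int = 416, grid: int = 13) -> float:
    """Width in input pixels covered by one cell of the prediction grid."""
    return network_px / grid


CAR_PX = 15  # assumed native-resolution car extent

# Left panel: 4000 px image squeezed into 416 px
print(f"car after downsampling: {downsampled_extent(CAR_PX, 4000):.2f} px")

# Right panel: each cell of the 13 x 13 grid spans several car widths
cell = grid_cell_px()
print(f"grid cell: {cell:.0f} px, about {cell / CAR_PX:.1f} car lengths per cell side")
```

A car shrinks to under 2 pixels after downsampling, and at native resolution one 32-pixel grid cell can hold more than one tightly parked car, which is consistent with the misses described in the caption.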
Data augmentation + pre- and post-processing + a modified YOLOv1
[YOLOv1] "You Only Look Once: Unified, Real-Time Object Detection"
Consider the default YOLO network architecture, which downsamples by a factor of 32 and returns a 13 × 13 prediction grid.
The downsampling factor is reduced and the number of network layers is increased:
we implement a network architecture that uses 22 layers and downsamples by a factor of 16. Thus, a 416 × 416 pixel input image yields a 26 × 26 prediction grid.
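The grid sizes quoted above are simply the input side length divided by the total downsampling factor. A small sketch (the helper name `prediction_grid` is hypothetical):

```python
def prediction_grid(input_px: int, downsample_factor: int) -> int:
    """Side length of the prediction grid for a square input."""
    if input_px % downsample_factor != 0:
        raise ValueError("input size must be a multiple of the downsampling factor")
    return input_px // downsample_factor


print(prediction_grid(416, 32))  # default YOLO: 13
print(prediction_grid(416, 16))  # modified network: 26
```

Halving the downsampling factor doubles the grid along each axis, so the same 416 × 416 cutout gets four times as many cells in which to place densely packed objects.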
The activation function is Leaky ReLU.
A passthrough layer is added.
Related reading: Activation functions: ReLU, Leaky ReLU, PReLU, and RReLU
Related reading: YOLO v2 summary (Linux + Windows)
$N_f = N_{boxes} \times (N_{class} + 5)$
$N_{boxes}$ is the number of boxes per grid cell (default is 5).
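As a worked example of the formula, the sketch below evaluates it for the default of 5 boxes. Reading the "+ 5" as four box coordinates plus one confidence score follows the usual YOLO convention and is an assumption here, as are the class counts used.

```python
def final_layer_size(n_boxes: int, n_classes: int) -> int:
    """N_f = N_boxes * (N_class + 5).

    The 5 is assumed to be 4 box coordinates plus 1 confidence score.
    """
    return n_boxes * (n_classes + 5)


print(final_layer_size(5, 1))  # single-class detector (e.g. cars only): 30
print(final_layer_size(5, 3))  # hypothetical three-class detector: 40
```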
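Pre-processing: during training, the large image is split into many cutouts, with 15% overlap between neighbors.
Post-processing: at test time, the per-cutout detections are merged back together with NMS (non-maximum suppression).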
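The split-and-merge idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 416-pixel cutout size and 15% overlap come from these notes, while the greedy NMS variant, the 0.5 IoU threshold, and all function names are choices made for the example.

```python
def cutout_origins(image_size, cutout_size=416, overlap=0.15):
    """Top-left offsets along one axis so neighboring cutouts overlap."""
    if image_size <= cutout_size:
        return [0]
    stride = int(cutout_size * (1 - overlap))
    origins = list(range(0, image_size - cutout_size, stride))
    origins.append(image_size - cutout_size)  # last window sits flush with the edge
    return origins


def to_global(box, x0, y0):
    """Shift a cutout-local (x_min, y_min, x_max, y_max) box into image coordinates."""
    return (box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0)


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs; keeps the highest-scoring box per cluster."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, score))
    return kept


# One car seen by two neighboring cutouts whose x-origins are 0 and 353.
origins = cutout_origins(1000)
detections = [
    (to_global((380, 100, 395, 108), origins[0], 0), 0.9),
    (to_global((28, 100, 43, 108), origins[1], 0), 0.8),
]
merged = nms(detections)
print(origins, len(merged))
```

The overlap ensures that an object cut by one window's edge appears whole in a neighbor; the cost is duplicate detections in the overlap band, which NMS removes once all boxes are expressed in global image coordinates.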
An initial learning rate of $10^{-3}$, a weight decay of 0.0005, and a momentum of 0.9.
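To show how these three hyperparameters interact, here is a minimal sketch of one update step. It assumes stochastic gradient descent with classical momentum and L2 weight decay added to the gradient; the notes do not name the optimizer, so this update rule is an assumption, and the function name is illustrative.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=1e-3, momentum=0.9, weight_decay=0.0005):
    """One SGD-with-momentum update (assumed rule, not confirmed by the source).

    v <- momentum * v - lr * (grad + weight_decay * w)
    w <- w + v
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        v = momentum * v - lr * (g + weight_decay * w)
        new_velocity.append(v)
        new_weights.append(w + v)
    return new_weights, new_velocity


w, v = sgd_momentum_step([1.0], [0.5], [0.0])
print(w, v)
```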
Training takes 2 ~ 3 days on a single NVIDIA Titan X GPU.
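Cutouts at different resolutions are shown below; note that 0.15 m GSD is a higher resolution than 0.30 m GSD.
The authors train at 0.30 m GSD and test at a range of resolutions; the results are as follows.
The right-hand vertical axis is $F_c$, where $F_c = N_{predicted} / N_{truth}$.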
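The ratio is straightforward to compute. In the sketch below the ground-truth count of 1142 is the car count from the test image mentioned earlier; the predicted count is made up for illustration, and the function name is hypothetical.

```python
def fractional_count(n_predicted: int, n_truth: int) -> float:
    """F_c = N_predicted / N_truth; 1.0 means the predicted count matches the truth."""
    if n_truth <= 0:
        raise ValueError("n_truth must be positive")
    return n_predicted / n_truth


# 1142 ground-truth cars; 1100 predictions is a hypothetical value
print(f"F_c = {fractional_count(1100, 1142):.3f}")
```

Because it compares counts only, a value near 1 says the number of detections is right, not that each detection lies on a real object.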
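The authors also train and test at matching resolutions: at each of the thirteen resolutions, test scenes are evaluated with a unique model trained at that resolution. The results are as follows:
The bottom x-axis is the GSD; the top x-axis is the size of a car in pixels at the corresponding GSD. For example, at 0.15 m GSD a car is roughly 20 pixels in size.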
