[VQA Paper Reading] FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding

Background

Paper: "FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding"

PDF: https://arxiv.org/pdf/2012.02951.pdf

Abstract

Visual scene understanding is the core task in making any crucial decision in any computer vision system. Although popular computer vision datasets like Cityscapes, MS-COCO, and PASCAL provide good benchmarks for several tasks (e.g. image classification, segmentation, object detection), these datasets are hardly suitable for post-disaster damage assessment. On the other hand, existing natural disaster datasets include mainly satellite imagery, which has low spatial resolution and a high revisit period. Therefore, they cannot support quick and efficient damage assessment. Unmanned Aerial Vehicles (UAVs) can effortlessly access difficult places during a disaster and collect the high resolution imagery that is required for the aforementioned computer vision tasks. To address these issues we present a high resolution UAV imagery dataset, FloodNet, captured after Hurricane Harvey. This dataset demonstrates the post-flood damage in the affected areas. The images are labeled pixel-wise for the semantic segmentation task, and questions are produced for the task of visual question answering. FloodNet poses several challenges, including detection of flooded roads and buildings and distinguishing between natural water and flood water. With the advancement of deep learning algorithms, we can analyze the impact of any disaster and build a precise understanding of the affected areas. In this paper, we compare and contrast the performance of baseline methods for image classification, semantic segmentation, and visual question answering on our dataset.


1. Introduction

Understanding a visual scene from images has the potential to advance many decision support systems. The purpose of scene understanding is to classify the overall category of a scene as well as to establish the interrelationships among different object classes at both instance and pixel level. Recently, several datasets [19, 48, 23] have been presented to study different aspects of scenes through many computer vision tasks. A major factor in the success of most deep learning algorithms is the availability of large-scale datasets. Publicly available ground imagery datasets such as ImageNet [19], Microsoft COCO [48], PASCAL VOC [23], and Cityscapes [15] accelerate the development of current deep neural networks, but annotated aerial imagery is scarce and more tedious to obtain. Aerial scene understanding datasets are helpful for urban management, city planning, infrastructure maintenance, damage assessment after natural disasters, and high definition (HD) maps for self-driving cars. Existing aerial datasets, however, are limited mainly to classification [29, 44] or semantic segmentation [29, 60] of a few individual classes such as roads or buildings. Moreover, all of these datasets are collected in normal conditions, and computer vision algorithms are mainly developed for normal-looking objects. Most of these datasets do not address the unique challenges of understanding post-disaster scenarios as a task for disaster damage assessment. For quick response and large-scale recovery after a natural disaster such as a hurricane, wildfire, or extreme flooding, access to aerial images is critically important for the response team. To fill this gap we present the FloodNet dataset, associated with three different computer vision tasks, namely classification, semantic segmentation, and visual question answering.

Although several datasets [8, 7, 30, 66] have been provided for post-disaster damage assessment, they have numerous issues. Most of those datasets contain satellite images and images collected from social media. Satellite images are low in resolution and captured from high altitude. They are affected by several sources of noise, including clouds and smoke. Moreover, deploying satellites and collecting images from them is costly. On the other hand, images posted on social media are noisy and not scalable for deep learning models. To address these issues, our dataset, FloodNet, provides high resolution images taken from low altitude. These characteristics of FloodNet bring more clarity to scenes and thus help deep learning models make more accurate decisions regarding post-disaster damage assessment. In addition, most tasks on natural disaster datasets are restricted mainly to classification and object detection. Our dataset offers advanced computer vision challenges, namely semantic segmentation and visual question answering, besides classification. All three computer vision tasks can assist in the complete understanding of a scene and help rescue teams manage their operations efficiently during emergencies. Figure 1 shows sample annotations offered by FloodNet.

Our contribution is twofold. First, we introduce a high resolution UAV imagery dataset named FloodNet for post-disaster damage assessment. Second, we compare the performance of several classification, semantic segmentation, and visual question answering baselines on our dataset. To the best of our knowledge, this is the first VQA work focused on UAV imagery for disaster damage assessment.

The remainder of this paper is organized as follows: it begins by highlighting the existing datasets for natural disasters, semantic segmentation, and visual question answering in Section 2. Next, Section 3 describes the FloodNet dataset, including its collection and annotation process. Section 4 describes the experimental setups for all three aforementioned tasks along with a complete result analysis of the corresponding tasks. Finally, Section 5 summarizes the results, including conclusions and future work.



[Figure 1: sample FloodNet annotations and question-answer pairs (image not preserved)]



2. Related Works

In this section we provide an overview of datasets designed for natural disaster damage analysis, followed by a survey of techniques targeting aerial and satellite image classification, segmentation, and VQA.


2.1. Datasets

Natural disaster datasets can be initially classified into two classes: A) non-imaging datasets (text, tweets, social media posts) [35, 58] and B) imaging datasets [60, 29, 12]. Based on the image capture position, existing imaging natural disaster datasets can be further classified into three classes: B1) ground-level images [54], B2) satellite imagery [12, 29, 22, 17, 14, 60], and B3) aerial imagery [44, 78, 25]. Recently several datasets have been introduced by researchers for natural disaster damage assessment. Nguyen et al. proposed an extension of the AIDR system [53] to collect data from social media in [54]. The AIST Building Change Detection (ABCD) dataset was proposed in [25], which includes aerial post-tsunami images to identify whether buildings have been washed away. A combination of SpaceNet [16] and DeepGlobe [18] was presented in [22], and a segmentation model was proposed to detect changes in man-made structures to estimate the impact of natural disasters. Chen et al. in [12] proposed a fusion of different data resources for automatic building damage detection after a hurricane; the dataset includes satellite and aerial imagery along with vector data. The Onera Satellite Change Detection (OSCD) dataset was proposed in [17], which consists of multispectral aerial images to detect urban growth and changes over time. A collection of images of buildings and lands named Functional Map of the World (fMoW) was introduced by Christie et al. in [14]. The Aerial Image Database for Emergency Response (AIDER) is proposed by Kyrkou et al. in [44] for classification of UAV imagery. Rudner et al. [60] propose satellite imagery collected from the Sentinel-1 and Sentinel-2 satellites for semantic segmentation of flooded buildings. Gupta et al. proposed xBD [29], which has both pre- and post-event satellite images in order to assess building damage. Recently, ISBDA (Instance Segmentation in Building Damage Assessment) was created by Zhu et al. in [78] for instance segmentation, with images collected using UAVs.

A comparative study among different disaster and non-disaster datasets is shown in Table 1. As Table 1 shows, our dataset is the only high resolution UAV dataset collected after a hurricane that covers all three computer vision tasks: classification, semantic segmentation, and VQA. Although several pre- and post-disaster datasets have been proposed over the years, these datasets are primarily satellite imagery. Satellite imagery, including high resolution imagery, does not provide enough detail about post-disaster scenes to distinguish among different damage categories of different objects. On the other hand, the primary source of ground-level imagery is social media [54]. These images lack geolocation tags [78] and suffer from data scarcity for deep learning training [66]. Although some aerial datasets [44, 78] are prepared using UAVs, these datasets lack low altitude, high resolution images. The AIDER [44] dataset collected images from different sources for an image classification task and contains far more examples of normal cases than of damaged objects; it therefore lacks consistency and generalization. ISBDA [78] provides only building instance detection rather than covering other damaged objects and computer vision tasks like semantic segmentation and VQA. To address all these issues, FloodNet includes low altitude, high resolution post-disaster images annotated for classification, semantic segmentation, and VQA. FloodNet provides more detail about the scenes, which helps estimate post-disaster damage more accurately.



[Table 1: comparative study among disaster and non-disaster datasets (image not preserved)]

2.2. Algorithms

Here we review the related algorithms and some of their applications in disaster damage assessment.


2.2.1 Classification

The utility of deep neural networks was realized when they achieved high accuracy in categorizing images into different classes. This was given a boost mainly by AlexNet [43], which achieved state-of-the-art performance on the ImageNet [20] dataset in 2012. As this is arguably the most primitive computer vision task, a lot of networks were proposed subsequently which could perform classification on public datasets such as CIFAR [42, 41], MNIST [47], and Fashion-MNIST [67]. This led to a rise in networks such as [63], [32], [64], [13], [34], etc., whose architectures experimented with different skip connections, residual learning, multi-level feature extraction, separable convolutions, and optimization methods for mobile devices. Although these networks achieved good performance on day-to-day images of animals and vehicles, they were hardly sufficient to make predictions on scientific datasets such as those captured by aerial or space-borne sensors.

In this regard, some image classification networks have been explored for the purpose of post-disaster damage detection [45, 53, 5, 61, 76]. [53] used crowd-sourced images from social media which captured disaster sites from the ground level. [5] used a Support Vector Machine on top of a Convolutional Neural Network (CNN) followed by Hidden Markov Model post-processing to detect avalanches. [61] compared [63] and [32] for fire detection, but again the dataset used contained images taken by hand-held cameras on the ground. [76] developed a novel algorithm focused on wildfire detection through UAV images. [45] have done extensive work by developing a CNN for emergency response to fire, flood, collapsed buildings, and crashed cars. Our paper can contribute to this domain by providing multi-feature flooded scenes that can inspire the efficient training of more neural networks.


2.2.2 Semantic Segmentation

Semantic segmentation is one of the prime research areas in computer vision and an essential part of scene understanding. The Fully Convolutional Network (FCN) [49] is a pioneering work, followed by several state-of-the-art models addressing semantic segmentation. From the perspective of contextual aggregation, segmentation models can be divided into two types. Models such as PSPNet [74] or DeepLab [9, 10] perform spatial pyramid pooling [28, 46] at several grid scales and have shown promising results on several segmentation benchmarks. Encoder-decoder networks combine mid-level and high-level features to obtain global context at different scales; some notable works using this architecture are [10, 59]. On the other hand, there are models [75, 73, 24] which obtain feature representations by learning contextual dependencies over local features.

Besides proposing natural disaster datasets, many researchers have also presented different deep learning models for post-disaster damage assessment. The authors in [22] apply a previously proposed semantic segmentation model [21] to satellite images to detect changes in the structure of various man-made features, and thus detect areas of maximal impact due to natural disasters. Rahnemoonfar et al. present a densely connected recurrent neural network in [56] to perform semantic segmentation on UAV images for flooded area detection. Rudner et al. fuse multiresolution, multisensor, and multitemporal satellite imagery and propose a novel approach named Multi3Net in [60] for rapid segmentation of flooded buildings. Gupta et al. propose a DeepLabv3 [10] and DeepLabv3+ [11] inspired RescueNet in [31] for joint building segmentation and damage classification. All of these methods address the semantic segmentation of specific object classes like rivers, buildings, and roads rather than complete post-disaster scenes.

The above state-of-the-art semantic segmentation models have been applied primarily to ground-based imagery [15, 52]. In contrast, we apply three state-of-the-art semantic segmentation networks to our proposed FloodNet dataset: one encoder-decoder based network named ENet [55], one pyramid pooling module based network, PSPNet [74], and a third model, DeepLabv3+ [11], which employs both encoder-decoder and pyramid pooling based modules.


2.2.3 Visual Question Answering

Many researchers have proposed datasets and methods for the Visual Question Answering task. However, there are no such datasets apt for training and evaluating VQA algorithms on disaster damage assessment tasks. To find the right answer, VQA systems need to model both the question and the image (visual content). Substantial research efforts have been made on the VQA task based on real natural and medical imagery in the computer vision and natural language processing communities [4, 69, 38, 27], using deep learning-based multimodal methods [50, 68, 26, 3, 70, 72, 6, 39, 71]. In these methods, different approaches for fine-grained fusion between the semantic features of the image and the question have been proposed. Most recent VQA algorithms have been trained on natural image based datasets such as DAQUAR [62], COCO-VQA [4], Visual Genome [40], and Visual7W [79]. In addition, Path-VQA [33] and VQA-Med [2] are medical image datasets for which VQA algorithms are also considered. In this work, we present the FloodNet dataset to build and test VQA algorithms that can be deployed during natural emergencies. To the best of our knowledge, this is the first VQA dataset focused on UAV imagery for disaster damage assessment. To evaluate the performance of existing VQA algorithms we have implemented baseline models, the Stacked Attention Network [69], and MFB with Co-Attention [71] on our dataset.


3. The FloodNet Dataset

The data was collected with a small UAV platform, DJI Mavic Pro quadcopters, after Hurricane Harvey. Hurricane Harvey made landfall near Texas and Louisiana in August 2017 as a Category 4 hurricane. The Harvey dataset consists of video and imagery taken from several flights conducted between August 30 and September 4, 2017, in Fort Bend County, Texas, and other directly impacted areas. The dataset is unique for two reasons. One is fidelity: it contains sUAV imagery taken during the response phase by emergency responders, so the data reflects the state of the practice and what can reasonably be expected to be collected during a disaster. Second, it is the only known database of sUAV imagery for disasters. Note that there are other existing databases of imagery from unmanned and manned aerial assets collected during disasters, such as National Guard Predators or Civil Air Patrol, but those are larger, fixed-wing assets that operate above 400 feet AGL (Above Ground Level), the limit for sUAVs. All flights were flown at 200 feet AGL, compared to manned assets which normally fly at 500 feet AGL or higher. Such images are very high in resolution, making them unique compared to other natural disaster datasets. The post-flood damage to affected areas is demonstrated in all the images. Several objects (e.g. construction, road) and related attributes (e.g. the state of an object, such as flooded or non-flooded after Hurricane Harvey) are represented by these images. These attributes were considered in preparing this dataset for semantic segmentation and visual question answering.


3.1. Annotation Tasks

After a natural disaster, the response team first needs to identify affected neighborhoods, such as flooded neighborhoods (classification task). Then, within each neighborhood, they need to identify flooded buildings and roads (semantic segmentation) so that rescue teams can be sent to the affected areas. Furthermore, damage assessment after any natural calamity is done by querying changes in object conditions, so that the right resources can be allocated. Based on these needs, and with the help of response and rescue teams, we defined classification, semantic segmentation, and VQA tasks. In total, 3200 images have been annotated with 9 classes: building-flooded, building-non-flooded, road-flooded, road-non-flooded, water, tree, vehicle, pool, and grass. A building is classified as flooded when at least one side of the building is touching the flood water. Although we have classes for flooded buildings and roads, to distinguish between natural water and flood water, a "water" class has been created which represents any natural water body like rivers and lakes. For the classification task, each image is classified as either "flooded" or "non-flooded". If more than 30% of an image's area is occupied by flood water, the image is classified as flooded; otherwise, non-flooded. The number of images and instances corresponding to different classes is shown in Table 2. Our images are quite dense; on average, it takes about one hour to annotate each image. To ensure high quality, we performed the annotation process iteratively with a two-level quality check for each class. The images are annotated on the V7 Darwin platform [1] for classification and semantic segmentation. We split the dataset into training, validation, and test sets with 70% for training and 30% for validation and testing. The training, validation, and testing sets for all three tasks will be publicly available.

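The 30% rule above lends itself to a direct implementation. Below is a minimal sketch that derives the image-level label from a pixel-wise mask; both the integer class ids and the choice of which classes count as flood water are assumptions made for illustration, since the released masks may encode classes differently:

```python
import numpy as np

# Hypothetical class ids, for illustration only; check the released
# FloodNet masks for the actual encoding.
BUILDING_FLOODED, ROAD_FLOODED = 1, 3
FLOOD_CLASSES = (BUILDING_FLOODED, ROAD_FLOODED)

def image_level_label(mask: np.ndarray, threshold: float = 0.30) -> str:
    """Label an image 'flooded' if more than 30% of its pixels fall in
    flood-affected classes, following the paper's classification rule."""
    flood_fraction = np.isin(mask, FLOOD_CLASSES).mean()
    return "flooded" if flood_fraction > threshold else "non-flooded"

# Example on a random toy mask:
toy_mask = np.random.randint(0, 9, size=(256, 256))
print(image_level_label(toy_mask))
```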

[Table 2: number of images and instances per class (image not preserved)]

3.2. VQA Task

To provide a VQA framework, we focus on generating questions related to buildings, roads, and the entire image as a whole for our FloodNet dataset. By asking questions related to these objects we can assess the damage and understand the situation very precisely. The attributes associated with the aforementioned objects can be identified from Table 2. For the FloodNet-VQA dataset, ∼11,000 question-image pairs are considered when training the VQA networks. All the questions were created manually. Each image has an average of 3.5 questions. Each question is designed to elicit answers connected to the local and global regions of images. In Figure 1, some sample question-answer pairs from our dataset are presented.


3.2.1 Types of Question

Questions are divided into three groups, namely "Simple Counting", "Complex Counting", and "Condition Recognition". Figure 2 shows the distribution of question patterns based on the first words of the questions. All of the questions start with a word from the set {How, Is, What}. The maximum question length is 11.

In the Simple Counting problem, we ask about an object's frequency of presence (mainly buildings) in an image, regardless of the attribute (e.g. "How many buildings are in the image?"). Both flooded and non-flooded buildings can appear in a picture in several cases (e.g. the bottom image in Figure 1).

The question type Complex Counting is specifically intended to count the number of buildings with a particular attribute (e.g. "How many flooded / non-flooded buildings are in the image?"). We are interested in counting only the flooded or non-flooded buildings for this type of query. In comparison to simple counting, a high-level understanding of the scene is important for answering this type of question. This type of question also starts with the word "How".

Condition Recognition questions investigate the condition of the entire image as a whole or of the road. This type of question is divided into three sub-categories. One category deals with the condition of the road by asking questions such as "What is the condition of the road?". The second seeks the condition of the entire image by asking questions like "What is the overall condition of the entire image?". "Yes/No" type questions form the third sub-category of Condition Recognition; "Is the road flooded?" and "Is the road non-flooded?" are some examples from this sub-category. The starting word for this type of question is either "Is" or "What".


[Figure 2: distribution of question patterns and answers (image not preserved)]

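Since the grouping above is fully determined by a question's first word plus whether it names a flood attribute, it can be mirrored with a tiny heuristic router; this is an illustrative sketch of the taxonomy, not tooling from the paper:

```python
def question_type(question: str) -> str:
    """Route a FloodNet-VQA question to its group: 'How ...' questions are
    counting questions (complex if they name a flood attribute), while
    'Is'/'What' questions probe the condition of the road or whole image."""
    if question.split()[0] == "How":
        # "non-flooded" also contains "flooded", so one check covers both.
        return "Complex Counting" if "flooded" in question else "Simple Counting"
    return "Condition Recognition"  # questions starting with 'Is' or 'What'

assert question_type("How many buildings are in the image?") == "Simple Counting"
assert question_type("How many flooded buildings are in the image?") == "Complex Counting"
assert question_type("Is the road flooded?") == "Condition Recognition"
```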

3.2.2 Types of Answer

Both flooded and non-flooded buildings can exist in any image. For the complex counting problem, we count only the flooded or only the non-flooded buildings for a given image-question pair. Roads are also annotated as flooded or non-flooded. The second image in Figure 1 depicts both flooded and non-flooded roads; thus, the answer to a question like "What is the condition of the road?" for this kind of image will be both 'flooded and non-flooded'. Furthermore, the entire image may be graded as flooded or non-flooded. Table 3 lists the possible answers for the three types of questions, and Figure 2 shows the answer distribution for the different types of questions. The most frequent answers for counting problems are, in general, '4, 3, 2, 1', whereas '27, 30, 41, 40' are among the less frequent answers. For the Condition Recognition problem, 'non-flooded' and 'yes' are the most common answers.


[Table 3: possible answers for the three question types (image not preserved)]

4. Experiments

To understand the usability of these images for flood detection, we carry out three main tasks: image classification, semantic segmentation, and visual question answering (VQA). We begin by classifying the FloodNet data into flooded and non-flooded images, then we detect specific regions of flooded buildings, flooded roads, vehicles, etc. through semantic segmentation networks. Finally, we carry out VQA on this dataset. For all of our tasks, we use an NVIDIA GeForce RTX 2080 Ti GPU with an Intel Core i9 processor.

For image classification, we used three state-of-the-art networks, i.e. InceptionNetv3 [65], ResNet50 [32], and Xception [13], as base models to classify the images into flooded and non-flooded categories. These networks have significantly contributed to the field of computer vision by introducing unique design elements, such as the residual blocks in ResNet, the multi-scale architecture in InceptionNet, and the depthwise separable convolutions in Xception. For our classification task, the output of each base model was followed by a global average pooling layer, a fully connected layer with 1024 neurons and ReLU activation, and finally two neurons with softmax activation. We initialized our networks with ImageNet [20] weights and trained them for 30 epochs, with 20 steps per epoch, using binary cross-entropy loss.

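Read concretely, the classification head described above amounts to a few lines of Keras. A minimal sketch with ResNet50 as the base model; the optimizer choice and the input pipeline are not specified in the paper, so those parts are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# ImageNet-pretrained base model without its original classification head.
base = ResNet50(weights="imagenet", include_top=False)

# Head as described: GAP -> Dense(1024, ReLU) -> Dense(2, softmax).
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(1024, activation="relu")(x)
out = layers.Dense(2, activation="softmax")(x)

model = models.Model(base.input, out)
model.compile(optimizer="adam",                # optimizer is an assumption
              loss="binary_crossentropy",      # as stated in the paper
              metrics=["accuracy"])

# train_ds would be a hypothetical tf.data pipeline of (image, one-hot label) batches.
# model.fit(train_ds, epochs=30, steps_per_epoch=20)
```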

For semantic segmentation, we implemented three methods, i.e. PSPNet [74], ENet [55], and DeepLabv3+ [11], and evaluated their performance on the FloodNet dataset. For PSPNet, ResNet101 was used as the backbone. We used a "poly" learning rate policy with base learning rate 0.0001; momentum, weight decay, power, and the weight of the auxiliary loss were set to 0.9, 0.0001, 0.9, and 0.4 respectively. For ENet we used 0.0005 and 0.1 for the learning rate and learning rate decay respectively, with weight decay set to 0.0002. Similarly, for DeepLabv3+ we used a poly learning rate with base learning rate 0.01, weight decay 0.0001, and momentum 0.9. For image augmentation we used random shuffling, scaling, flipping, and random rotation, which helped the models avoid overfitting. Different experiments showed that larger crop sizes and batch sizes improve model performance. During training, we resized the images to 713 × 713, since a large crop size is useful for high resolution images. As the semantic segmentation evaluation metric we used mean IoU (mIoU).
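Two parts of this setup pin down exactly in code: the "poly" policy decays the learning rate as lr = base_lr · (1 − iter/max_iter)^power, and evaluation uses mean IoU over the nine classes. A small sketch of both, where max_it is an assumption since the paper does not state the iteration budget:

```python
import numpy as np

def poly_lr(base_lr: float, it: int, max_it: int, power: float = 0.9) -> float:
    """'Poly' schedule: lr = base_lr * (1 - it / max_it) ** power."""
    return base_lr * (1.0 - it / max_it) ** power

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 9) -> float:
    """Mean IoU over the FloodNet classes, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# PSPNet setting from the paper: base_lr = 1e-4, power = 0.9 (max_it assumed).
print(poly_lr(1e-4, it=5_000, max_it=10_000))
```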

For visual question answering, simple baselines (concatenation / element-wise product of image and text features), Multimodal Factorized Bilinear pooling (MFB) with co-attention [71], and the Stacked Attention Network [69] are considered in this study. All of these models are configured for our dataset. For image and question feature extraction, VGGNet (VGG16) and a two-layer LSTM are used, respectively. The feature vector from the last pooling layer of VGGNet and the 1024-D vector from the last word of the two-layer LSTM are taken as the image and question vectors, respectively. The dataset is split into training, validation, and testing data. All images are resized to 224 × 224 and the questions are tokenized. Using cross-entropy loss, all models are optimized by stochastic gradient descent (SGD) with batch size 16. In the training phase, models are validated on the validation set via an early stopping criterion with patience 30.
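The simplest baseline above, the element-wise product of a VGG16 image feature and a two-layer-LSTM question feature, can be sketched as follows. The vocabulary size, embedding dimension, answer-set size, and the 1024-D projection of the VGG feature are assumptions; the backbones, the 1024-D feature sizes, SGD with batch size 16, and early stopping with patience 30 come from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

VOCAB, MAX_LEN, NUM_ANSWERS = 5000, 11, 50  # assumed sizes; max question length is 11

# Image branch: last pooling layer of VGG16, projected to a 1024-D vector.
vgg = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
img_feat = layers.Dense(1024, activation="tanh")(layers.Flatten()(vgg.output))

# Question branch: two-layer LSTM; the final state is the 1024-D question vector.
q_in = layers.Input(shape=(MAX_LEN,))
q = layers.Embedding(VOCAB, 300)(q_in)
q = layers.LSTM(1024, return_sequences=True)(q)
q_feat = layers.LSTM(1024)(q)

# Element-wise product fusion, then a softmax over candidate answers.
fused = layers.Multiply()([img_feat, q_feat])
out = layers.Dense(NUM_ANSWERS, activation="softmax")(fused)

model = models.Model([vgg.input, q_in], out)
model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss="categorical_crossentropy", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(patience=30, restore_best_weights=True)
```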

4.1. Image Classification Analysis

The classification accuracies of the three networks are shown in Table 6. From this table, it can be seen that the highest performance on the test set was achieved by ResNet. The residual architecture of ResNet helped classify the test images into flooded and non-flooded more successfully than the other networks. Even though Xception and InceptionNet have much wider architectures and show higher classification accuracy on ImageNet data, this is not the case on the FloodNet dataset. Therefore, networks which give high accuracy on everyday images such as those in ImageNet cannot readily be used to detect image features in aerial datasets containing more complex urban and natural scenes. Thus, there is a need to design separate novel architectures which can effectively detect urban disasters.


[Table 6: classification accuracy of the three networks (image not preserved)]

4.2. Semantic Segmentation Performance Analysis

Semantic segmentation results for ENet, DeepLabv3+, and PSPNet are presented in Table 4. From the segmentation experiment it is evident that detecting small objects like vehicles and pools is the most difficult task for the segmentation networks. Flooded buildings and roads are the next most challenging classes for all three models. Among the segmentation models, PSPNet performs best in all classes. It is interesting to note that although DeepLabv3+ and PSPNet collect global contextual information, their performance in detecting flooded buildings and flooded roads is still low, since distinguishing between flooded and non-flooded objects depends heavily on the respective contexts of the classes.


[Table 4: per-class semantic segmentation results (mIoU) (image not preserved)]

4.3. Visual Question Answering Performance Analysis

From Table 5, we can see that the counting problems (simple and complex) are very challenging compared to the condition recognition task. Many objects are very small, which makes them very difficult even for humans to count. Accuracy for the 'Condition Recognition' category is high. This is because it is not difficult to recognize the condition of whole images as well as roads, since they occupy a large proportion of the image given its overall size. MFB with co-attention [71] outperforms all the other methods on all types of questions.


[Table 5: VQA accuracy by question type (image not preserved)]
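For intuition about why MFB leads here: instead of concatenating or multiplying the two feature vectors directly, it approximates a full bilinear interaction between the modalities with a low-rank factorization, sum pooling, and normalization. A sketch of the fusion core (without the co-attention module), with the factor size and output dimension taken from the original MFB paper [71] rather than from this one:

```python
import tensorflow as tf
from tensorflow.keras import layers

K_FACTOR, OUT_DIM = 5, 1000  # typical MFB hyperparameters; assumptions here

def mfb_fuse(img_feat: tf.Tensor, q_feat: tf.Tensor) -> tf.Tensor:
    """Low-rank bilinear fusion: project both features to k*o dims,
    multiply element-wise, sum-pool each group of k factors, then normalize."""
    x = layers.Dense(K_FACTOR * OUT_DIM)(img_feat)
    y = layers.Dense(K_FACTOR * OUT_DIM)(q_feat)
    z = x * y                                                   # bilinear interaction
    z = tf.reduce_sum(tf.reshape(z, (-1, OUT_DIM, K_FACTOR)), axis=-1)  # sum pooling
    z = tf.sign(z) * tf.sqrt(tf.abs(z) + 1e-12)                 # power normalization
    return tf.math.l2_normalize(z, axis=-1)                     # L2 normalization
```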

5. Discussion and Conclusion

In this paper, we introduce the FloodNet dataset for post natural disaster damage assessment. We describe the dataset collection procedure along with its different features and statistics. The UAV images provide a high resolution, low altitude dataset that is especially significant for performing computer vision tasks. The dataset is annotated for classification, semantic segmentation, and VQA. We perform three computer vision tasks, namely image classification, semantic segmentation, and visual question answering, and provide in-depth analysis for all three.

Although UAVs are a cost-effective and prompt solution for post natural disaster damage assessment, the FloodNet dataset collected using UAVs poses several challenges. Among all the existing classes, vehicles and pools are the smallest in shape and are therefore difficult for any network model to detect. The segmentation results in Table 4 confirm the difficulty of identifying small objects like vehicles and pools. Detecting flooded buildings is another prime challenge: since UAV images only include the top view of a building, it is very difficult to estimate how much damage has been done to that building, and segmentation models do not perform well in detecting flooded buildings. Similarly, flooded roads are hard to distinguish from non-flooded roads, and the results from the segmentation models prove that. Most importantly, distinguishing between flooded and non-flooded roads and buildings depends on their corresponding contexts, and current state-of-the-art models still lack good performance on the computer vision tasks performed on FloodNet. To the best of our knowledge this is the first time these three crucial computer vision tasks have been addressed together on a post natural disaster dataset. The experiments on the dataset reveal great challenges, and we strongly hope that FloodNet will motivate and support the development of more sophisticated models for deeper semantic understanding and post-disaster damage assessment.


6. Acknowledgment

This work is partially supported by Microsoft and Amazon.
