Abstract: Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling query skew, which is common in practice, and optimize communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to optimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system.
Conclusion: We presented a query executor and optimizer that improve the query execution plans generated for spatial queries.
We introduced a new spatial bitmap filter to reduce redundant network communication cost. Empirical studies on various real datasets demonstrate the superiority of our approaches compared with existing systems.
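The routing idea behind a spatial bitmap filter can be sketched as follows. This is a minimal illustration and not the authors' implementation: the class `GridBitmapFilter`, the function `route`, and the fixed uniform grid are all assumptions made for the example. Each partition keeps one bit per grid cell, and a query is forwarded to a partition only if it touches at least one cell that holds data.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


class GridBitmapFilter:
    """Hypothetical filter: one bit per grid cell, set if the cell holds data."""

    def __init__(self, bounds: Rect, cells_per_side: int) -> None:
        self.bounds = bounds
        self.n = cells_per_side
        self.bits = 0  # a Python int serves as the bitmap

    def _cell(self, x: float, y: float) -> Tuple[int, int]:
        xmin, ymin, xmax, ymax = self.bounds
        col = int((x - xmin) / (xmax - xmin) * self.n)
        row = int((y - ymin) / (ymax - ymin) * self.n)
        # Clamp so coordinates on or beyond the boundary map to an edge cell.
        return min(max(col, 0), self.n - 1), min(max(row, 0), self.n - 1)

    def add(self, x: float, y: float) -> None:
        col, row = self._cell(x, y)
        self.bits |= 1 << (row * self.n + col)

    def may_contain(self, query: Rect) -> bool:
        """Return False only when the query certainly matches no data here."""
        bxmin, bymin, bxmax, bymax = self.bounds
        qxmin, qymin, qxmax, qymax = query
        if qxmax < bxmin or qxmin > bxmax or qymax < bymin or qymin > bymax:
            return False  # query does not overlap the partition at all
        c0, r0 = self._cell(qxmin, qymin)
        c1, r1 = self._cell(qxmax, qymax)
        return any(
            (self.bits >> (r * self.n + c)) & 1
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)
        )


def route(query: Rect, filters: List[GridBitmapFilter]) -> List[int]:
    """Ids of the partitions that the query has to be forwarded to."""
    return [pid for pid, f in enumerate(filters) if f.may_contain(query)]


left = GridBitmapFilter((0.0, 0.0, 10.0, 10.0), cells_per_side=4)
right = GridBitmapFilter((10.0, 0.0, 20.0, 10.0), cells_per_side=4)
left.add(1.0, 1.0)
right.add(19.0, 9.0)
print(route((0.0, 0.0, 2.0, 2.0), [left, right]))   # only partition 0 is contacted
print(route((9.0, 0.0, 12.0, 3.0), [left, right]))  # overlaps both, but every touched cell is empty
```

The second query overlaps both partition boundaries, yet no message is sent because every cell it touches is empty; that is the kind of redundant communication such a filter is meant to avoid.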
Abstract: The paper presents the details of designing and developing GEOSPARK, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. The paper also gives a detailed analysis of the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-tree, Quad-Tree, and KDB-Tree. The paper also shows how building local spatial indexes, e.g., R-Tree or Quad-Tree, on each Spark data partition can speed up the local computation and hence decrease the overall runtime of the spatial analytics program. Furthermore, the paper introduces a comprehensive experiment analysis that surveys and experimentally evaluates the performance of running de-facto spatial operations like spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GEOSPARK achieves up to two orders of magnitude faster run time performance than existing Hadoop-based systems and up to an order of magnitude faster performance than Spark-based systems.
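The combination of spatial partitioning and per-partition local indexes can be illustrated with a small single-machine sketch. It does not use the GEOSPARK API: `LocalIndex`, `build_partition_indexes`, and `range_query` are hypothetical names, the partitioner is a plain uniform grid, and a list sorted by x stands in for an R-Tree or Quad-Tree.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


class LocalIndex:
    """Stand-in for a per-partition R-Tree or Quad-Tree: points sorted by x."""

    def __init__(self, points: List[Point]) -> None:
        self.points = sorted(points)
        self.xs = [p[0] for p in self.points]

    def range(self, xmin: float, ymin: float, xmax: float, ymax: float) -> List[Point]:
        # Binary search narrows the x range; only those candidates are checked on y.
        lo = bisect.bisect_left(self.xs, xmin)
        hi = bisect.bisect_right(self.xs, xmax)
        return [p for p in self.points[lo:hi] if ymin <= p[1] <= ymax]


def build_partition_indexes(points: List[Point], bounds: Rect,
                            nx: int, ny: int) -> Dict[int, LocalIndex]:
    """Uniform-grid partitioning followed by one local index per partition."""
    xmin, ymin, xmax, ymax = bounds
    parts: Dict[int, List[Point]] = defaultdict(list)
    for x, y in points:
        col = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        row = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        parts[row * nx + col].append((x, y))
    return {cell: LocalIndex(pts) for cell, pts in parts.items()}


def range_query(indexes: Dict[int, LocalIndex], query: Rect) -> List[Point]:
    """Run the query on every partition (as a per-partition map would) and merge."""
    out: List[Point] = []
    for index in indexes.values():
        out.extend(index.range(*query))
    return sorted(out)


pts = [(1, 1), (2, 8), (6, 3), (9, 9), (5, 5)]
indexes = build_partition_indexes(pts, (0, 0, 10, 10), nx=2, ny=2)
print(range_query(indexes, (0, 0, 5, 5)))
```

In a real deployment each partition would live on a different executor, so the local index reduces the work done inside each task rather than the number of tasks.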
Abstract: This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data.
GeoSpark consists of three layers: Apache Spark Layer, Spatial RDD Layer and Spatial Query Processing Layer.
Apache Spark Layer provides basic Spark functionalities that include loading / storing data to disk as well as regular RDD operations. Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) which extend regular Apache Spark RDDs to support geometrical and spatial objects. GeoSpark provides a geometrical operations library that accesses Spatial RDDs to perform basic geometrical operations (e.g., Overlap, Intersect). System users can leverage the newly defined SRDDs to effectively develop spatial data processing programs in Spark. The Spatial Query Processing Layer efficiently executes spatial query processing algorithms (e.g., Spatial Range, Join, KNN query) on SRDDs. GeoSpark also allows users to create a spatial index (e.g., R-tree, Quad-tree) that boosts spatial data processing performance in each SRDD partition. Preliminary experiments show that GeoSpark achieves better run time performance than its Hadoop-based counterparts.
Abstract:We present the Simba (Spatial In-Memory Big data Analytics) system, which offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba natively extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and DataFrame API. It enables the construction of indexes over RDDs inside the engine in order to work with big spatial data and complex spatial operations. Simba also comes with an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput in big spatial data analysis. This demonstration proposal describes key ideas in the design of Simba, and presents a demonstration plan.
Conclusion:We will present an end-to-end implementation of Simba and show the key features it provides. The demonstration is conducted on the following environments: Data set: we will have an HDFS instance deployed on a 10-node cluster hosting several data sets of various sizes sampled from OSM and GDELT [2]. The OSM data set will contain up to 500 million records, while GDELT will have up to 75 million records. Data sets of larger size are also available.
Simba cluster: During the demo, we will have a Simba instance running on the 10-node cluster. We will also run different queries on a Spark SQL instance on the same cluster, to make a side-by-side comparison with Spark SQL.
Simba web console: To make Simba more accessible and easy to use for an end-user, we connect it to an open source visualization tool called Zeppelin [1] as shown in Figure 6. Together with the application UI of the underlying Spark system, users can have a clear view of Simba’s new grammar, query outputs and internal query execution plans. Users can also easily issue queries of their choice through the GUI in both SQL and the DataFrame API, and see the results in various visualized forms.
Figure 6: Simba web console.
Our demonstration plan enables attendees to easily explore spatial and temporal features of the underlying data from OSM and GDELT. First, we will show how to import data from different sources and build indexes over them. Then, we will issue range, kNN and join queries to compare Simba’s performance with Spark SQL side by side. In addition, we will look into the query execution plan and see how Simba optimizes different queries.
We will also demonstrate the flexibility of Simba in terms of query expression by issuing a number of compound queries involving spatial join, range selections, grouping and aggregations (like the query we have shown in Section 3.1). Finally, another group of examples will present Simba’s DataFrame API, and how to use Simba to conduct a detailed analysis on a set of POIs over OSM and GDELT data sets. Attendees will be able to run other ad-hoc queries as well over the datasets through Simba’s web console.
Abstract: framework with native support for spatial data.
SpatialHadoop is a comprehensive extension to Hadoop that injects spatial data awareness in each Hadoop layer, namely, the language, storage, MapReduce, and operations layers. In the language layer, SpatialHadoop adds a simple and expressive high level language for spatial data types and operations. In the storage layer, SpatialHadoop adapts traditional spatial index structures, Grid, R-tree and R+-tree, to form a two-level spatial index. SpatialHadoop enriches the MapReduce layer with two new components, SpatialFileSplitter and SpatialRecordReader, for efficient and scalable spatial data processing. In the operations layer, SpatialHadoop is already equipped with a dozen operations, including range query, kNN, and spatial join. Other spatial operations are also implemented following a similar approach.
Extensive experiments on a real system prototype and real datasets show that SpatialHadoop achieves orders of magnitude better performance than Hadoop for spatial data processing.
Abstract:Large spatial data becomes ubiquitous. As a result, it is critical to provide fast, scalable, and high-throughput spatial queries and analytics for numerous applications in location-based services (LBS).
Traditional spatial databases and spatial analytics systems are disk-based and optimized for IO efficiency. But increasingly, data are stored and processed in memory to achieve low latency, and CPU time becomes the new bottleneck. We present the Simba (Spatial In-Memory Big data Analytics) system that offers scalable and efficient in-memory spatial query processing and analytics for big spatial data. Simba is based on Spark and runs over a cluster of commodity machines. In particular, Simba extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API. It introduces indexes over RDDs in order to work with big spatial data and complex spatial operations. Lastly, Simba implements an effective query optimizer, which leverages its indexes and novel spatial-aware optimizations, to achieve both low latency and high throughput. Extensive experiments over large data sets demonstrate Simba’s superior performance compared with other spatial analytics systems.
Abstract:Despite two decades of research in moving object databases and a few research prototypes that have been proposed, there is not yet a mainstream system targeted for industrial use. In this article, we present MobilityDB, a moving object database that extends the type system of PostgreSQL and PostGIS with abstract data types for representing moving object data. The types are fully integrated into the platform to reuse its powerful data management features. Furthermore, MobilityDB builds on existing operations, indexing, aggregation, and optimization framework. This is all made accessible via the SQL query interface.
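To make the notion of an abstract data type for moving objects concrete, here is a minimal sketch of a temporal point. It is an illustration of the concept only and not MobilityDB code: the class `TemporalPoint`, the method `value_at`, and the choice of linear interpolation between samples are assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TemporalPoint:
    """Hypothetical moving point: (t, x, y) samples with strictly increasing t."""

    instants: List[Tuple[float, float, float]]

    def value_at(self, t: float) -> Optional[Tuple[float, float]]:
        """Position at time t, linearly interpolated between neighbouring samples."""
        times = [i[0] for i in self.instants]
        if not times or t < times[0] or t > times[-1]:
            return None  # the object is undefined outside its observed period
        j = bisect_right(times, t)
        if j == len(times):
            return self.instants[-1][1:]  # t equals the last timestamp
        t0, x0, y0 = self.instants[j - 1]
        t1, x1, y1 = self.instants[j]
        f = (t - t0) / (t1 - t0)
        return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))


trip = TemporalPoint([(0, 0.0, 0.0), (10, 10.0, 20.0)])
print(trip.value_at(5))
print(trip.value_at(11))
```

A database-integrated type additionally gets indexing, aggregation, and query optimization from the host system, which a standalone class like this one does not provide.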
Abstract:This demo presents SpatialHadoop as the first full-fledged MapReduce framework with native support for spatial data. SpatialHadoop is a comprehensive extension to Hadoop that pushes spatial data inside the core functionality of Hadoop. SpatialHadoop runs existing Hadoop programs as is, yet, it achieves order(s) of magnitude better performance than Hadoop when dealing with spatial data. SpatialHadoop employs a simple spatial high level language, a two-level spatial index structure, basic spatial components built inside the MapReduce layer, and three basic spatial operations: range queries, k-NN queries, and spatial join. Other spatial operations can be similarly deployed in SpatialHadoop. We demonstrate a real system prototype of SpatialHadoop running on an Amazon EC2 cluster against two sets of real spatial data obtained from Tiger Files and OpenStreetMap with sizes 60GB and 300GB, respectively.
Abstract:Support of high performance queries on large volumes of spatial data becomes increasingly important in many application domains, including geospatial problems in numerous fields, location based services, and emerging scientific applications that are increasingly data- and compute-intensive. The emergence of massive scale spatial data is due to the proliferation of cost effective and ubiquitous positioning technologies, development of high resolution imaging technologies, and contribution from a large number of community users. There are two major challenges for managing and querying massive spatial data to support spatial queries: the explosion of spatial data, and the high computational complexity of spatial queries.
In this paper, we present Hadoop-GIS, a scalable and high performance spatial data warehousing system for running large scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through spatial partitioning, customizable spatial query engine RESQUE, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. Hadoop-GIS is integrated into Hive to support declarative spatial queries with an integrated architecture. Our experiments have demonstrated the high efficiency of Hadoop-GIS on query response and high scalability to run on commodity clusters. Our comparative experiments have shown that the performance of Hadoop-GIS is on par with parallel SDBMS and outperforms SDBMS for compute-intensive queries. Hadoop-GIS is available as a set of libraries for processing spatial queries, and as an integrated software package in Hive.
Abstract: SpatialHadoop is an extended MapReduce framework that supports global indexing, which spatially partitions the data across machines and provides orders of magnitude speedup compared to traditional Hadoop. In this paper, we describe seven alternative partitioning techniques and experimentally study their effect on the quality of the generated index and the performance of range and spatial join queries. We found that using a 1% sample is enough to produce high quality partitions. Also, we found that the total area of partitions is a reasonable measure of the quality of indexes when running spatial join. This study will assist researchers in choosing a good spatial partitioning technique in distributed environments.
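The finding about the 1% sample can be illustrated with a small sketch of sample-based partitioning. It is a simplified stand-in and not necessarily one of the seven techniques studied: the functions `strip_boundaries` and `assign` are hypothetical, and the partitions are vertical strips with equi-depth boundaries taken from the sample.

```python
import bisect
import random
from typing import List, Tuple

Point = Tuple[float, float]


def strip_boundaries(points: List[Point], num_partitions: int,
                     sample_ratio: float = 0.01, seed: int = 0) -> List[float]:
    """Derive x-axis split positions from a random sample of the data."""
    rng = random.Random(seed)
    k = min(len(points), max(num_partitions, int(len(points) * sample_ratio)))
    sample = sorted(p[0] for p in rng.sample(points, k))
    # Equi-depth splits: each strip receives about the same share of the sample.
    return [sample[i * len(sample) // num_partitions]
            for i in range(1, num_partitions)]


def assign(points: List[Point], boundaries: List[float]) -> List[List[Point]]:
    """Place every point of the full dataset into the strip that contains its x."""
    parts: List[List[Point]] = [[] for _ in range(len(boundaries) + 1)]
    for p in points:
        parts[bisect.bisect_right(boundaries, p[0])].append(p)
    return parts


rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100_000)]
bounds = strip_boundaries(data, num_partitions=4)
print([len(part) for part in assign(data, bounds)])
```

Although the boundaries are computed from only about 1,000 of the 100,000 skewed points, the resulting strips hold roughly equal numbers of records, which is why a small sample is usually sufficient for load balancing.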
Abstract: We present LocationSpark, a spatial data processing system built on top of Apache Spark, a widely used distributed data processing system. LocationSpark offers a rich set of spatial query operators, e.g., range search, kNN, spatio-textual operation, spatial-join, and kNN-join. To achieve high performance, LocationSpark employs various spatial indexes for in-memory data, and guarantees that immutable spatial indexes have low overhead with fault tolerance. In addition, we build two new layers over Spark, namely a query scheduler and a query executor. The query scheduler is responsible for mitigating skew in spatial queries, while the query executor selects the best plan based on the indexes and the nature of the spatial queries. Furthermore, to avoid unnecessary network communication overhead when processing overlapped spatial data, we embed an efficient spatial Bloom filter into LocationSpark’s indexes. Finally, LocationSpark tracks frequently accessed spatial data, and dynamically flushes less frequently accessed data to disk. We evaluate our system on real workloads and demonstrate that it achieves an order of magnitude performance gain over a baseline framework.
Abstract:With the explosive use of GPS-enabled devices, increasingly massive volumes of trajectory data capturing the movements of people and vehicles are becoming available, which is useful in many application areas, such as transportation, traffic management, and location-based services. As a result, many trajectory data management and analytic systems have emerged that target either offline or online settings. However, some applications call for both offline and online analyses. For example, in traffic management scenarios, offline analyses of historical trajectory data can be used for traffic planning purposes, while online analyses of streaming trajectories can be adopted for congestion monitoring purposes. Existing trajectory-based systems tend to perform offline and online trajectory analysis separately, which is inefficient. In this paper, we propose a hybrid and efficient framework, called Dragoon, based on Spark, to support both offline and online big trajectory management and analytics. The framework features a mutable resilient distributed dataset model, including RDD Share, RDD Update, and RDD Mirror, which enables hybrid storage of historical and streaming trajectories. It also contains a real-time partitioner capable of efficiently distributing trajectory data and supporting both offline and online analyses. Therefore, Dragoon provides a hybrid analysis pipeline. Support for several typical trajectory queries and mining tasks demonstrates the flexibility of Dragoon.
An extensive experimental study using both real and synthetic trajectory datasets shows that Dragoon (1) achieves offline trajectory query performance similar to that of the state-of-the-art system UlTraMan; (2) reduces storage overhead by up to a factor of two compared with UlTraMan during trajectory editing; (3) improves scalability by at least 40% compared with popular stream processing frameworks (i.e., Flink and Spark Streaming); and (4) on average doubles performance for online trajectory data analytics.