Abstracts and Notes on SIGMOD 2017 Papers

SIGMOD 2017

Continuously updated.

3.1 Concurrency

ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications


Cicada: Dependably Fast Multi-Core In-Memory Transactions


BatchDB: Efficient Isolated Execution of Hybrid OLTP+OLAP Workloads for Interactive Applications


Transaction Repair for Multi-Version Concurrency Control


Concerto: A High Concurrency Key-Value Store with Integrity


Fast Failure Recovery for Main-Memory DBMSs on Multicores


Bringing Modular Concurrency Control to the Next Level

3.2 Storage and Distribution

Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics


OctopusFS: A Distributed File System with Tiered Storage Management


Monkey: Optimal Navigable Key-Value Store


Wide Table Layout Optimization based on Column Ordering and Duplication


Query Centric Partitioning and Allocation for Partially Replicated Database Systems


Spanner: Becoming a SQL System


3.3 Streams

Enabling Signal Processing over Data Streams


Complete Event Trend Detection in High-Rate Event Streams


LittleTable: A Time-Series Database and Its Uses

3.4 Versions and Incremental Maintenance

Incremental View Maintenance over Array Data


Incremental Graph Computations: Doable and Undoable


DEX: Query Execution in a Delta-based Storage System

3.5 Parallel and Distributed Query Processing

Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study


Distributed Provenance Compression

Abstract
Network provenance, which records the execution history of network events as meta-data, is becoming increasingly important for network accountability and failure diagnosis. For example, network provenance may be used to trace the path that a message traversed in a network, or to reveal how a particular routing entry was derived and the parties involved in its derivation. A challenge when storing the provenance of a live network is that the large number of arriving messages may incur substantial storage overhead. In this paper, we explore techniques to dynamically compress distributed provenance stored at scale. Logically, compression is achieved by grouping equivalent provenance trees and maintaining only one concrete copy for each equivalence class. To efficiently identify the equivalent provenance, we (1) introduce distributed event-based linear programs (DELPs) to specify distributed network applications, and (2) statically analyze DELPs to allow for quick detection of provenance equivalence at runtime. Our experimental results demonstrate that our approach leads to significant storage reduction and query latency improvement over alternative approaches.

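The core compression step described in the abstract, grouping equivalent provenance trees and keeping one concrete copy per equivalence class, can be sketched in a few lines. This is an illustrative toy, not the paper's method: `ProvNode`, `ProvStore`, and the signature scheme are invented names, and the paper detects equivalence via static analysis of DELPs rather than runtime hashing.

```python
# Toy sketch: deduplicate equivalent provenance trees by a canonical signature.
# All names here are illustrative, not from the paper.

class ProvNode:
    def __init__(self, rule, children=()):
        self.rule = rule              # derivation rule (or base fact) at this node
        self.children = tuple(children)

    def signature(self):
        # Canonical, order-insensitive signature of the derivation tree.
        return (self.rule, tuple(sorted(c.signature() for c in self.children)))

class ProvStore:
    """Keeps one concrete tree per equivalence class; others become references."""
    def __init__(self):
        self.classes = {}             # signature -> concrete tree
        self.refs = 0                 # trees stored only as references

    def add(self, tree):
        sig = tree.signature()
        if sig in self.classes:
            self.refs += 1            # equivalent tree already stored: pointer only
        else:
            self.classes[sig] = tree
        return sig

store = ProvStore()
# Two routers derive the same route via the same rule structure:
t1 = ProvNode("route(a,b)", [ProvNode("link(a,b)")])
t2 = ProvNode("route(a,b)", [ProvNode("link(a,b)")])
store.add(t1); store.add(t2)
print(len(store.classes), store.refs)  # prints: 1 1
```

Running it stores a single concrete tree for the two equivalent derivations and counts the second one as a reference only, which is the storage reduction the paper quantifies.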


ROBUS: Fair Cache Allocation for Data-parallel Workloads


Heterogeneity-aware Distributed Parameter Servers


Distributed Algorithms on Exact Personalized PageRank


Parallelizing Sequential Graph Computations (Best paper award)

3.6 Tree & Graph Processing

Landmark Indexing for Evaluation of Label-Constrained Reachability Queries


Efficient Ad-Hoc Graph Inference and Matching in Biological Databases


DAG Reduction: Fast Answering Reachability Queries


Flexible and Feasible Support Measures for Mining Frequent Patterns in Large Labeled Graphs


Exploiting Common Patterns for Tree-Structured Data


Extracting and Analyzing Hidden Graphs from Relational Databases


TrillionG: A Trillion-scale Synthetic Graph Generator using a Recursive Vector Model


ZipG: A Memory-efficient Graph Store for Interactive Queries


All-in-One: Graph Processing in RDBMSs Revisited


Computing A Near-Maximum Independent Set in Linear Time by Reducing-Peeling

3.7 New Hardware

Accelerating Pattern Matching Queries in Hybrid CPU-FPGA Architectures


A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs


FPGA-based Data Partitioning


Template Skycube Algorithms for Heterogeneous Parallelism on Multicore and GPU Architectures

3.8 Interactive Data Exploration and AQP

Controlling False Discoveries During Interactive Data Exploration


MacroBase: Prioritizing Attention in Fast Data


Data Canopy: Accelerating Exploratory Statistical Analysis


Two-Level Sampling for Join Size Estimation


A General-Purpose Counting Filter: Making Every Bit Count


BePI: Fast and Memory-Efficient Method for Billion-Scale Random Walk with Restart

3.9 Beliefs, Conflicts, Knowledge

Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning


Database Learning: Toward a Database that Becomes Smarter Every Time


Staging User Feedback toward Rapid Conflict Resolution in Data Fusion

3.10 Influence in Social Networks

Discovering Your Selling Points: Personalized Social Influential Tags Exploration


Coarsening Massive Influence Networks for Scalable Diffusion Analysis


Debunking the Myths of Influence Maximization: An In-Depth Benchmarking Study

3.11 Mappings, Transformations, Pricing

Interactive Mapping Specification with Exemplar Tuples


Foofah: Transforming Data By Example


QIRANA: A Framework for Scalable Query Pricing

3.12 Optimization and Performance

Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?


Optimization of Disjunctive Predicates for Main Memory Column Stores


A Top-Down Approach to Achieving Performance Predictability in Database Systems


An Experimental Study of Bitmap Compression vs. Inverted List Compression


Automatic Database Management System Tuning Through Large-scale Machine Learning


Solving the Join Ordering Problem via Mixed Integer Linear Programming


Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

3.13 User Preferences

Determining the Impact Regions of Competing Options in Preference Space


Efficient Computation of Regret-ratio Minimizing Set: A Compact Maxima Representative


FEXIPRO: Fast and Exact Inner Product Retrieval in Recommender Systems


Feedback-Aware Social Event-Participant Arrangement

3.14 Machine Learning

Schema Independent Relational Learning


Scalable Kernel Density Classification via Threshold-Based Pruning


The BUDS Language for Distributed Bayesian Machine Learning


A Cost-based Optimizer for Gradient Descent Optimization

3.15 Encryption

Fast Searchable Encryption With Tunable Locality


Cryptanalysis of Comparable Encryption in SIGMOD’16


BLOCKBENCH: A Framework for Analyzing Private Blockchains

3.16 Cleaning, Versioning, Fusion

Living in Parallel Realities: Co-Existing Schema Versions with a Bidirectional Database Evolution Language


Synthesizing Mapping Relationships Using Table Corpus


Waldo: An Adaptive Human Interface for Crowd Entity Resolution


Online Deduplication for Databases


QFix: Diagnosing Errors through Query Histories


UGuide: User-Guided Discovery of FD-Detectable Errors


SLiMFast: Guaranteed Results for Data Fusion and Source Reliability

3.17 Spatial and Multidimensional Data

Utility-Aware Ridesharing on Road Networks

Abstract
Ridesharing enables drivers to share any empty seats in their vehicles with riders to improve the efficiency of transportation for the benefit of both drivers and riders. Different from existing studies in ridesharing that focus on minimizing the travel costs of vehicles, we consider that the satisfaction of riders (the utility values) is more important nowadays. Thus, we formulate the problem of utility-aware ridesharing on road networks (URR) with the goal of providing the optimal rider schedules for vehicles to maximize the overall utility, subject to spatial-temporal and capacity constraints. To assign a new rider to a given vehicle, we propose an efficient algorithm with a minimum increase in travel cost without reordering the existing schedule of the vehicle. We prove that the URR problem is NP-hard by reducing it from the 0-1 Knapsack problem and it is unlikely to be approximated within any constant factor in polynomial time through a reduction from the DENSEST k-SUBGRAPH problem. Therefore, we propose three efficient approximate algorithms, including a bilateral arrangement algorithm, an efficient greedy algorithm and a grouping-based scheduling algorithm, to assign riders to suitable vehicles with a high overall utility. Through extensive experiments, we demonstrate the efficiency and effectiveness of our URR approaches on both real and synthetic data sets.
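The insertion step mentioned in the abstract (assigning a new rider with a minimum increase in travel cost, without reordering the vehicle's existing schedule) can be sketched by trying every order-preserving position pair for the pickup and drop-off. This is a hypothetical illustration, not the paper's algorithm: straight-line distance stands in for road-network distance, and `best_insertion` with its brute-force enumeration is an invented name.

```python
# Sketch: insert a new rider's pickup and drop-off into an existing schedule
# without reordering the stops already there, minimizing the extra travel cost.
from math import hypot

def dist(a, b):
    return hypot(a[0] - b[0], a[1] - b[1])

def route_cost(stops):
    return sum(dist(stops[i], stops[i + 1]) for i in range(len(stops) - 1))

def best_insertion(schedule, pickup, dropoff):
    """Return (min_extra_cost, new_schedule) over all order-preserving insertions."""
    base = route_cost(schedule)
    best = None
    n = len(schedule)
    # pickup goes at index i, dropoff at index j >= i (i.e., after the pickup).
    for i in range(1, n + 1):          # index 0 is the vehicle's current position
        for j in range(i, n + 1):
            cand = schedule[:i] + [pickup] + schedule[i:j] + [dropoff] + schedule[j:]
            extra = route_cost(cand) - base
            if best is None or extra < best[0]:
                best = (extra, cand)
    return best

schedule = [(0, 0), (10, 0)]           # vehicle position, then one planned stop
extra, new_sched = best_insertion(schedule, pickup=(2, 0), dropoff=(6, 0))
print(round(extra, 6), new_sched)      # detour is 0: both stops lie on the planned route
```

The enumeration is quadratic in the schedule length, which is why keeping the existing order fixed (rather than re-optimizing the whole schedule) makes per-rider assignment cheap.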


Distance Oracle on Terrain Surface


Efficient Computation of Top-k Frequent Terms over Spatio-temporal Ranges

Abstract
The wide availability of tracking devices has drastically increased the role of geolocation in social networks, resulting in new commercial applications; for example, marketers can identify current trending topics within a region of interest and focus their products accordingly. In this paper we study a basic analytics query on geotagged data, namely: given a spatiotemporal region, find the most frequent terms among the social posts in that region. While there has been prior work on keyword search on spatial data (find the objects nearest to the query point that contain the query keywords), and on group keyword search on spatial data (retrieving groups of objects), our problem is different in that it returns keywords and aggregated frequencies as output, instead of having the keyword as input. Moreover, we differ from works addressing the streamed version of this query in that we operate on large, disk resident data and we provide exact answers. We propose an index structure and algorithms to efficiently answer such top-k spatiotemporal range queries, which we refer to as Top-k Frequent Spatiotemporal Terms (kFST) queries. Our index structure employs an R-tree augmented by top-k sorted term lists (STLs), where a key challenge is to balance the size of the index to achieve faster execution and smaller space requirements. We theoretically study and experimentally validate the ideal length of the stored term lists, and perform detailed experiments to evaluate the performance of the proposed methods compared to baselines on real datasets.

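The query itself (the k most frequent terms within a range) is easy to state in code. The sketch below is a deliberately flat stand-in: a dictionary of grid cells with per-cell term counters replaces the paper's R-tree with sorted term lists (STLs), and all the data is made up.

```python
# Toy version of a kFST query: aggregate term frequencies from the index cells
# that fall inside the query range and report the k most frequent terms.
from collections import Counter

# cell -> term frequencies, standing in for per-node sorted term lists (illustrative data)
cells = {
    (0, 0): Counter({"coffee": 5, "rain": 2}),
    (0, 1): Counter({"coffee": 1, "game": 4}),
    (1, 1): Counter({"game": 3, "rain": 1}),
}

def topk_terms(x_range, y_range, k):
    total = Counter()
    for (x, y), freqs in cells.items():
        if x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]:
            total += freqs
    return total.most_common(k)

print(topk_terms((0, 1), (0, 1), k=2))  # game: 4+3=7, coffee: 5+1=6
```

The point the paper optimizes is exactly the expensive part of this sketch: avoiding a full merge of every cell's term list by keeping only a bounded top-k prefix per index node.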


Scaling Locally Linear Embedding

Abstract
Locally Linear Embedding (LLE) is a popular approach to dimensionality reduction as it can effectively represent nonlinear structures of high-dimensional data. For dimensionality reduction, it computes a nearest neighbor graph from a given dataset where edge weights are obtained by applying the Lagrange multiplier method, and it then computes eigenvectors of the LLE kernel where the edge weights are used to obtain the kernel. Although LLE is used in many applications, its computation cost is significantly high. This is because, in obtaining edge weights, its computation cost is cubic in the number of edges to each data point. In addition, the computation cost in obtaining the eigenvectors of the LLE kernel is cubic in the number of data points. Our approach, Ripple, is based on two ideas: (1) it incrementally updates the edge weights by exploiting the Woodbury formula and (2) it efficiently computes eigenvectors of the LLE kernel by exploiting the LU decomposition-based inverse power method. Experiments show that Ripple is significantly faster than the original approach of LLE by guaranteeing the same results of dimensionality reduction.

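Ripple's first idea rests on the Woodbury formula for updating a matrix inverse incrementally. Its rank-1 special case, the Sherman-Morrison formula, can be checked on a tiny 2x2 example in plain Python; this only demonstrates the identity the paper exploits, not Ripple itself.

```python
# Sherman-Morrison (rank-1 Woodbury): if B = A + u v^T, then
#   B^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u),
# so an inverse can be refreshed without recomputing it from scratch.

def inv2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def sherman_morrison(a_inv, u, v):
    au = matvec(a_inv, u)                      # A^{-1} u
    va = matvec(list(zip(*a_inv)), v)          # v^T A^{-1} (as a row vector)
    denom = 1 + va[0] * u[0] + va[1] * u[1]    # 1 + v^T A^{-1} u
    return [[a_inv[i][j] - au[i] * va[j] / denom for j in range(2)] for i in range(2)]

A = [[4.0, 1.0], [2.0, 3.0]]
u, v = [1.0, 0.0], [0.0, 1.0]                  # rank-1 update: adds 1 to A[0][1]
B = [[4.0, 2.0], [2.0, 3.0]]                   # A + u v^T

updated = sherman_morrison(inv2(A), u, v)
direct = inv2(B)
print(all(abs(updated[i][j] - direct[i][j]) < 1e-9 for i in range(2) for j in range(2)))  # True
```

For LLE, where each arriving edge changes the local Gram matrix by a low-rank term, this kind of update replaces a cubic-cost re-inversion with a cheap correction.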


Dynamic Density Based Clustering

Abstract
Dynamic clustering—how to efficiently maintain data clusters along with updates in the underlying dataset—is a difficult topic. This is especially true for density-based clustering, where objects are aggregated based on transitivity of proximity, under which deciding the cluster(s) of an object may require the inspection of numerous other objects. The phenomenon is unfortunate, given the popular usage of this clustering approach in many applications demanding data updates. Motivated by the above, we investigate the algorithmic principles for dynamic clustering by DBSCAN, a successful representative of density-based clustering, and ρ-approximate DBSCAN, proposed to bring down the computational hardness of the former on static data. Surprisingly, we prove that the ρ-approximate version suffers from the very same hardness when the dataset is fully dynamic, namely, when both insertions and deletions are allowed. We also show that this issue goes away as soon as tiny further relaxation is applied, yet still ensuring the same quality—known as the “sandwich guarantee”—of ρ-approximate DBSCAN. Our algorithms guarantee near-constant update processing, and outperform existing approaches by a factor over two orders of magnitude.



Extracting Top-K Insights from Multi-dimensional Data

Abstract
OLAP tools have been extensively used by enterprises to make better and faster decisions. Nevertheless, they require users to specify group-by attributes and know precisely what they are looking for. This paper takes the first attempt towards automatically extracting top-k insights from multi-dimensional data. This is useful not only for non-expert users, but also reduces the manual effort of data analysts. In particular, we propose the concept of insight which captures interesting observation derived from aggregation results in multiple steps (e.g., rank by a dimension, compute the percentage of measure by a dimension). An example insight is: “Brand B’s rank (across brands) falls along the year, in terms of the increase in sales”. Our problem is to compute the top-k insights by a score function. It poses challenges on (i) the effectiveness of the result and (ii) the efficiency of computation. We propose a meaningful scoring function for insights to address (i). Then, we contribute a computation framework for top-k insights, together with a suite of optimization techniques (i.e., pruning, ordering, specialized cube, and computation sharing) to address (ii). Our experimental study on both real data and synthetic data verifies the effectiveness and efficiency of our proposed solution.



QUILTS: Multidimensional Data Partitioning Framework Based on Query-Aware and Skew-Tolerant Space-Filling Curves

Abstract
Recently, massive data management plays an increasingly important role in data analytics because data access is a major bottleneck. Data skipping is a promising technique to reduce the number of data accesses. Data skipping partitions data into pages and accesses only pages that contain data to be retrieved by a query. Therefore, effective data partitioning is required to minimize the number of page accesses. However, it is an NP-hard problem to obtain optimal data partitioning given query pattern and data distribution.

We propose a framework that involves a multidimensional indexing technique based on a space-filling curve. A space-filling curve is a way to define which portion of data can be stored in the same page. Therefore, the problem can be interpreted as selecting a curve that distributes data to be accessed by a query to minimize the number of page accesses. To solve this problem, we analyzed how different space-filling curves affect the number of page accesses. We found that it is critical for a curve to fit a query pattern and be robust against any data distribution. We propose a cost model for measuring how well a space-filling curve fits a given query pattern and tolerates data skew. Also we propose a method for designing a query-aware and skew-tolerant curve for a given query pattern.

We prototyped our framework using the defined query-aware and skew-tolerant curve. We conducted experiments using a skew data set, and confirmed that our framework can reduce the number of page accesses by an order of magnitude for data warehousing (DWH) and geographic information systems (GIS) applications with real-world data.


3.18 Optimization and Main Memory

Optimizing Iceberg Queries with Complex Joins

Abstract
Iceberg queries, commonly used for decision support, find groups whose aggregate values are above or below a threshold. In practice, iceberg queries are often posed over complex joins that are expensive to evaluate. This paper proposes a framework for combining a number of techniques—a-priori, memoization, and pruning—to optimize iceberg queries with complex joins. A-priori pushes partial GROUP BY and HAVING condition before a join to reduce its input size. Memoization caches and reuses join computation results. Pruning uses cached results to infer that certain tuples cannot contribute to the final query result, and short-circuits join computation. We formally derive conditions for correctly applying these techniques. Our practical rewrite algorithm produces highly efficient SQL that can exploit combinations of optimization opportunities in ways previously not possible. We evaluate our PostgreSQL-based implementation experimentally and show that it outperforms both baseline PostgreSQL and a commercial database system.

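The a-priori technique (pushing a partial GROUP BY below the join) can be illustrated for COUNT(*) with plain dictionaries: aggregate each join input by (group attribute, join key) first, then combine the partial counts across the join. The relations and threshold below are made up, and the paper's memoization and pruning techniques are not shown.

```python
# Toy iceberg query with the partial-aggregation push-down: for COUNT(*),
# a group's final count is the sum over join keys of the product of the
# two sides' partial counts.
from collections import Counter

R = [("g1", "k1"), ("g1", "k1"), ("g2", "k2")]   # (group attribute, join key)
S = [("k1",), ("k1",), ("k2",)]                   # (join key,)

def iceberg_count(threshold):
    # Partial aggregation below the join (the "a-priori" push-down):
    r_part = Counter(R)                           # (group, key) -> count
    s_part = Counter(k for (k,) in S)             # key -> count
    totals = Counter()
    for (g, k), c in r_part.items():
        totals[g] += c * s_part.get(k, 0)         # join on key, multiply counts
    return {g: c for g, c in totals.items() if c > threshold}

print(iceberg_count(3))  # {'g1': 4}: 2 R-tuples x 2 S-tuples on k1
```

The join now runs over one row per (group, key) pair instead of one row per tuple, which is where the input-size reduction the abstract describes comes from.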


The Dynamic Yannakakis Algorithm: Compact and Efficient Query Processing Under Updates

Abstract
Modern computing tasks such as real-time analytics require refresh of query results under high update rates. Incremental View Maintenance (IVM) approaches this problem by materializing results in order to avoid recomputation. IVM naturally induces a trade-off between the space needed to maintain the materialized results and the time used to process updates. In this paper, we show that the full materialization of results is a barrier for more general optimization strategies. In particular, we present a new approach for evaluating queries under updates. Instead of the materialization of results, we require a data structure that allows: (1) linear time maintenance under updates, (2) constant-delay enumeration of the output, (3) constant-time lookups in the output, while (4) using only linear space in the size of the database. We call such a structure a Dynamic Constant-delay Linear Representation (DCLR) for the query. We show that Dyn, a dynamic version of the Yannakakis algorithm, yields DCLRs for the class of free-connex acyclic CQs. We show that this is optimal in the sense that no DCLR can exist for CQs that are not free-connex acyclic. Moreover, we identify a sub-class of queries for which Dyn features constant-time update per tuple and show that this class is maximal. Finally, using the TPC-H and TPC-DS benchmarks, we experimentally compare Dyn and a higher-order IVM (HIVM) engine. Our approach is not only more efficient in terms of memory consumption (as expected), but is also consistently faster in processing updates.



Revisiting Reuse in Main Memory Database Systems

Abstract
Reusing intermediates in databases to speed-up analytical query processing was studied in prior work. Existing solutions require intermediate results of individual operators to be materialized using materialization operators. However, inserting such materialization operations into a query plan not only incurs additional execution costs but also often eliminates important cache- and register-locality opportunities, resulting in even higher performance penalties. This paper studies a novel reuse model for intermediates, which caches internal physical data structures materialized during query processing (due to pipeline breakers) and externalizes them so that they become reusable for upcoming operations. We focus on hash tables, the most commonly used internal data structure in main memory databases to perform join and aggregation operations. As queries arrive, our reuse-aware optimizer reasons about the reuse opportunities for hash tables, employing cost models that take into account hash table statistics together with the CPU and data movement costs within the cache hierarchy. Experimental results, based on our prototype implementation, demonstrate performance gains of 2× for typical analytical workloads with no additional overhead for materializing intermediates.



Leveraging Re-costing for Online Optimization of Parameterized Queries with Guarantees


Handling Environments in a Nested Relational Algebra with Combinators and an Implementation in a Verified Query Compiler


From In-Place Updates to In-Place Appends: Revisiting Out-of-Place Updates on Flash

Database upgrades and patching are part of our routine work. Under normal circumstances, two factors must be considered: the downtime window and the rollback plan. As far as Oracle is concerned, even the simplest update can hardly achieve "zero downtime." The rollback plan exists so that, once a problem is found in the new version, we can quickly revert to the original version and keep the application available.

Currently, Oracle recommends two approaches for large-scale upgrades: In-Place and Out-of-Place. With an In-Place upgrade, the upgrade is performed directly in the existing Database Home directory. Out-of-Place instead uses a new Oracle Database Home directory. Compared with the In-Place strategy, Out-of-Place consumes more disk space.

However, the benefits of Out-of-Place are also fairly clear: rollback is more convenient, and it also has a clear advantage in terms of downtime.

3.19 Privacy

Pufferfish Privacy Mechanisms for Correlated Data


Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics


Pythia: Data Dependent Differentially Private Algorithm Selection


Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics

3.20 Crowdsourcing

Crowdsourced Top-k Queries by Confidence-Aware Pairwise Judgments

Abstract
Crowdsourced query processing is an emerging processing technique that tackles computationally challenging problems by human intelligence. The basic idea is to decompose a computationally challenging problem into a set of human friendly microtasks (e.g., pairwise comparisons) that are distributed to and answered by the crowd. The solution of the problem is then computed (e.g., by aggregation) based on the crowdsourced answers to the microtasks. In this work, we attempt to revisit the crowdsourced processing of the top-k queries, aiming at (1) securing the quality of crowdsourced comparisons by a certain confidence level and (2) minimizing the total monetary cost. To secure the quality of each paired comparison, we employ two statistical tools, Student's t-distribution estimation and Stein's estimation, to estimate the confidence interval of the underlying mean value, which is then used to draw a conclusion to the comparison. Based on the pairwise comparison process, we attempt to minimize the monetary cost of the top-k processing within a Select-Partition-Rank framework. Our experiments, conducted on four real datasets, demonstrate that our stochastic method outperforms other existing top-k processing techniques by a visible difference.

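The confidence check at the heart of each pairwise comparison can be sketched as follows. Assumptions: votes are coded +1/-1, the two-sided 95% t critical values are hard-coded for a few degrees of freedom (Python's standard library has no t-distribution), and the paper's Stein's estimation is omitted.

```python
# Sketch: keep collecting worker votes (+1 for "A beats B", -1 otherwise) until
# a Student-t confidence interval for the mean vote excludes zero.
from math import sqrt
from statistics import mean, stdev

T_CRIT_95 = {4: 2.776, 9: 2.262, 19: 2.093}   # two-sided 95% t values by df (rounded)

def confident_winner(votes):
    """Return +1/-1 once the 95% CI for the mean vote excludes 0, else None."""
    n = len(votes)
    if n < 2 or (n - 1) not in T_CRIT_95:
        return None
    m, s = mean(votes), stdev(votes)
    half = T_CRIT_95[n - 1] * s / sqrt(n)      # half-width of the CI
    if m - half > 0:
        return +1
    if m + half < 0:
        return -1
    return None                                # interval straddles 0: ask more workers

print(confident_winner([+1] * 9 + [-1]))       # prints 1: nine of ten workers favor A
```

Stopping each comparison as soon as the interval excludes zero is what lets the framework bound the quality of a judgment while spending as few paid microtasks as possible on it.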


Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services


CrowdDQS: Dynamic Question Selection in Crowdsourcing Systems


CDB: Optimizing Queries with Crowd-Based Selections and Joins
