(18th USENIX Conference on File and Storage Technologies)
https://www.usenix.org/conference/fast20/technical-sessions
Only the table of contents and abstracts are briefly compiled here, for reference only.
For more content, see
https://bbs.geekscholar.net/d/19-fast-2020
Cloud Storage
[2020/02/17] MAPX: Controlled Data Migration in the Expansion of Decentralized Object-Based Storage Systems
[2020/02/17] Lock-Free Collaboration Support for Cloud Storage Services with Operation Inference and Transformation
[2020/02/24] POLARDB Meets Computational Storage: Efficiently Support Analytical Workloads in Cloud-Native Relational Database
File Systems
[2020/02/24] Carver: Finding Important Parameters for Storage System Tuning
[2020/02/24] Read as Needed: Building WiSER, a Flash-Optimized Search Engine
[2020/02/24*] How to Copy Files
HPC Storage
[2020/02/24*] Uncovering Access, Reuse, and Sharing Characteristics of I/O-Intensive Files on Large-Scale Production HPC Systems
[2020/02/24*] GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems
SSD and Reliability
[2020/02/27*] Scalable Parallel Flash Firmware for Many-core Architectures
[2020/02/27] A Study of SSD Reliability in Large Scale Enterprise Storage Deployments
[2020/02/27] Making Disk Failure Predictions SMARTer!
Performance
[2020/03/02*] An Empirical Guide to the Behavior and Use of Scalable Persistent Memory
[2020/03/02] DC-Store: Eliminating Noisy Neighbor Containers using Deterministic I/O Performance and Resource Isolation
[2020/03/02] GoSeed: Generating an Optimal Seeding Plan for Deduplicated Storage
Key Value Storage
[2020/02/29] Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook
[2020/03/01*] FPGA-Accelerated Compactions for LSM-based Key-Value Store
[2020/03/01*] HotRing: A Hotspot-Aware In-Memory Key-Value Store
Caching
[2020/03/03*] BCW: Buffer-Controlled Writes to HDDs for SSD-HDD Hybrid Storage Server
[2020/03/05] InfiniCache: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache
[2020/03/07] Quiver: An Informed Storage Cache for Deep Learning
Consistency and Reliability
[2020/02/29] CRaft: An Erasure-coding-supported Version of Raft for Reducing Storage Cost and Network Cost
[2020/02/29*] Hybrid Data Reliability for Emerging Key-Value Storage Devices
[2020/02/27] Strong and Efficient Consistency with Consistency-Aware Durability
Authors: Li Wang, Didi Chuxing; Yiming Zhang, NiceX Lab, NUDT; Jiawei Xu and Guangtao Xue, SJTU
Abstract: Data placement is critical for the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a
decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the
benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when expanding the clusters, which will cause significant performance degradation when the expansion is nontrivial.
This paper presents MAPX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) for controlled data migration in cluster expansions. Each expansion is viewed as a new layer of the CRUSH map represented by a virtual node beneath the CRUSH root. MAPX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MAPX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. For example, we apply MAPX to Ceph-RBD by extending the RBD metadata structure to maintain and retrieve approximate object creation times at the granularity of expansions layers. Experimental results show that the MAPX-based migration-free system outperforms the CRUSH-based system (which is busy in migrating objects after expansions) by up to 4.25× in the tail latency.
Memo
Data placement is critical for the scalability of decentralized object-based storage systems.
The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. CRUSH-based storage systems enjoy the benefits of decentralization, such as high scalability, robustness, and performance, but suffer from uncontrolled data migration when the cluster is expanded, which degrades performance.
This paper presents MAPX, an extension to CRUSH.
MAPX uses an extra time-dimension mapping (from object creation times to cluster expansion times) for controlled data migration during cluster expansions.
Each cluster expansion is viewed as a new layer of the CRUSH map, represented by a virtual node beneath the CRUSH root; MAPX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs).
MAPX is applicable to a wide variety of object-based storage scenarios in which object timestamps can be maintained as higher-level metadata. For example, the authors apply MAPX to Ceph-RBD by extending the RBD metadata structure to maintain and retrieve approximate object creation times at the granularity of expansion layers.
Experimental results show that the MAPX-based migration-free system outperforms the CRUSH-based system (which is busy migrating objects after expansions) by up to 4.25× in tail latency.
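To make the time-dimension idea concrete, here is a minimal sketch (my own simplification, not MAPX's algorithm: real CRUSH uses weighted pseudo-random placement over a device hierarchy, not the flat modular hashing below). Objects are mapped to the expansion layer that existed when they were created, so adding a new layer never moves existing objects.

```python
import hashlib
from bisect import bisect_right

def choose_layer(object_ctime, expansion_times):
    """Map an object's creation time to the expansion layer that existed when
    the object was created (assumption: layers are identified by the sorted
    timestamps at which expansions happened)."""
    return bisect_right(expansion_times, object_ctime) - 1

def place(object_id, object_ctime, layers, expansion_times, replicas=3):
    """Hypothetical placement: pick the layer by creation time, then hash the
    object onto devices of that layer only, so existing objects stay put when
    a new layer (expansion) is added."""
    devices = layers[choose_layer(object_ctime, expansion_times)]
    h = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
    return [devices[(h + i) % len(devices)] for i in range(replicas)]

# Example: two expansions; an object created before the second expansion keeps
# mapping to the first layer's devices even after the cluster grows.
expansion_times = [0, 100]
layers = [["osd.0", "osd.1", "osd.2"], ["osd.3", "osd.4", "osd.5"]]
print(place("rbd_obj_42", object_ctime=50, layers=layers,
            expansion_times=expansion_times))
```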
Misc notes
Is the MAPX-based system really "migration-free"?
What is CRUSH?
Something like P2P consistent hashing??? Apparently it was designed for Ceph. (I have not really read the Ceph papers.)
What is Ceph-RBD?
In the end, with the extra time-dimension mapping, what happens to data migration?
Authors: Jian Chen, Minghao Zhao, and Zhenhua Li, Tsinghua University; Ennan Zhai, Alibaba Group Inc.; Feng Qian, University of Minnesota - Twin Cities;
Hongyi Chen, Tsinghua University; Yunhao Liu, Michigan State University & Tsinghua University; Tianyin Xu, University of Illinois Urbana-Champaign
Abstract: This paper studies how today’s cloud storage services support collaborative file editing. As a tradeoff for transparency/user-friendliness,
they do not ask collaborators to use version control systems but instead implement their own heuristics for handling conflicts, which however often lead
to unexpected and undesired experiences. With measurements and reverse engineering, we unravel a number of their design and implementation issues as the root causes of poor experiences. Driven by the findings, we propose to reconsider the collaboration support of cloud storage services from a novel perspective of operations without using any locks. To enable this idea, we design intelligent approaches to the inference and transformation of users’ editing operations, as well as optimizations to the maintenance of files’ historic versions. We build an open-source system UFC2 (User-Friendly
Collaborative Cloud) to embody our design, which can avoid most (98%) conflicts with little (2%) time overhead.
Memo
Studies how today's cloud storage services support collaborative file editing.
Rather than asking collaborators to use version control systems, these services implement their own heuristics for handling editing conflicts, which often leads to unexpected results and a poor user experience. This is the tradeoff between transparency and user-friendliness.
Through measurements and reverse engineering, the authors identified a number of design and implementation issues that are the root causes of the poor experience.
Driven by these findings, the authors reconsider the collaboration support of cloud storage services from the perspective of operations, without using any locks.
To enable this idea, they design intelligent approaches to infer and transform users' editing operations, as well as optimizations to the maintenance of files' historic versions.
They build and evaluate an open-source system, UFC2 (User-Friendly Collaborative Cloud), that embodies these methods; it avoids most (98%) conflicts with little (2%) time overhead.
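The "inference and transformation of editing operations" is in the spirit of operational transformation. Below is a minimal, insert-only OT sketch (my illustration of the general idea, not UFC2's actual design) showing how two concurrent edits can be merged without locking the file.

```python
from dataclasses import dataclass

@dataclass
class Insert:
    pos: int
    text: str

def transform(op, against):
    """Adjust `op` so it can be applied after `against` has already been
    applied (insert-only case). This is the flavor of operation transformation
    that lock-free collaboration relies on; UFC2's real design is far richer."""
    if against.pos <= op.pos:
        return Insert(op.pos + len(against.text), op.text)
    return op

def apply(doc, op):
    return doc[:op.pos] + op.text + doc[op.pos:]

# Two users edit "hello world" concurrently, without any lock:
doc = "hello world"
a = Insert(5, ",")    # user A inserts a comma after "hello"
b = Insert(11, "!")   # user B appends "!"

replica_1 = apply(apply(doc, b), transform(a, b))   # B's edit first, then A's (transformed)
replica_2 = apply(apply(doc, a), transform(b, a))   # A's edit first, then B's (transformed)
print(replica_1, "|", replica_2)                    # both converge to "hello, world!"
```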
Misc notes
Authors: Wei Cao, Alibaba; Yang Liu, ScaleFlux; Zhushi Cheng, Alibaba; Ning Zheng, ScaleFlux; Wei Li and Wenjie Wu, Alibaba; Linqiang Ouyang, ScaleFlux;
Peng Wang and Yijing Wang, Alibaba; Ray Kuan, ScaleFlux; Zhenjun Liu and Feng Zhu, Alibaba; Tong Zhang, ScaleFlux
Abstract: This paper reports the deployment of computational storage drives in Alibaba Cloud, aiming to enable cloud-native relational database cost-effectively support analytical workloads. With its compute-storage decoupled architecture, cloud-native relational database must proactively
pushdown certain data-intensive tasks (e.g., table scan) from front-end database nodes to back-end storage nodes in order to effectively support analytical workloads. This however makes it a challenge to maintain the cost effectiveness of storage nodes. The emerging computational storage opens a new opportunity to address this challenge: By replacing commodity SSDs with computational storage drives, storage nodes can leverage the in-storage computing power to much more efficiently serve table scans. Practical implementation of this simple idea is non-trivial and demands cohesive innovations across the software (i.e.,database, filesystem and I/O) and hardware (i.e., computational storage drive) layers. This paper reports a holistic implementation for Alibaba cloud-native relational database POLARDB and its deployment in Alibaba Cloud. This paper discusses the major implementation challenges, and presents the design techniques that have been developed to tackle these challenges. To the best of our knowledge, this is the first real-world deployment of cloud-native databases with computational storage drives ever reported in the open literature.
Memo
A report on the deployment of computational storage drives in Alibaba Cloud, aiming to let a cloud-native relational database support analytical workloads cost-effectively.
With its compute-storage decoupled architecture, a cloud-native relational database must push certain data-intensive tasks (e.g., table scans) down from front-end database nodes to back-end storage nodes. This makes it challenging to keep the storage nodes cost-effective.
By replacing commodity SSDs with computational storage drives, storage nodes can leverage in-storage computing power to serve table scans much more efficiently. The idea is simple, but a practical implementation is non-trivial and requires cohesive innovations across the software (i.e., database, filesystem, I/O) and hardware (i.e., computational storage drive) layers.
This paper presents a holistic implementation for Alibaba's cloud-native relational database POLARDB and its deployment in Alibaba Cloud.
Misc notes
Searching for POLARDB turned up an official blog post from April 2018. It looks quite detailed (I have not read it yet).
“PolarDB: Deep Dive on Alibaba Cloud’s Next-Generation Database”
https://www.alibabacloud.com/blog/deep-dive-on-alibaba-clouds-next-generation-database_578138
TODO: read it
“computational storage drive”
I did not know what this was, so I looked it up: it is a device that can process data on the storage side instead of transferring it to main memory and processing it on the CPU. Is it similar to offloading processing to a NIC?
Authors: Zhen Cao, Stony Brook University; Geoff Kuenning, Harvey Mudd College; Erez Zadok, Stony Brook University
Abstract: Storage systems usually have many parameters that affect their behavior. Tuning those parameters can provide significant gains in performance. Alas, both manual and automatic tuning methods struggle due to the large number of parameters and exponential number of possible configurations. Since previous research has shown that some parameters have greater performance impact than others, focusing on a smaller number of more important parameters can speed up auto-tuning systems because they would have a smaller state space to explore. In this paper, we propose Carver, which uses (1) a variance-based metric to quantify storage parameters’ importance, (2) Latin Hypercube Sampling to sample huge parameter spaces; and (3) a greedy but efficient parameter-selection algorithm that can identify important parameters. We evaluated Carver on datasets consisting of more than 500,000 experiments on 7 file systems, under 4 representative workloads. Carver successfully identified important parameters for all file systems and showed that importance varies with different workloads. We demonstrated that Carver was able to identify a near-optimal set of important
parameters in our datasets. We showed Carver’s efficiency by testing it with a small fraction of our dataset; it was able to identify the same set of important parameters with as little as 0.4% of the whole dataset.
Memo
Storage systems have many parameters that affect their behavior, and tuning them can yield large performance gains. Because of the large number of parameters and the exponential number of possible configurations, both manual and automatic tuning struggle.
Previous research has shown that some parameters contribute more to performance than others, so focusing on a small number of important parameters (a smaller search space) can speed up auto-tuning systems.
This paper proposes Carver, which
(1) uses a variance-based metric to quantify the importance of storage parameters,
(2) uses Latin Hypercube Sampling to sample the huge parameter space, and
(3) uses a greedy but efficient parameter-selection algorithm to identify the important parameters.
Evaluation of Carver:
Carver identified important parameters for all file systems and showed that importance varies with the workload.
Misc notes
What is Latin Hypercube Sampling?
https://en.wikipedia.org/wiki/Latin_hypercube_sampling
"A Latin square is an n×n array filled with n different symbols, each occurring exactly n times, with each symbol appearing exactly once in each row and each column. A Latin hypercube is the generalization of the Latin square to multiple dimensions: each axis-aligned hyperplane contains at most one sample."
The experimental design method based on this is called Latin Hypercube Sampling (LHS).
I wonder which parameters turn out to be important and how exactly LHS is used.
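For intuition, here is a small, self-contained Latin Hypercube Sampling routine (a generic textbook-style sketch, not Carver's code): each parameter's normalized range is split into as many strata as samples, and every stratum is used exactly once per parameter.

```python
import random

def latin_hypercube_sample(n_samples, n_dims, rng=random.Random(0)):
    """Minimal LHS sketch: per dimension, [0, 1) is cut into n_samples equal
    strata, and each stratum is used exactly once, so samples spread evenly."""
    samples = [[0.0] * n_dims for _ in range(n_samples)]
    for d in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)                     # random pairing of strata across dimensions
        for i, s in enumerate(strata):
            samples[i][d] = (s + rng.random()) / n_samples  # random point inside the stratum
    return samples

# Example: 5 configurations over 3 (normalized) storage parameters; map each
# coordinate back to a concrete parameter range before running experiments.
for point in latin_hypercube_sample(5, 3):
    print([round(x, 2) for x in point])
```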
Authors: Jun He and Kan Wu, University of Wisconsin—Madison; Sudarsun Kannan, Rutgers University; Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau,University of Wisconsin—Madison
Abstract: We describe WiSER, a clean-slate search engine designed to exploit high-performance SSDs with the philosophy “read as needed”. WiSER utilizes many techniques to deliver high throughput and low latency with a relatively small amount of main memory; the techniques include an optimized data layout, a novel two-way cost-aware Bloom filter, adaptive prefetching, and space-time trade-offs. In a system with memory that is significantly smaller than the working set, these techniques increase storage space usage (up to 50%), but reduce read amplification by up to 3x, increase query throughput by up to 2.7x, and reduce latency by 16x when compared to the state-of-the-art Elasticsearch. We believe that the philosophy of “read as needed” can be applied to more applications as the read performance of storage devices keeps improving.
Memo
This paper presents WiSER, a clean-slate search engine designed to exploit high-performance SSDs with the philosophy of "read as needed".
WiSER uses many techniques to deliver high throughput and low latency with a relatively small amount of main memory. These techniques include an optimized data layout, a novel two-way cost-aware Bloom filter, adaptive prefetching, and space-time trade-offs.
In a system whose memory is much smaller than the working set, these techniques increase storage space usage (by up to 50%), but reduce read amplification by up to 3x, increase query throughput by up to 2.7x, and reduce latency by 16x compared with the state-of-the-art Elasticsearch.
The authors believe the "read as needed" philosophy can be applied to more applications as the read performance of storage devices keeps improving.
Misc notes
What does "read as needed" mean? Judging from techniques such as the Bloom filter and adaptive prefetching, the idea seems to be to read only the data that is actually necessary, even when prefetching.
What is the novel two-way cost-aware Bloom filter?
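As a reference point for the question above, here is a plain Bloom filter sketch (the standard structure only; the paper's two-way cost-aware variant is a different, more elaborate design). A miss in the filter lets the engine skip the corresponding read entirely, which is the "read as needed" effect.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter for intuition only."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits)

    def _positions(self, key):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means definitely absent, so the disk read can be skipped;
        # True may be a false positive.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("term:storage")
print(bf.might_contain("term:storage"), bf.might_contain("term:missing"))
```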
Authors: Yang Zhan, The University of North Carolina at Chapel Hill and Huawei; Alexander Conway, Rutgers University; Yizheng Jiao and Nirjhar
Mukherjee, The University of North Carolina at Chapel Hill; Ian Groombridge, Pace University; Michael A. Bender, Stony Brook University; Martin
Farach-Colton, Rutgers University; William Jannen, Williams College; Rob Johnson, VMWare Research; Donald E. Porter, The University of North Carolina at
Chapel Hill; Jun Yuan, Pace University
Abstract: Making logical copies, or clones, of files and directories is critical to many real-world applications and workflows, including backups, virtual machines, and containers. An ideal clone implementation meets the following performance goals: (1) creating the clone has low latency; (2) reads are fast in all versions (i.e., spatial locality is always maintained, even after modifications); (3) writes are fast in all versions; (4) the overall system is space efficient. Implementing a clone operation that realizes all four properties, which we call a nimble clone, is a long-standing open problem.
This paper describes nimble clones in BetrFS, an open-source, full-path-indexed, and write-optimized file system. The key observation behind our work is that standard copy-on-write heuristics can be too coarse to be space efficient, or too fine-grained to preserve locality. On the other hand, a write-optimized key-value store, as used in BetrFS or an LSM-tree, can decouple the logical application of updates from the granularity at which data is physically copied. In our write-optimized clone implementation, data sharing among clones is only broken when a clone has changed enough to warrant making a copy, a policy we call copy-on-abundant-write.
We demonstrate that the algorithmic work needed to batch and amortize the cost of BetrFS clone operations does not erode the performance advantages of baseline BetrFS; BetrFS performance even improves in a few cases. BetrFS cloning is efficient; for example, when using the clone operation for container creation, BetrFS outperforms a simple recursive copy by up to two orders-of-magnitude and outperforms file systems that have specialized LXC backends by 3-4×.
Memo
Making logical copies, or clones, of files and directories is a critical part of many real-world applications and workflows (backups, virtual machines, containers, and so on).
An ideal clone implementation meets the following performance goals:
(1) creating the clone has low latency;
(2) reads are fast in all versions (i.e., spatial locality is maintained even after modifications);
(3) writes are fast in all versions;
(4) the overall system is space-efficient.
Implementing a clone operation that satisfies all four properties (a "nimble clone") has long been an open problem.
This paper describes nimble clones in BetrFS, an open-source, full-path-indexed, write-optimized file system. The key observation behind the work is that standard copy-on-write heuristics can be too coarse-grained to be space-efficient, or too fine-grained to preserve locality. On the other hand, a write-optimized key-value store (as used in BetrFS, or an LSM-tree) can decouple the logical application of updates from the granularity at which data is physically copied. In the write-optimized clone implementation, data sharing between clones is broken only when a clone has changed enough to warrant making a copy, a policy the authors call copy-on-abundant-write.
The paper shows that the algorithmic work needed to batch and amortize the cost of BetrFS clone operations does not erode the performance advantages of baseline BetrFS; in a few cases it even improves BetrFS performance. BetrFS cloning is efficient: for example, when using the clone operation for container creation, BetrFS outperforms a simple recursive copy by up to two orders of magnitude and outperforms file systems with specialized LXC backends by 3-4×.
Misc notes
What is BetrFS?
https://www.betrfs.org/
“The Bε-tree File System, or BetrFS, is an in-kernel file system that uses Bε trees to organize on-disk storage. Bε trees are a write-optimized dictionary, and offer the same asymptotic behavior for sequential I/O and point queries as a B-tree. The advantage of a Bε tree is that it can also ingest small, random writes 1-2 orders of magnitude faster than B-trees and other standard on-disk data structures.”
“The goal of BetrFS is to realize performance that strictly dominates the performance of current, general-purpose file systems.”
The authors are BetrFS developers.
What is an "abundant write"?
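My reading of "copy-on-abundant-write" from the abstract, as a toy sketch (hypothetical threshold and structures, not BetrFS's Bε-tree implementation): a clone keeps sharing data with its source until it has overwritten enough of it, and only then is a private copy materialized.

```python
class Extent:
    def __init__(self, data):
        self.data = bytearray(data)
        self.refcount = 1

class Clone:
    COPY_THRESHOLD = 0.5    # hypothetical "abundance" threshold (fraction dirtied)

    def __init__(self, extent):
        extent.refcount += 1
        self.extent = extent
        self.dirty = set()      # offsets logically overwritten by this clone
        self.overlay = {}       # small updates buffered before any copy is made

    def write(self, offset, byte):
        self.dirty.add(offset)
        self.overlay[offset] = byte
        if (self.extent.refcount > 1 and
                len(self.dirty) / len(self.extent.data) >= self.COPY_THRESHOLD):
            # Abundant writes: break sharing and materialize a private copy.
            self.extent.refcount -= 1
            self.extent = Extent(self.extent.data)
            for off, b in self.overlay.items():
                self.extent.data[off] = b

    def read(self, offset):
        return self.overlay.get(offset, self.extent.data[offset])

base = Extent(b"hello world!")
c = Clone(base)
c.write(0, ord("H"))           # small change: still sharing with the base
print(c.extent is base)        # True
for i in range(6):
    c.write(i, ord("X"))       # enough changes: a private copy is made
print(c.extent is base)        # False
```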
Authors: Tirthak Patel, Northeastern University; Suren Byna, Glenn K.Lockwood, and Nicholas J. Wright, Lawrence Berkeley National Laboratory; Philip
Carns and Robert Ross, Argonne National Laboratory; Devesh Tiwari, Northeastern University
Abstract: Large-scale high-performance computing (HPC) applications running on supercomputers produce large amounts of data routinely and store it in files on multi-PB shared parallel storage systems. Unfortunately, storage community has a limited understanding of the access and reuse patterns of these files. This paper investigates the access and reuse patterns of I/O- intensive files on a production-scale supercomputer.
Memo
Authors: Tirthak Patel, Northeastern University; Rohan Garg, Nutanix; Devesh Tiwari, Northeastern University
Abstract: Large-scale parallel applications are highly data-intensive and perform terabytes of I/O routinely. Unfortunately, on a large-scale system where
multiple applications run concurrently, I/O contention negatively affects system efficiency and causes unfair bandwidth allocation among applications. To address these challenges, this paper introduces GIFT, a principled dynamic approach to achieve fairness among competing applications and improve system efficiency.
Misc notes
"principled dynamic approach"
I am not sure, but it sounds like an approach that adjusts dynamically according to some rules?
The paper's title is "A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems". So the "principled dynamic approach" is presumably dynamic adjustment via a coupon-based throttle-and-reward mechanism. Perhaps applications that accept throttling are later given preferential treatment through rewards called coupons?
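A toy interpretation of the coupon idea, based only on the title and abstract (the paper's actual algorithm will differ): an application throttled below its fair share banks coupons, which later give it priority for extra bandwidth.

```python
class CouponBroker:
    """Toy throttle-and-reward accounting, not GIFT's algorithm."""

    def __init__(self, apps):
        self.coupons = {app: 0.0 for app in apps}

    def settle(self, app, fair_share, granted):
        """Record one allocation round's outcome for one application."""
        if granted < fair_share:
            self.coupons[app] += fair_share - granted      # reward the throttled app
        elif granted > fair_share:
            self.coupons[app] = max(0.0, self.coupons[app] - (granted - fair_share))

    def priority(self, app):
        """Apps holding more coupons get priority when spare bandwidth appears."""
        return self.coupons[app]

broker = CouponBroker(["A", "B"])
broker.settle("A", fair_share=50, granted=30)    # A was throttled by 20 MB/s
broker.settle("B", fair_share=50, granted=70)    # B used the freed bandwidth
print(broker.coupons)                            # {'A': 20.0, 'B': 0.0}
print(max(broker.coupons, key=broker.priority))  # A should be rewarded next round
```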
Authors: Jie Zhang and Miryeong Kwon, KAIST; Michael Swift, University of Wisconsin–Madison; Myoungsoo Jung, KAIST
Abstract: NVMe is designed to unshackle flash from a traditional storage bus by allowing hosts to employ many threads to achieve higher bandwidth. While NVMe enables users to fully exploit all levels of parallelism offered by modern SSDs, current firmware designs are not scalable and have difficulty in handling a large number of I/O requests in parallel due to its limited computation power and many hardware contentions.
We propose DeepFlash, a novel manycore-based storage platform that can process more than a million I/O requests in a second (1MIOPS) while hiding long latencies imposed by its internal flash media. Inspired by a parallel data analysis system, we design the firmware based on many-to-many threading model that can be scaled horizontally. The proposed DeepFlash can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components across many cores within the device. To show its extreme parallel scalability, we implement DeepFlash on a many-core prototype processor that employs dozens of lightweight cores, analyze new challenges from parallel I/O processing and address the challenges by applying concurrency-aware optimizations. Our comprehensive evaluation reveals that DeepFlash can serve around 4.5 GB/s, while minimizing the CPU demand on microbenchmarks and real server workloads.
Memo
NVMe
NVMe is designed to let hosts use many threads to achieve high bandwidth. NVMe enables users to exploit all levels of parallelism offered by modern SSDs, but current firmware designs are not scalable and struggle to handle a large number of I/O requests in parallel, due to limited compute power and many hardware contentions.
This paper proposes a mechanism called DeepFlash.
DeepFlash is a many-core-based storage platform that can process more than a million I/O requests per second (1 MIOPS) while hiding the long latencies imposed by the internal flash media.
Inspired by parallel data analysis systems, the firmware is designed around a many-to-many threading model that can scale horizontally.
By executing multiple firmware components concurrently across many cores inside the device, DeepFlash extracts the maximum performance of the underlying flash.
DeepFlash is implemented on a many-core prototype processor with dozens of lightweight cores to demonstrate its parallel scalability; the authors analyze the new challenges of parallel I/O processing and address them with concurrency-aware optimizations.
Authors: Stathis Maneas and Kaveh Mahdaviani, University of Toronto; Tim Emami, NetApp; Bianca Schroeder, University of Toronto
Abstract: This paper presents the first large-scale field study of NAND-based SSDs in enterprise storage systems (in contrast to drives in
distributed data center storage systems). The study is based on a very comprehensive set of field data, covering 1.4 million SSDs of a major storage
vendor (NetApp). The drives comprise three different manufacturers, 18 different models, 12 different capacities, and all major flash technologies (SLC, cMLC, eMLC, 3D-TLC). The data allows us to study a large number of factors that were not studied in previous works, including the effect of firmware versions, the reliability of TLC NAND, and correlations between drives within a RAID system. This paper presents our analysis, along with a number of practical implications derived from it.
Memo
Presents the first large-scale field study of NAND-based SSDs in enterprise storage systems
(in contrast to drives in distributed data center storage systems).
The study is based on a very comprehensive set of field data covering 1.4 million SSDs from a major storage vendor (NetApp).
The drives span 3 manufacturers, 18 models, 12 capacities, and all major flash technologies (SLC, cMLC, eMLC, 3D-TLC).
(SLC = Single Level Cell; MLC = Multi Level Cell; cMLC = consumer MLC; eMLC = enterprise MLC; TLC = Triple Level Cell; 3D-TLC = stacked TLC)
The data makes it possible to study many factors that previous studies could not, such as the effect of firmware versions, TLC NAND reliability, and correlations between drives within a RAID system.
Authors: Sidi Lu and Bing Luo, Wayne State University; Tirthak Patel, Northeastern University; Yongtao Yao, Wayne State University; Devesh Tiwari,
Northeastern University; Weisong Shi, Wayne State University
Abstract: Disk drives are one of the most commonly replaced hardware components and continue to pose challenges for accurate failure prediction. In
this work, we present analysis and findings from one of the largest disk failure prediction studies covering a total of 380,000 hard drives over a period of two months across 64 sites of a large leading data center operator. Our proposed machine learning based models predict disk failures with 0.95 F-measure and 0.95 Matthews correlation coefficient (MCC) for 10-days prediction horizon on average.
Memo
Disk drives are among the most commonly replaced hardware components, and accurate failure prediction remains a challenge.
This work presents analysis and findings from one of the largest disk failure prediction studies, covering a total of 380,000 hard drives over a period of two months across 64 sites of a large, leading data center operator.
The proposed machine-learning-based models predict disk failures with 0.95 F-measure and 0.95 Matthews correlation coefficient (MCC) for a 10-day prediction horizon on average.
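For reference, the two reported metrics can be computed from a confusion matrix as follows (standard definitions; the confusion matrix below is made up for illustration, not the paper's data). MCC is often preferred for disk-failure data because failures are a tiny minority class.

```python
import math

def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; robust to heavy class imbalance."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical confusion matrix for illustration only:
tp, tn, fp, fn = 95, 99_890, 5, 10
print(round(f_measure(tp, fp, fn), 3), round(mcc(tp, tn, fp, fn), 3))
```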
Authors: Jian Yang, Juno Kim, and Morteza Hoseinzadeh, UC San Diego; Joseph Izraelevitz, University of Colorado, Boulder; Steve Swanson, UC San Diego
Abstract: After nearly a decade of anticipation, scalable nonvolatile memory DIMMs are finally commercially available with the release of Intel’s Optane
DIMM. This new nonvolatile DIMM supports byte-granularity accesses with access times on the order of DRAM, while also providing data storage that survives power outages.
Researchers have not idly waited for real nonvolatile DIMMs (NVDIMMs) to arrive. Over the past decade, they have written a slew of papers proposing new programming models, file systems, libraries, and applications built to exploit the performance and flexibility that NVDIMMs promised to deliver. Those papers drew conclusions and made design decisions without detailed knowledge of how real NVDIMMs would behave or how industry would integrate them into computer architectures. Now that Optane NVDIMMs are actually here, we can provide detailed performance numbers, concrete guidance for programmers on these systems, reevaluate prior art for performance, and reoptimize persistent memory software for the real Optane DIMM.
In this paper, we explore the performance properties and characteristics of Intel’s new Optane DIMM at the micro and macro level. First, we investigate the basic characteristics of the device, taking special note of the particular ways in which its performance is peculiar relative to traditional DRAM or other past methods used to emulate NVM. From these observations, we recommend a set of best practices to maximize the performance of the device. With our improved understanding, we then explore and reoptimize the performance of prior art in application-level software for persistent memory.
Memo
With the release of Intel's Optane DIMM, DIMM-form-factor nonvolatile memory is finally commercially available. This new nonvolatile DIMM supports byte-granularity accesses with access times on the order of DRAM, while also providing data storage that survives power outages.
Researchers did not idly wait for real NVDIMMs to arrive. Over the past decade they wrote many papers proposing new programming models, file systems, libraries, and applications built to exploit the performance and flexibility NVDIMMs promised. Those papers drew conclusions and made design decisions without detailed knowledge of how real NVDIMMs would behave or how industry would integrate them into computer architectures. Now that Optane NVDIMMs are here, the authors can provide detailed performance numbers, concrete guidance for programmers on these systems, re-evaluate prior work, and re-optimize persistent-memory software for the real Optane DIMM.
This paper explores the performance properties and characteristics of Intel's Optane DIMM at the micro and macro level. First, it investigates the basic characteristics of the device, paying special attention to the ways its performance differs from traditional DRAM and from past methods used to emulate NVM. From these observations, the authors recommend a set of best practices to maximize the device's performance. With this improved understanding, they then explore and re-optimize the performance of prior application-level persistent-memory software.
Misc notes
Authors: Miryeong Kwon, Donghyun Gouk, and Changrim Lee, KAIST; Byounggeun Kim and Jooyoung Hwang, Samsung; Myoungsoo Jung, KAIST
Abstract: We propose DC-store, a storage framework that offers deterministic I/O performance for a multi-container execution environment. DC-store’s
hardware-level design implements multiple NVM sets on a shared storage pool, each providing a deterministic SSD access time by removing internal resource conflicts. In parallel, software support of DC-Store is aware of the NVM sets and enlightens Linux kernel to isolate noisy neighbor containers, performing page frame reclaiming, from peers. We prototype both hardware and software counterparts of DC-Store and evaluate them in a real system. The evaluation results demonstrate that containerized data-intensive applications on DC-Store exhibit 31% shorter average execution time, on average, compared to those on a baseline system.
Memo
This paper proposes DC-Store, a storage framework that offers deterministic I/O performance for a multi-container execution environment. DC-Store's hardware-level design implements multiple NVM sets on a shared storage pool, each providing a deterministic SSD access time by removing internal resource conflicts.
In parallel, the DC-Store software support is aware of the NVM sets and makes the Linux kernel isolate noisy neighbor containers, which perform page-frame reclaiming, from their peers.
Both the hardware and software parts of DC-Store are prototyped and evaluated on a real system. The results show that containerized data-intensive applications on DC-Store exhibit, on average, 31% shorter execution time than on the baseline system.
Misc notes
A storage framework consisting of a hardware part and a software part. The hardware part does not build new hardware itself; it feels more like designing the hardware configuration.
How do they determine which containers are "noisy"?
Authors: Aviv Nachman and Gala Yadgar, Technion - Israel Institute of Technology; Sarai Sheinvald, Braude College of Engineering
Abstract: Deduplication decreases the physical occupancy of files in a storage volume by removing duplicate copies of data chunks, but creates
data-sharing dependencies that complicate standard storage management tasks. Specifically, data migration plans must consider the dependencies between files that are remapped to new volumes and files that are not. Thus far, only greedy approaches have been suggested for constructing such plans, and it is unclear how they compare to one another and how much they can be improved.
We set to bridge this gap for seeding—migration in which the target volume is initially empty. We present GoSeed, a formulation of seeding as an integer
linear programming (ILP) problem, and three acceleration methods for applying it to real-sized storage volumes. Our experimental evaluation shows that, while the greedy approaches perform well on “easy” problem instances, the cost of their solution can be significantly higher than that of GoSeed’s solution on “hard” instances, for which they are sometimes unable to find a solution at all.
Memo
Deduplication reduces the physical occupancy of files in a storage volume by removing duplicate data chunks. However, it creates data-sharing dependencies that complicate standard storage management tasks. Specifically, a data migration plan must consider the dependencies between files in the source volume. So far, only greedy approaches have been proposed for constructing such plans, and it is unclear how they compare to one another and how much they can be improved.
The authors set out to bridge this gap for seeding: migration in which the target volume is initially empty. They present GoSeed, a formulation of seeding as an integer linear programming (ILP) problem, and three acceleration methods for applying it to real-sized storage volumes.
According to the evaluation, the greedy approaches work well on "easy" problem instances, but on "hard" instances (for which they sometimes cannot find a solution at all) their solutions can cost significantly more than GoSeed's.
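To see why seeding is hard, here is a brute-force toy version of the problem GoSeed formulates as an ILP (illustration only; GoSeed does not enumerate subsets): choose files to move to an empty volume so that enough physical capacity migrates, while minimizing the chunks that must be replicated because files left behind still share them.

```python
from itertools import combinations

files = {                       # file -> set of (deduplicated) chunk ids
    "f1": {"a", "b"},
    "f2": {"b", "c"},
    "f3": {"c", "d", "e"},
    "f4": {"e", "f"},
}
target = 3                      # physical chunks we want moved to the new volume

def plan_cost(moved):
    moved_chunks = set().union(*(files[f] for f in moved))
    kept_chunks = set().union(*(files[f] for f in files if f not in moved))
    replicated = moved_chunks & kept_chunks   # shared chunks must exist on both volumes
    return len(moved_chunks), len(replicated)

best = None
for k in range(1, len(files) + 1):
    for moved in combinations(files, k):
        migrated, replicated = plan_cost(moved)
        if migrated >= target and (best is None or replicated < best[1]):
            best = (moved, replicated, migrated)
print("move:", best[0], "replicated chunks:", best[1], "migrated chunks:", best[2])
```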
Misc notes
Authors: Zhichao Cao, University of Minnesota, Twin Cities, and Facebook;Siying Dong and Sagar Vemuri, Facebook; David H.C. Du, University of Minnesota, Twin Cities
Abstract: Persistent key-value stores are widely used as building blocks in today’s IT infrastructure for managing and storing large amounts of data.
However, studies of characterizing real-world workloads for key-value stores are limited due to the lack of tracing/analyzing tools and the difficulty of
collecting traces in operational environments. In this paper, we first present a detailed characterization of workloads from three typical RocksDB production use cases at Facebook: UDB (a MySQL storage layer for social graph data), ZippyDB (a distributed key-value store), and UP2X (a distributed key-value store for AI/ML services). These characterizations reveal several interesting findings: first, that the distribution of key and value sizes are highly related to the use cases/applications; second, that the accesses to key-value pairs have a good locality and follow certain special patterns; and third, that the collected performance metrics show a strong diurnal pattern in the UDB, but not the other two.
We further discover that although the widely used key-value benchmark YCSB provides various workload configurations and key-value pair access distribution models, the YCSB-triggered workloads for underlying storage systems are still not close enough to the workloads we collected due to ignorance of key-space localities. To address this issue, we propose a key-range based modeling and develop a benchmark that can better emulate the workloads of real-world key-value stores. This benchmark can synthetically generate more precise key-value queries that represent the reads and writes of key-value stores to the underlying storage system.
Memo
Key-Value
Persistent key-value stores are widely used, but studies characterizing their real-world workloads are limited, because tracing/analysis tools are lacking and traces are hard to collect in operational environments.
This paper characterizes workloads from three typical RocksDB production use cases at Facebook:
UDB (MySQL storage layer for social graph data)
ZippyDB (distributed key-value store)
UP2X (distributed key-value store for AI/ML services)
Interesting findings:
The distributions of key and value sizes are highly related to the use case/application.
Accesses to key-value pairs have good locality and follow certain special patterns.
Based on the collected performance metrics, UDB shows a strong diurnal pattern, while the other two do not.
Other findings:
YCSB, which is widely used as a key-value benchmark, provides various workload configurations and key-value access distribution models, but the workloads it generates for the underlying storage system still differ from the collected ones, because it ignores key-space locality.
To address this, the authors propose key-range-based modeling and develop a benchmark that better emulates the workloads of real-world key-value stores. The benchmark can synthetically generate more precise key-value queries.
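A toy key-range-based generator, to illustrate what "key-space locality" means here (my sketch, not the paper's benchmark): keys are drawn per range, with a few hot ranges, instead of hot keys being scattered arbitrarily across the whole key space.

```python
import random

def key_range_workload(n_ops, n_keys, n_ranges=10, hot_ranges=(3,), hot_prob=0.8,
                       rng=random.Random(0)):
    """Split the key space into ranges, make a few ranges 'hot', and draw keys
    uniformly inside the chosen range, so locality exists at the range level."""
    range_size = n_keys // n_ranges
    cold_ranges = [r for r in range(n_ranges) if r not in hot_ranges]
    for _ in range(n_ops):
        if rng.random() < hot_prob:
            r = rng.choice(list(hot_ranges))
        else:
            r = rng.choice(cold_ranges)
        yield r * range_size + rng.randrange(range_size)

keys = list(key_range_workload(10_000, n_keys=1_000_000))
hot_hits = sum(300_000 <= k < 400_000 for k in keys)   # range 3 is the hot range
print(f"{hot_hits / len(keys):.0%} of accesses fall in the hot range")
```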
Misc notes
Could this be applied to workload studies in our own production environment?
YCSB appears to be a benchmark developed by Yahoo! Research (Yahoo! Cloud Serving Benchmark).
2010 | "Benchmarking cloud serving systems with YCSB"
https://blog.acolyer.org/2020/03/11/rocks-db-at-facebook/
Authors: Teng Zhang, Alibaba Group, Alibaba-Zhejiang University Joint Institute of Frontier Technologies, Zhejiang University; Jianying Wang, Xuntao
Cheng, and Hao Xu, Alibaba Group; Nanlong Yu, Alibaba-Zhejiang University Joint Institute of Frontier Technologies, Zhejiang University; Gui Huang, Tieying Zhang, Dengcheng He, Feifei Li, and Wei Cao, Alibaba Group; Zhongdong Huang and Jianling Sun, Alibaba-Zhejiang University Joint Institute of Frontier Technologies, Zhejiang University
Abstract: Log-Structured Merge Tree (LSM-tree) key-value (KV) stores have been widely deployed in the industry due to its high write efficiency and low
costs as a tiered storage. To maintain such advantages, LSM-tree relies on a background compaction operation to merge data records or collect garbages for housekeeping purposes. In this work, we identify that slow compactions jeopardize the system performance due to unchecked oversized levels in the
LSM-tree, and resource contentions for the CPU and the I/O. We further find that the rising I/O capabilities of the latest disk storage have pushed compactions to be bounded by CPUs when merging short KVs. This causes both query/transaction processing and background compactions to compete for the bottlenecked CPU resources extensively in an LSM-tree KV store.
In this paper, we propose to offload compactions to FPGAs aiming at accelerating compactions and reducing the CPU bottleneck for storing short KVs. Evaluations have shown that the proposed FPGA-offloading approach accelerates compactions by 2 to 5 times, improves the system throughput by up to 23%, and increases the energy efficiency (number of transactions per watt) by up to 31.7%, compared with the fine-tuned CPU-only baseline. Without loss of generality, we implement our proposal in X-Engine, a latest LSM-tree storage engine.
Memo
LSM-tree (Log-Structured Merge Tree) KV stores are widely used in industry as tiered storage because they offer high write efficiency at low cost. To maintain these advantages, the LSM-tree relies on background compaction (merging data records or collecting garbage) for housekeeping. The authors identify that slow compactions jeopardize system performance, due to unchecked oversized levels in the LSM-tree and resource contention for the CPU and I/O. They further find that the rising I/O capabilities of the latest disk storage have made compactions CPU-bound when merging short KVs, which causes query/transaction processing and background compactions to compete heavily for the bottlenecked CPU.
This paper proposes offloading compactions to FPGAs, aiming to accelerate compactions and relieve the CPU bottleneck for short KVs. According to the evaluation, the FPGA-offloading approach accelerates compactions by 2 to 5 times, improves system throughput by up to 23%, and increases energy efficiency (transactions per watt) by up to 31.7%, compared with a fine-tuned CPU-only baseline. Without loss of generality, the authors implement the proposal in X-Engine, a recent LSM-tree storage engine.
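For context on what the offloaded work is, here is the core of a compaction pass, a multi-way merge that keeps only the newest version of each key (a minimal sketch; real compactions also handle deletes, sequence numbers, and on-disk block formats). With short KVs, this per-key work is exactly the CPU-bound part.

```python
import heapq

def compact(*sorted_runs):
    """Merge sorted runs of (key, value) pairs; for duplicate keys the run
    listed first is treated as the newest and wins."""
    merged, last_key = [], None
    # heapq.merge keeps overall key order; the run index breaks ties so the
    # newest version of a key is seen first.
    for key, _, value in heapq.merge(
            *[[(k, idx, v) for k, v in run] for idx, run in enumerate(sorted_runs)]):
        if key != last_key:             # drop older versions of the same key
            merged.append((key, value))
            last_key = key
    return merged

new_run = [("a", 1), ("c", 30)]
old_run = [("a", 0), ("b", 2), ("c", 3)]
print(compact(new_run, old_run))   # [('a', 1), ('b', 2), ('c', 30)]
```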
Misc notes
X-Engine | (SIGMOD 2019) "X-Engine: An Optimized Storage Engine for Large-scale E-commerce Transaction Processing"
https://www.cs.utah.edu/~lifeifei/papers/sigmod-xengine.pdf
As is commonly said, compaction in an LSM-tree does not involve that much I/O work, so on high-speed devices the I/O bandwidth is rarely the limit and the rate ends up being dictated by the CPU.
Authors: Teng Zhang, Alibaba Group, Alibaba-Zhejiang University Joint Institute of Frontier Technologies, Zhejiang University; Jianying Wang, Xuntao
Cheng, and Hao Xu, Alibaba Group; Nanlong Yu, Alibaba-Zhejiang University Joint Institute of Frontier Technologies, Zhejiang University; Gui Huang, Tieying Zhang, Dengcheng He, Feifei Li, and Wei Cao, Alibaba Group; Zhongdong Huang and Jianling Sun, Alibaba-Zhejiang University Joint Institute of Frontier Technologies, Zhejiang University
Abstract: In-memory key-value stores (KVSes) are widely used to cache hot data, in order to solve the hotspot issue in disk-based storage or distributed
systems. The hotspot issue inside in-memory KVSes is however being overlooked. Due to the recent trend that hotspot issue becomes more serious, the lack of hotspot-awareness in existing KVSes make them poorly performed and unreliable on highly skewed workloads. In this paper, we explore hotspot-aware designs for in-memory index structures in KVSes. We first analyze the potential benefits from ideal hotspot-aware indexes, and discuss challenges (i.e., hotspot shift and concurrent access issues) in effectively leveraging hotspot-awareness. Based on these insights, we propose a novel hotspot-aware KVS, named HotRing, that is optimized for massively concurrent accesses to a small portion of items. HotRing is based on an ordered-ring hash index structure, which provides fast access to hot items by moving head pointers closer to them. It also applies a lightweight strategy to detect hotspot shifts at run-time. HotRing comprehensively adopts lock-free structures in its design, for both common operations (i.e., read, update) and HotRing-specific operations (i.e., hotspot shift detection, head pointer movement and ordered-ring rehash), so that massively concurrent requests can better leverage multi-core architectures. The extensive experiments show that our approach is able to achieve 2.58× improvement compared to other in-memory KVSes on highly skewed workloads.
Memo
In-memory KVSes are widely used to cache hot data, which addresses the hotspot issue in disk-based storage and distributed systems. The hotspot issue inside the in-memory KVS itself, however, has been overlooked. As hotspot issues have become more serious in recent years, the lack of hotspot-awareness in existing KVSes makes them perform poorly and unreliably on highly skewed workloads.
This paper explores hotspot-aware designs for in-memory index structures in KVSes. The authors first analyze the potential benefits of an ideal hotspot-aware index and discuss the challenges in effectively exploiting hotspot-awareness (i.e., hotspot shift and concurrent access issues). Based on these insights, they propose a hotspot-aware KVS called HotRing, optimized for massively concurrent access to a small portion of items. HotRing is based on an ordered-ring hash index structure that provides fast access to hot items by moving head pointers closer to them. It also applies a lightweight strategy to detect hotspot shifts at run time. HotRing adopts lock-free structures for both common operations (read, update) and HotRing-specific operations (hotspot shift detection, head pointer movement, ordered-ring rehash), so massively concurrent requests can better exploit multi-core architectures.
Experiments show that the approach achieves up to a 2.58× improvement over other in-memory KVSes on highly skewed workloads.
Misc notes
highly skewed workload: presumably a workload in which massively concurrent requests are biased toward a few specific items.
hotspot shift: the hotspot changing at run time?
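A single-threaded toy of the ordered-ring idea, to answer the two questions above in code form (this ignores the lock-free aspects and HotRing's sampled hotspot-shift detection): the bucket head chases the most frequently accessed item, so once a hotspot shifts, lookups of the new hot item become nearly free.

```python
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.next = None

class RingBucket:
    """Each hash bucket is a circular list; the head pointer is moved next to
    the most frequently accessed item."""

    def __init__(self, items):
        nodes = [Node(k, v) for k, v in sorted(items)]
        for a, b in zip(nodes, nodes[1:] + nodes[:1]):
            a.next = b                      # close the ring
        self.head = nodes[0]
        self.hits = {k: 0 for k, _ in items}

    def get(self, key):
        node, steps = self.head, 0
        while node.key != key:
            node, steps = node.next, steps + 1
        self.hits[key] += 1
        if self.hits[key] > self.hits[self.head.key]:
            self.head = node                # hotspot-aware: move head to the hot item
        return node.value, steps

bucket = RingBucket([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
print(bucket.get("d"))      # first lookup walks the ring: (4, 3)
print(bucket.get("d"))      # head has moved next to the hotspot: (4, 0)
```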
Authors: Shucheng Wang, Ziyi Lu, and Qiang Cao, Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System; Hong Jiang, Department of Computer Science and Engineering, University of Texas at Arlington; Jie Yao, School of Computer Science and Technology, Huazhong
University of Science and Technology; Yuanyuan Dong and Puyuan Yang, Alibaba Group
Abstract: Hybrid Storage servers combining high-speed SSDs and high-capacity HDDs are designed for high cost-effectiveness and provide μs-level
responsiveness for applications. Observations from the production hybrid cloud storage system Pangu suggest that HDDs are often severely underutilized while SSDs are overused, especially for writes that dominate the hybrid storage. This lopsided utilization between HDDs and SSDs leads to not only fast wear-out in the latter but also very high tail latency due to frequent garbage collections induced by intensive writes to the latter. On the other hand, our extensive experimental study reveals that a series of sequential and continuous writes to HDDs exhibit a periodic, staircase shaped pattern of write latency, i.e., low (e.g., 35μs), middle (e.g., 55μs), and high latency (e.g., 12ms), resulting from buffered writes in HDD’s controller. This suggests that HDDs can potentially provide μs-level write IO delay (for appropriately scheduled writes), which is close to SSDs’ write performance. These observations inspire us to effectively exploit this performance potential of HDDs to absorb as many writes as possible to avoid SSD overuse without performance degradation.
To achieve this goal, we first characterize performance behaviors of hybrid storage in general and its HDDs in particular. Based on the findings on
sequential and continuous writes, we propose a prediction model to accurately determine next write latency state (i.e., fast, middle and slow). With this
model, a Buffer-Controlled Write approach, BCW, is proposed to proactively and effectively control buffered writes so that low- and mid-latency periods in HDDs are scheduled with application write data and high-latency periods are filled with padded data. Based on BCW, we design a mixed IO scheduler (MIOS) to adaptively steer incoming data to SSDs and HDDs according to write patterns, runtime queue lengths, and disk status. We perform extensive evaluations under production workloads and benchmarks. The results show that MIOS removes up to 93% amount of data written to SSDs, reduces average and 99th-percentile latencies of the hybrid server by 65% and 85% respectively.
Memo
Hybrid storage servers combine high-speed SSDs with high-capacity HDDs to provide high cost-effectiveness and μs-level responsiveness. Observations from Pangu, a production hybrid cloud storage system, suggest that SSDs are overused while HDDs are often severely underutilized, especially for the writes that dominate hybrid storage.
Pangu is Alibaba Cloud's distributed file system.
“Pangu – The High Performance Distributed File System by Alibaba Cloud”
https://www.alibabacloud.com/blog/pangu-the-high-performance-distributed-file-system-by-alibaba-cloud_594059
This lopsided utilization not only wears out the SSDs quickly but also produces very high tail latency, because the intensive writes to the SSDs trigger frequent garbage collection. On the other hand, the authors' study shows that sequential and continuous writes to HDDs exhibit a periodic, staircase-shaped write-latency pattern (i.e., low (e.g., 35μs), middle (e.g., 55μs), and high latency (e.g., 12ms)) caused by buffered writes in the HDD's controller. This suggests that HDDs can potentially provide μs-level write I/O delay (for appropriately scheduled writes), close to SSDs' write performance. These observations inspired the idea of exploiting this performance potential of HDDs to absorb as many writes as possible, avoiding SSD overuse without degrading performance.
To achieve this, the paper first characterizes the performance behavior of hybrid storage in general and of its HDDs in particular. Based on the findings about sequential and continuous writes, the authors propose a prediction model that accurately determines the next write-latency state (i.e., fast, middle, or slow). With this model, they propose the Buffer-Controlled Write approach (BCW), which proactively and effectively controls buffered writes so that the HDD's low- and mid-latency periods are scheduled with application write data and its high-latency periods are filled with padded data. Based on BCW, they design a mixed I/O scheduler (MIOS) that adaptively steers incoming data to SSDs and HDDs according to write patterns, runtime queue lengths, and disk status. Evaluations under production workloads and benchmarks show that MIOS removes up to 93% of the data written to SSDs and reduces the average and 99th-percentile latencies of the hybrid server by 65% and 85%, respectively.
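A toy sketch of the scheduling idea as I understand it from the abstract (the state predictor below is invented for illustration; the paper builds its model from measured staircase latency patterns): user writes are steered into predicted fast/mid buffer states, and a padding write deliberately consumes each predicted slow state.

```python
from collections import deque

FAST, SLOW = "fast", "slow"

def predict_next_state(history, fast_budget=4):
    """Invented stand-in for the paper's prediction model: assume the HDD
    buffer absorbs a few consecutive writes quickly, then one flush causes a
    slow write."""
    recent = list(history)[-fast_budget:]
    return SLOW if len(recent) == fast_budget and SLOW not in recent else FAST

def buffer_controlled_writes(requests, rounds=12):
    """BCW idea: user data goes into predicted fast periods; a padding write
    'sacrifices' each predicted slow period so the next user write is fast again."""
    queue, log, history = deque(requests), [], deque(maxlen=16)
    for _ in range(rounds):
        state = predict_next_state(history)
        if state == SLOW:
            log.append(("PAD", state))          # fill the slow slot with padded data
        elif queue:
            log.append((queue.popleft(), state))
        history.append(state)
    return log

print(buffer_controlled_writes([f"req{i}" for i in range(8)]))
```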
Misc notes
Authors: Ao Wang and Jingyuan Zhang, George Mason University; Xiaolong Ma,University of Nevada, Reno; Ali Anwar, Lukas Rupprecht, Dimitrios Skourtis, and Vasily Tarasov, IBM Research–Almaden; Feng Yan, University of Nevada, Reno; Yue Cheng, George Mason University
Abstract: Internet-scale web applications are becoming increasingly storage-intensive and rely heavily on in-memory object caching to attain
required I/O performance. We argue that the emerging serverless computing paradigm provides a well-suited, cost-effective platform for object caching. We present InfiniCache, a first-of-its-kind in-memory object caching system that is completely built and deployed atop ephemeral serverless functions. InfiniCache exploits and orchestrates serverless functions’ memory resources to enable elastic pay-per-use caching. InfiniCache’s design combines erasure coding, intelligent billed duration control, and an efficient data backup mechanism to maximize data availability and cost-effectiveness while balancing the risk of losing cached state and performance. We implement InfiniCache on AWS Lambda and show that it: (1) achieves 31 – 96× tenant-side cost savings compared to AWS ElastiCache for a large-object-only production workload, (2) can effectively provide 95.4% data availability for each one hour window, and (3) enables comparative performance seen in a typical in-memory cache.
Memo
Internet-scale web applications are becoming increasingly storage-intensive and rely heavily on in-memory object caching to attain the required I/O performance. The authors argue that the emerging serverless computing paradigm provides a well-suited, cost-effective platform for object caching.
They propose InfiniCache, the first in-memory object caching system built and deployed entirely on top of ephemeral serverless functions. InfiniCache exploits and orchestrates the memory resources of serverless functions to enable elastic, pay-per-use caching.
InfiniCache's design combines erasure coding, intelligent billed-duration control, and an efficient data backup mechanism to maximize data availability and cost-effectiveness while balancing the risk of losing cached state against performance.
InfiniCache is implemented on AWS Lambda and shown to:
(1) achieve 31–96× tenant-side cost savings compared to AWS ElastiCache for a large-object-only production workload,
(2) effectively provide 95.4% data availability for each one-hour window, and
(3) deliver performance comparable to a typical in-memory cache.
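To make the erasure-coding part concrete, here is the simplest possible variant, XOR parity across k data shards, which tolerates the loss of any single shard (InfiniCache itself spreads Reed-Solomon-style coded chunks across many Lambda functions; this only sketches the principle).

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(obj, k=3):
    """Split an object into k equal data shards plus one XOR parity shard."""
    shard_len = -(-len(obj) // k)                  # ceiling division
    obj = obj.ljust(k * shard_len, b"\0")          # pad so shards are equal-sized
    shards = [obj[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards + [parity]

def decode(shards, lost_index, k=3):
    """Rebuild the shard held by a reclaimed function from the survivors, then
    reassemble the object (padding is stripped; real systems store the length)."""
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    rebuilt = survivors[0]
    for s in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, s)
    data = [rebuilt if i == lost_index else shards[i] for i in range(k)]
    return b"".join(data).rstrip(b"\0")

shards = encode(b"a large cached object payload")   # 4 shards for 4 Lambda functions
print(decode(shards, lost_index=1))                  # still recoverable if one is lost
```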
Misc notes
AWS Lambda
What does it mean to build a cache system on top of it? Can instances be reused?
It is a cache system on ephemeral serverless functions, yet it is named InfiniCache? Perhaps because with auto-scaling it can cache semi-infinitely???
Authors: Abhishek Vijaya Kumar and Muthian Sivathanu, Microsoft Research India
Abstract: We introduce Quiver, an informed storage cache for deep learning training (DLT) jobs in a cluster of GPUs. Quiver employs domain-specific
intelligence within the caching layer, to achieve much higher efficieny compared to a generic storage cache. First, Quiver uses a secure hash-based addressing to transparently reuse cached data across multiple jobs and even multiple users operating on the same dataset. Second, by co-designing with the deep learning framework (\eg, PyTorch), Quiver employs a technique of {\em substitutable cache hits} to get more value from the existing contents of the cache, thus avoiding cache thrashing when cache capacity is much smaller than the working set. Third, Quiver dynamically prioritizes cache allocation to jobs that benefit the most from the caching. With a prototype implementation in PyTorch, we show that Quiver can significantly improve throughput of deep learning workloads.
Memo
Proposes Quiver, an informed storage cache for deep learning training (DLT) jobs in a GPU cluster. Quiver employs domain-specific intelligence within the caching layer to achieve much higher efficiency than a generic storage cache.
First, Quiver uses secure hash-based addressing, which allows cached data to be transparently reused across multiple jobs and even multiple users operating on the same dataset.
Second, by co-designing with the deep learning framework (e.g., PyTorch), Quiver employs a technique of "substitutable cache hits" to get more value from the existing contents of the cache, avoiding cache thrashing when the cache capacity is much smaller than the working set.
Third, Quiver dynamically prioritizes cache allocation to the jobs that benefit the most from caching. With a prototype implementation in PyTorch, the authors show that Quiver can significantly improve the throughput of deep learning workloads.
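A toy sketch of the first two mechanisms (hypothetical class and method names, not Quiver's API): content-hash addressing makes identical dataset chunks shareable across jobs, and "substitutable" hits let the cache hand back any cached, not-yet-consumed samples of the current epoch in place of the exact ones requested.

```python
import hashlib

class SubstitutableCache:
    """Illustration only: (1) entries are addressed by a content hash, so
    identical dataset samples are shared across jobs/users; (2) within an
    epoch any not-yet-consumed sample is as good as the requested one."""

    def __init__(self):
        self.store = {}                                   # content hash -> sample bytes

    @staticmethod
    def address(sample_bytes):
        return hashlib.sha256(sample_bytes).hexdigest()   # shareable, secure address

    def put(self, sample_bytes):
        self.store[self.address(sample_bytes)] = sample_bytes

    def fetch_batch(self, wanted, pending, batch_size):
        """Hits on the wanted hashes first, then substitutes drawn from other
        cached, still-pending hashes; anything left must hit backing storage."""
        batch = [h for h in wanted if h in self.store][:batch_size]
        substitutes = [h for h in pending if h in self.store and h not in batch]
        batch += substitutes[:batch_size - len(batch)]
        return [self.store[h] for h in batch]

cache = SubstitutableCache()
samples = [bytes([i]) * 4 for i in range(6)]
for s in samples[:4]:
    cache.put(s)                                          # only 4 of 6 samples cached
hashes = [SubstitutableCache.address(s) for s in samples]
# The requested samples are not cached, but substitutable hits still fill the batch:
print(len(cache.fetch_batch(wanted=hashes[4:6], pending=set(hashes), batch_size=2)))
```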
Misc notes
Authors: Zizhong Wang, Tongliang Li, Haixia Wang, Airan Shao, Yunren Bai, Shangming Cai, Zihan Xu, and Dongsheng Wang, Tsinghua University
Abstract: Consensus protocols can provide highly reliable and available distributed services. In these protocols, log entries are completely replicated
to all servers. This complete-entry replication causes high storage and network costs, which harms performance.
Erasure coding is a common technique to reduce storage and network costs while keeping the same fault tolerance ability. If the complete-entry replication in consensus protocols can be replaced with an erasure coding replication, storage and network costs can be greatly reduced. RS-Paxos is the first consensus protocol to support erasure-coded data, but it has much poorer availability compared to commonly used consensus protocols, like Paxos and Raft. We point out RS-Paxos’s liveness problem and try to solve it. Based on Raft, we present a new protocol, CRaft. Providing two different replication methods, CRaft can use erasure coding to save storage and network costs like RS-Paxos, while it also keeps the same liveness as Raft.
To demonstrate the benefits of our protocols, we built a key-value store based on CRaft, and evaluated it. In our experiments, CRaft could save 66% of storage, reach a 250% improvement on write throughput and reduce 60.8% of write latency compared to original Raft.
Memo
Consensus protocols provide high reliability and availability for distributed services. In these protocols, log entries are completely replicated to all servers; this complete-entry replication increases storage and network costs and hurts performance.
Erasure coding is a common technique for reducing storage and network costs while keeping the same fault tolerance. If the complete replication of log entries is replaced with erasure-coded replication, storage and network costs can be greatly reduced.
RS-Paxos is the first consensus protocol to support erasure-coded data, but its availability is much poorer than commonly used consensus protocols such as Paxos and Raft. The authors point out RS-Paxos's liveness problem and set out to solve it. Based on Raft, they propose a new protocol, CRaft. Providing two different replication methods, CRaft can use erasure coding to save storage and network costs like RS-Paxos while keeping the same liveness as Raft.
To demonstrate the protocol's benefits, the authors built a key-value store based on CRaft and evaluated it. In their experiments, compared with the original Raft, CRaft saved 66% of storage, improved write throughput by 250%, and reduced write latency by 60.8%.
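A back-of-the-envelope view of where the storage saving comes from (toy accounting under my own assumptions about the coding parameters, not CRaft's protocol): full replication keeps one complete copy of each log entry per server, while a (k, n-k) erasure code keeps only a 1/k fragment per server.

```python
def per_entry_storage(n_servers, k=None):
    """Cluster-wide storage, in units of one log entry, kept for a single entry.
    Full replication (Raft) stores n_servers complete copies; with a
    (k, n_servers - k) erasure code each server stores a 1/k fragment."""
    return n_servers if k is None else n_servers / k

n = 5
raft = per_entry_storage(n)               # 5.0 entries of storage
coded = per_entry_storage(n, k=3)         # ~1.67 entries of storage
print(f"saving: {1 - coded / raft:.0%}")  # ~67%, in the ballpark of the reported 66%
```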
Authors: Rekha Pitchumani and Yang-suk Kee, Memory Solutions Lab, Samsung Semiconductor Inc.
Abstract: Rapid growth in data storage technologies created the modern data-driven world. Modern workloads and application have influenced the
evolution of storage devices from simple block devices to more intelligent object devices. Emerging, next-generation Key-Value (KV) storage devices allow
storage and retrieval of variable-length user data directly onto the devices and can be addressed by user-desired variable-length keys. Traditional reliability schemes for multiple block storage devices, such as Redundant Array of Independent Disks (RAID), have been around for a long time and used by most systems with multiple devices.
Now, the question arises as to what an equivalent for such emerging object devices would look like, and how it would compare against the traditional
mechanism. In this paper, we present Key-Value Multi-Device (KVMD), a hybrid data reliability manager that employs a variety of reliability techniques with different trade-offs, for key-value devices. We present three stateless reliability techniques suitable for variable length values, and evaluate the
hybrid data reliability mechanism employing these techniques using KV SSDs from Samsung. Our evaluation shows that, compared to Linux mdadm-based RAID throughput degradation for block devices, data reliability for KV devices can be achieved at a comparable or lower throughput degradation. In addition, the KV API enables much quicker rebuild and recovery of failed devices, and also allows for both hybrid reliability configuration set automatically based on, say, value sizes, and custom per-object reliability configuration for user data.
Memo
The rapid growth of data storage technology has created the modern data-driven world. Modern workloads and applications have influenced the evolution of storage devices from simple block devices to more intelligent object devices.
Emerging, next-generation key-value (KV) storage devices allow variable-length user data to be stored on and retrieved from the device directly, addressed by user-chosen variable-length keys.
Classical reliability schemes for multiple block storage devices, such as RAID, have been used for many years. What would the equivalent look like for these emerging object devices, and how would it compare to the traditional mechanisms?
In this paper, the authors present Key-Value Multi-Device (KVMD), a hybrid data reliability manager that employs a variety of reliability techniques with different trade-offs for key-value devices. The paper presents three stateless reliability techniques suitable for variable-length values and evaluates the hybrid data reliability mechanism using these techniques on Samsung KV SSDs.
The evaluation shows that, compared to the throughput degradation of Linux mdadm-based RAID for block devices, data reliability for KV devices can be achieved at comparable or lower throughput degradation. In addition, the KV API enables much quicker rebuild and recovery of failed devices, and allows both a hybrid reliability configuration set automatically based on, say, value sizes, and custom per-object reliability configuration for user data.
Misc notes
Should KV SSDs like Samsung's be used as storage devices for object storage? How do they differ from ordinary storage devices?
"Key Value SSD Explained - Concept, Device, System, and Standard"
(2017/09/14)
https://www.snia.org/sites/default/files/SDC/2017/presentations/Object_ObjectDriveStorage/Ki_Yang_Seok_Key_Value_SSD_Explained_Concept_Device_System_and_Standard.pdf
An ordinary KV-store application is built on top of a block device/file system, but a KV SSD seems to provide an API that lets you skip those layers and access the device directly.
What is mdadm?
Authors: Aishwarya Ganesan, Ramnatthan Alagappan, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau, University of Wisconsin–Madison
Abstract: We introduce consistency-aware durability or CAD, a new approach to durability in distributed storage that enables strong consistency while
delivering high performance. We demonstrate the efficacy of this approach by designing cross-client monotonic reads, a novel and strong consistency property that provides monotonic reads across failures and sessions in leader-based systems. We build ORCA, a modified version of ZooKeeper that implements CAD and cross-client monotonic reads. We experimentally show that ORCA provides strong consistency while closely matching the performance of weakly consistent ZooKeeper. Compared to strongly consistent ZooKeeper, ORCA provides significantly higher throughput (1.8 – 3.3×), and notably reduces latency, sometimes by an order of magnitude in geo-distributed settings.
Memo
Proposes consistency-aware durability (CAD), a new approach to durability in distributed storage that enables strong consistency while delivering high performance.
The authors demonstrate the efficacy of this approach by designing cross-client monotonic reads, a novel and strong consistency property that provides monotonic reads across failures and sessions in leader-based systems.
They modify ZooKeeper to build ORCA, which implements CAD and cross-client monotonic reads. Experiments show that ORCA provides strong consistency while closely matching the performance of weakly consistent ZooKeeper. Compared with strongly consistent ZooKeeper, ORCA provides significantly higher throughput (1.8-3.3×) and notably lower latency, sometimes by an order of magnitude in geo-distributed settings.
Misc notes
Apache ZooKeeper