The purpose of this document is to describe the internal code structure and major algorithms used by DAOS. It assumes prior knowledge of the DAOS storage model and acronyms. This document contains the following sections:
As illustrated in the diagram below, a DAOS installation involves several components that can be either colocated or distributed. The DAOS software-defined storage (SDS) framework relies on two different communication channels: an out-of-band TCP/IP network for management and a high-performance fabric for data access. In practice, the same network can be used for both management and data access. IP over fabric can also be used as the management network.
DAOS System
A DAOS server is a multi-tenant daemon running on a Linux instance (i.e. physical node, VM or container) and managing the locally-attached SCM and NVM storage allocated to DAOS. It listens on a management port, addressed by an IP address and a TCP port number, plus one or more fabric endpoints, addressed by network URIs. The DAOS server is configured through a YAML file (/etc/daos/daos_server.yml, or a different path provided on the command line). Starting and stopping the DAOS server can be integrated with different daemon management or orchestration frameworks (e.g. a systemd script, a Kubernetes service or even via a parallel launcher like pdsh or srun).
A DAOS system is identified by a system name and consists of a set of DAOS servers connected to the same fabric. Two different systems comprise two disjoint sets of servers and do not coordinate with each other. DAOS pools cannot span across multiple systems.
Internally, a DAOS server is composed of multiple daemon processes. The first one to be started is the control plane (binary named daos_server), which is responsible for parsing the configuration file, provisioning storage and eventually starting and monitoring one or multiple instances of the data plane (binary named daos_engine). The control plane is written in Go and implements the DAOS management API over the gRPC framework, which provides a secured out-of-band channel to administer a DAOS system. The number of data plane instances to be started by each server, as well as the storage, CPU and fabric interface affinity, can be configured through the daos_server.yml YAML configuration file.
The data plane is a multi-threaded process written in C that runs the DAOS storage engine. It processes incoming metadata and I/O requests through the CART communication middleware and accesses local NVM storage via the PMDK (for storage-class memory, aka SCM) and SPDK (for NVMe SSDs) libraries. The data plane relies on Argobots for event-based parallel processing and exports multiple targets that can be independently addressed via the fabric. Each data plane instance is assigned a unique rank inside a DAOS system.
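To make the addressing model concrete, here is a minimal sketch of engines that each carry a unique rank and export several targets. The `Engine` class, its fields and the `(rank, target index)` pair are illustrative assumptions made for this example, not DAOS data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    """One data plane instance (hypothetical model, not a DAOS type)."""
    rank: int          # unique inside the DAOS system
    num_targets: int   # number of targets exported by this engine


def addressable_targets(engines):
    """Return every (rank, target index) pair, rejecting duplicate ranks."""
    ranks = [e.rank for e in engines]
    if len(ranks) != len(set(ranks)):
        raise ValueError("each engine must have a unique rank in the system")
    return [(e.rank, t) for e in engines for t in range(e.num_targets)]
```

Under these assumptions, two engines with ranks 0 and 1 exporting two and three targets respectively yield five independently addressable targets.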
The control plane and data plane processes communicate locally through Unix Domain Sockets and a custom lightweight protocol called dRPC.
For further reading:
Applications, users and administrators can interact with a DAOS system through two different client APIs.
The DAOS management Go package allows administering a DAOS system from any node that can communicate with the DAOS servers through the out-of-band management channel. This API is reserved for the DAOS system administrators who are authenticated through a specific certificate. The DAOS management API is intended to be integrated with different vendor-specific storage management or open-source orchestration frameworks. A CLI tool called dmg is built over the DAOS management API. For further reading on the management API and the dmg tool:
The DAOS library (libdaos) implements the DAOS storage model and is primarily targeted at application and I/O middleware developers who want to store datasets in DAOS containers. It can be used from any node connected to the fabric used by the targeted DAOS system. The application process is authenticated via the DAOS agent (see next section). The API exported by libdaos is commonly called the DAOS API (in contrast to the DAOS management API) and allows managing containers and accessing DAOS objects through different interfaces (e.g. key-value store or array API). The libdfs library emulates POSIX file and directory abstractions over libdaos and provides a smooth migration path for applications that require a POSIX namespace. For further reading on libdaos, bindings for different programming languages and libdfs:
The libdaos and libdfs libraries provide the foundation to support domain-specific data formats like HDF5 and Apache Arrow. For further reading on I/O middleware integration, please check the following external references:
The DAOS agent is a daemon residing on the client nodes. It interacts with the DAOS client library through dRPC to authenticate the application process. It is a trusted entity that can sign the DAOS Client credentials using local certificates. The DAOS agent can support different authentication frameworks and uses a Unix Domain Socket to communicate with the client library. The DAOS agent is written in Go and communicates through gRPC with the control plane component of each DAOS server to provide DAOS system membership information to the client library and to support pool listing.
As introduced in the previous section, DAOS uses three different communication channels.
gRPC provides a bi-directional secured channel for DAOS management. It relies on TLS/SSL to authenticate the administrator role and the servers. Protocol buffers are used for RPC serialization and all proto files are located in the proto directory.
dRPC is a communication channel built over Unix Domain Sockets that is used for inter-process communications. It provides both a C and a Go interface to support interactions between the DAOS agent and the client library, and between the control plane and the data plane of a DAOS server.
CART is a userspace function shipping library that provides low-latency high-bandwidth communications for the DAOS data plane. It supports RDMA capabilities and scalable collective operations. CART is built over Mercury and libfabric. The CART library is used for all communications between libdaos and daos_engine instances.
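The pairing of components and channels described above can be summarized as a small lookup table. This is only a restatement of the text in code form; the component labels and the `channel_between` helper are informal names chosen for this sketch.

```python
# Channel used between each pair of components, as described in the text.
# The labels are informal names for this sketch, not DAOS identifiers.
CHANNELS = {
    frozenset({"dmg", "control plane"}): "gRPC",
    frozenset({"agent", "control plane"}): "gRPC",
    frozenset({"control plane", "data plane"}): "dRPC",
    frozenset({"agent", "libdaos"}): "dRPC",
    frozenset({"libdaos", "data plane"}): "CART",
}


def channel_between(a, b):
    """Look up the channel for a pair of components, in either order."""
    try:
        return CHANNELS[frozenset({a, b})]
    except KeyError:
        raise LookupError(f"no direct channel described between {a} and {b}") from None
```

Keying the table on a `frozenset` makes the lookup symmetric, so `channel_between("libdaos", "data plane")` and the reversed call give the same answer.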
As shown in the diagram below, the DAOS stack is structured as a collection of storage services over a client/server architecture. Examples of DAOS services are the pool, container, object and rebuild services.
A DAOS service can be spread across the control and data planes and communicates internally through dRPC. Most services have client and server components that can synchronize through gRPC or CART. Cross-service communications are always done through direct API calls. Those function calls can be invoked across either the client or server component of the services. While each DAOS service is designed to be fairly autonomous and isolated, some are more tightly coupled than others. This is typically the case for the rebuild service, which needs to interact closely with the pool, container and object services to restore data redundancy after a DAOS server failure.
While the service-based architecture offers flexibility and extensibility, it is combined with a set of infrastructure libraries that provide a rich software ecosystem (e.g. communications, persistent storage access, asynchronous task execution with dependency graph, accelerator support, ...) accessible to all the DAOS services.
Each infrastructure library and service is allocated a dedicated directory under src/. The client and server components of a service are stored in separate files. Functions that are part of the client component are prefixed with dc\_ (stands for DAOS Client) whereas server-side functions use the ds\_ prefix (stands for DAOS Server). The protocol and RPC format used between the client and server components is usually defined in a header file named rpc.h.
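The prefix convention is easy to apply mechanically, for example when skimming a symbol list. The sketch below assumes nothing beyond the two prefixes named above; `classify_function` and the sample function names are hypothetical.

```python
def classify_function(name):
    """Return the service component implied by a function name prefix."""
    if name.startswith("dc_"):
        return "client"   # dc_ stands for DAOS Client
    if name.startswith("ds_"):
        return "server"   # ds_ stands for DAOS Server
    return "unknown"      # the convention says nothing about other names


# Hypothetical function names, used only to exercise the convention.
components = {n: classify_function(n) for n in ("dc_example_open", "ds_example_handler")}
```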
All the Go code executed in the context of the control plane is located under src/control. Management and security are the services that are spread across the control (Go language) and data (C language) planes and communicate internally through dRPC.
Headers for the official DAOS API exposed to the end user (i.e. I/O middleware or application developers) are under src/include and use the daos\_ prefix. Each infrastructure library exports an API that is available under src/include/daos and can be used by any service. The client-side API (with dc\_ prefix) exported by a given service is also stored under src/include/daos whereas the server-side interfaces (with ds\_ prefix) are under src/include/daos_srv.
The GURT and common DAOS (i.e. libdaos\_common) libraries provide logging, debugging and common data structures (e.g. hash table, btree, ...) to the DAOS services.
Local NVM storage is managed by the Versioning Object Store (VOS) and blob I/O (BIO) libraries. VOS implements the persistent index in SCM whereas BIO is responsible for storing application data in either NVMe SSD or SCM depending on the allocation strategy. The VEA layer is integrated into VOS and manages block allocation on NVMe SSDs.
DAOS objects are distributed across multiple targets for both performance (i.e. sharding) and resilience (i.e. replication or erasure code). The placement library implements different algorithms (e.g. ring-based placement, jump consistent hash, ...) to generate the layout of an object from the list of targets and the object identifier.
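As an illustration of one of the algorithms named above, here is a minimal sketch of the generic jump consistent hash published by Lamping and Veach. It is not the DAOS placement code: how DAOS derives the key from an object identifier and how it turns buckets into a full object layout are not covered here, and the function name is an assumption.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= 0xFFFFFFFFFFFFFFFF
    bucket, candidate = -1, 0
    while candidate < num_buckets:
        bucket = candidate
        # 64-bit linear congruential step, truncated like unsigned C arithmetic.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        # Jump forward to the next bucket at which this key would move.
        candidate = int((bucket + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return bucket
```

The property that makes this attractive for placement is that growing from n to n+1 buckets only moves keys into the new bucket; every other key keeps its previous assignment.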
Finally, the replicated service (RSVC) library provides some common code to support fault tolerance. It is used by the pool, container and management services in conjunction with the RDB library, which implements a replicated key-value store over Raft.
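Since RDB is built over Raft, its fault tolerance follows the usual majority rule: an update is committed once a majority of replicas have accepted it. The sketch below only shows that arithmetic; the function names are hypothetical, and the number of replicas actually used by each DAOS service is not stated here.

```python
def quorum_size(num_replicas):
    """Smallest number of replicas that forms a majority."""
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    return num_replicas // 2 + 1


def tolerated_failures(num_replicas):
    """Replicas that can fail while a majority remains available."""
    return num_replicas - quorum_size(num_replicas)
```

For example, going from three to four replicas raises the quorum size without tolerating any additional failure, which is why odd replica counts are the usual choice for Raft groups.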
For further reading on those infrastructure libraries, please see:
The diagram below shows the internal layering of the DAOS services and interactions with the different libraries mentioned above.
Vertical boxes represent DAOS services whereas horizontal ones are for infrastructure libraries.
For further reading on the internals of each service:
Interoperability in DAOS is handled via protocol and schema versioning for persistent data structures.
Limited protocol interoperability is to be provided by the DAOS storage stack. Version compatibility checks will be performed to verify that:
If a protocol version mismatch is detected among storage targets in the same pool, the entire DAOS system will fail to start up and will report failure to the control API. Similarly, connection from clients running a protocol version incompatible with the targets will return an error.
The schema of persistent data structures may evolve from time to time to fix bugs, add new optimizations or support new features. To that end, the persistent data structures support schema versioning.
Upgrading the schema version is not done automatically and must be initiated by the administrator. A dedicated upgrade tool will be provided to upgrade the schema version to the latest one. All targets in the same pool must have the same schema version. Version checks are performed at system initialization time to enforce this constraint.
To limit the validation matrix, each new DAOS release will be published with a list of supported schema versions. To run with the new DAOS release, administrators will then need to upgrade the DAOS system to one of the supported schema versions. New targets will always be reformatted with the latest version. This versioning scheme only applies to data structures stored in persistent memory and not to block storage, which only stores user data with no metadata.
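The two schema constraints described above (all targets in a pool share one schema version, and that version is on the release's supported list) can be sketched as a single check. This is an illustration under assumed names and integer version numbers, not the check DAOS performs at initialization.

```python
def check_pool_schema(target_versions, supported_versions):
    """Validate the schema versions of all targets in one pool.

    Returns the common schema version, or raises ValueError.
    """
    versions = set(target_versions)
    if not versions:
        raise ValueError("pool has no targets")
    if len(versions) > 1:
        raise ValueError(f"targets disagree on schema version: {sorted(versions)}")
    (version,) = versions
    if version not in supported_versions:
        # The administrator must run the upgrade tool before this release can start.
        raise ValueError(f"schema version {version} is not supported by this release")
    return version
```

A caller would run this once per pool at startup and refuse to bring the pool online when it raises.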
To be continued.