1. Introduction(简介)(readers may go through this section on their own)
This document provides software architecture information, development environment information and optimization guidelines.
本文档提供了软件架构信息、开发环境信息和优化指南。
For programming examples and for instructions on compiling and running each sample application, see the DPDK Sample Applications User Guide for details.
有关编程示例以及编译和运行每个示例应用程序的说明,请参阅DPDK示例应用程序用户指南了解详细信息。
For general information on compiling and running applications, see the DPDK Getting Started Guide.
有关编译和运行应用程序的一般信息,请参阅DPDK入门指南。
Link:
http://doc.dpdk.org/guides/prog_guide/intro.html#related-publications
2. Overview(概述)
This section gives a global overview of the architecture of Data Plane Development Kit (DPDK).
本节对数据平面开发工具包(DPDK)的体系结构进行了总体概述。
The main goal of the DPDK is to provide a simple, complete framework for fast packet processing in data plane applications. Users may use the code to understand some of the techniques employed, to build upon for prototyping or to add their own protocol stacks. Alternative ecosystem options that use the DPDK are available.
DPDK的主要目标是为数据平面应用程序中的快速数据包处理提供一个简单、完整的框架。用户可以使用这些代码来理解所使用的一些技术,以构建原型或添加自己的协议栈。可以使用DPDK的其他生态系统选项。
The framework creates a set of libraries for specific environments through the creation of an Environment Abstraction Layer (EAL), which may be specific to a mode of the Intel® architecture (32-bit or 64-bit), Linux* user space compilers or a specific platform. These environments are created through the use of make files and configuration files. Once the EAL library is created, the user may link with the library to create their own applications. Other libraries, outside of EAL, including the Hash, Longest Prefix Match (LPM) and rings libraries are also provided. Sample applications are provided to help show the user how to use various features of the DPDK.
框架创建一组特定的环境的库的集合从而创建了一个环境抽象层(EAL),此层可能是特定于Intel®体系结构模式(32位或64位),Linux用户空间*编译器或者一个特定的平台。这些环境是通过使用make文件和配置文件创建的。一旦创建了EAL库,用户可以链接该库来创建自己的应用程序。除了EAL之外,还提供了其他库,包括哈希、最长前缀匹配(LPM)和环库。本文提供了示例应用程序,以帮助用户了解如何使用DPDK的各种特性。
The DPDK implements a run to completion model for packet processing, where all resources must be allocated prior to calling Data Plane applications, running as execution units on logical processing cores. The model does not support a scheduler and all devices are accessed by polling. The primary reason for not using interrupts is the performance overhead imposed by interrupt processing.
DPDK为包处理实现了一个从运行到完成的模型,在这个模型中,在调用数据平面应用程序之前,必须分配所有资源,这些资源作为执行单元在逻辑处理核心上运行。该模型不支持调度程序,所有设备都通过轮询访问。不使用中断的主要原因是中断处理带来的性能开销。
In addition to the run-to-completion model, a pipeline model may also be used by passing packets or messages between cores via the rings. This allows work to be performed in stages and may allow more efficient use of code on cores.
除了运行到完成模型之外,管道模型还可用于通过环在内核之间传递数据包或消息。这允许分阶段执行工作,并允许更有效地在核心上使用代码。
2.1. Development Environment(开发环境)
The DPDK project installation requires Linux and the associated toolchain, such as one or more compilers, assembler, make utility, editor and various libraries to create the DPDK components and libraries.
DPDK项目的安装需要Linux和相关的工具链,如一个或多个编译器、汇编器、make实用程序、编辑器和各种库来创建DPDK组件和库。
Once these libraries are created for the specific environment and architecture, they may then be used to create the user’s data plane application.
一旦为特定的环境和体系结构创建了这些库,就可以使用它们来创建用户的数据平面应用程序。
When creating applications for the Linux user space, the glibc library is used. For DPDK applications, two environmental variables (RTE_SDK and RTE_TARGET) must be configured before compiling the applications. The following are examples of how the variables can be set:
在为Linux用户空间创建应用程序时,将使用glibc库。对于DPDK应用程序,在编译应用程序之前必须配置两个环境变量(RTE_SDK和RTE_TARGET)。以下是如何设置变量的例子:
2.2. Environment Abstraction Layer(环境抽象层)
The Environment Abstraction Layer (EAL) provides a generic interface that hides the environment specifics from the applications and libraries. The services provided by the EAL are:
环境抽象层(EAL)提供了一个通用接口,用于向应用程序和库隐藏环境细节。本处提供的服务包括:
Translator's note: CPU affinity is a scheduler property that "binds" a process to one CPU or to a set of CPUs.
The EAL is fully described in Environment Abstraction Layer.
详细地关于EAL的描述在第3章环境抽象层。
2.3. Core Components(核心组件)
The core components are a set of libraries that provide all the elements needed for high-performance packet processing applications.
核心组件是一组库,提供高性能数据包处理应用程序所需的所有元素。
2.3.1. Ring Manager (librte_ring) 环管理
The ring structure provides a lockless multi-producer, multi-consumer FIFO API in a finite size table. It has some advantages over lockless queues: it is easier to implement, adapted to bulk operations, and faster. A ring is used by the Memory Pool Manager (librte_mempool) and may be used as a general communication mechanism between cores and/or execution blocks connected together on a logical core.
环形结构在有限大小的表中提供了无锁的多生产者、多使用者FIFO API。
它比无锁队列有一些优势;易于实现,适应批量操作,速度更快。内存池管理器(librte_mempool)使用一个环,它可以用作连接在逻辑内核上的内核和/或执行块之间的通用通信机制。
This ring buffer and its usage are fully described in Ring Library.
环行结构的详细描述在第五章环形结构库
2.3.2. Memory Pool Manager (librte_mempool)(内存池管理)
The Memory Pool Manager is responsible for allocating pools of objects in memory. A pool is identified by name and uses a ring to store free objects. It provides some other optional services, such as a per-core object cache and an alignment helper to ensure that objects are padded to spread them equally on all RAM channels.
内存池管理器负责分配内存中的对象池。池按名称标识,并使用一个环来存储空闲对象。它还提供了其他一些可选服务,例如每个核心对象缓存和对齐助手,以确保填充对象,使其在所有RAM通道上均匀分布。
This memory pool allocator is described in Mempool Library.
内存池详细描述在第六章中。
2.3.3. Network Packet Buffer Management (librte_mbuf)
网络包缓存管理
The mbuf library provides the facility to create and destroy buffers that may be used by the DPDK application to store message buffers. The message buffers are created at startup time and stored in a mempool, using the DPDK mempool library.
mbuf库提供了创建和销毁DPDK应用程序用于存储消息缓冲区的缓冲区的功能.消息缓冲区在启动时创建,并使用DPDK mempool库存储在mempool中。
This library provides an API to allocate/free mbufs, manipulate packet buffers which are used to carry network packets.
这个库提供了一个API来分配/释放mbufs,操作用于传输网络数据包的数据包缓冲区。
Network Packet Buffer Management is described in Mbuf Library.
在Mbuf库中描述了网络数据包缓冲区管理
2.3.4. Timer Manager (librte_timer)时钟管理
This library provides a timer service to DPDK execution units, providing the ability to execute a function asynchronously. It can be periodic function calls, or just a one-shot call. It uses the timer interface provided by the Environment Abstraction Layer (EAL) to get a precise time reference and can be initiated on a per-core basis as required.
他的库为DPDK执行单元提供计时器服务,提供异步执行函数的能力。它可以是周期性的函数调用,也可以是一次性调用。它使用由环境抽象层(EAL)提供的定时器接口来获得精确的时间引用,并且可以根据需要在每个核的基础上启动。库文档可在定时器库中获得。
The library documentation is available in Timer Library.
该库的详细描述在第19章
2.4. Ethernet Poll Mode Driver Architecture
The DPDK includes Poll Mode Drivers (PMDs) for 1 GbE, 10 GbE and 40 GbE, and paravirtualized virtio Ethernet controllers, which are designed to work without asynchronous, interrupt-based signaling mechanisms.
DPDK包括针对1个GbE、10个GbE和40个GbE的轮询模式驱动程序(PMDs),以及para虚拟virtio以太网控制器,其用于在没有异步、基于中断的信令机制的情况下工作。
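Translator's note: virtio is an I/O paravirtualization solution, a set of generic I/O device virtualization routines; it is an abstraction over a set of common I/O devices in a paravirtualized hypervisor.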
2.5. Packet Forwarding Algorithm Support (包转发算法的支持)
The DPDK includes Hash (librte_hash) and Longest Prefix Match (LPM,librte_lpm) libraries to support the corresponding packet forwarding algorithms.
DPDK包括哈希(librte_hash)和最长前缀匹配(LPM,librte_lpm)库,以支持相应的包转发算法。
See Hash Library and LPM Library for more information.
有关更多信息,请参见哈希库(20章)和LPM库(23章)。
2.6. librte_net
The librte_net library is a collection of IP protocol definitions and convenience macros. It is based on code from the FreeBSD* IP stack and contains protocol numbers (for use in IP headers), IP-related macros, IPv4/IPv6 header structures and TCP, UDP and SCTP header structures.
librte_net库是IP协议定义和convenience macros的集合。它基于FreeBSD* IP栈的代码,包含协议编号(用于IP报头)、与IP相关的宏、IPv4/IPv6报头结构和TCP、UDP和SCTP报头结构。
3. Environment Abstraction Layer(环境抽象层)
The Environment Abstraction Layer (EAL) is responsible for gaining access to low-level resources such as hardware and memory space. It provides a generic interface that hides the environment specifics from the applications and libraries. It is the responsibility of the initialization routine to decide how to allocate these resources (that is, memory space, devices, timers, consoles, and so on).
环境抽象层(EAL)负责访问底层资源,如硬件和内存空间。它提供了一个通用接口,用于向应用程序和库隐藏环境细节。初始化例程负责决定如何分配这些资源(即内存空间、设备、计时器、控制台等等)。
Typical services expected from the EAL are:
EAL的典型服务:
3.1. EAL in a Linux-userland Execution Environment(EAL在Linux用户级环境中执行)
In a Linux user space environment, the DPDK application runs as a user-space application using the pthread library.
在Linux用户空间环境中,DPDK应用程序使用pthread库作为用户空间应用程序运行。
The EAL performs physical memory allocation using mmap() in hugetlbfs (using huge page sizes to increase performance). This memory is exposed to DPDK service layers such as the Mempool Library.
EAL在hugetlbfs文件系统中使用mmap()执行物理内存分配(使用巨大的页面大小来提高性能)。该内存公开给DPDK服务层,如Mempool库。
Translator's note: Linux supports huge pages through the special hugetlbfs file system.
At this point, the DPDK services layer will be initialized, then through pthread setaffinity calls, each execution unit will be assigned to a specific logical core to run as a user-level thread.
此时,DPDK服务层将被初始化,然后通过 pthread setaffinity调用,将每个执行单元分配到一个特定的逻辑核心,作为用户级线程运行。
The time reference is provided by the CPU Time-Stamp Counter (TSC) or by the HPET kernel API through a mmap() call.
时间基准由CPU时间戳计数器(TSC)或HPET内核API通过mmap()调用提供。
Part of the initialization is done by the start function of glibc. A check is also performed at initialization time to ensure that the microarchitecture type chosen in the config file is supported by the CPU. Then, the main() function is called. The core initialization and launch is done in rte_eal_init() (see the API documentation). It consists of calls to the pthread library (more specifically, pthread_self(), pthread_create(), and pthread_setaffinity_np()).
初始化的一部分是由glibc的start函数完成的。检查需要在初始化执行时被执行,以确保配置文件中选择的微体系结构类型得到CPU的支持。然后,调用main()函数。核心初始化和启动在rte_eal_init()中完成(请参阅API文档)。它由pthread库(更具体地说,pthread_self()、pthread_create()和pthread_setaffinity_np())的调用组成。
备注:Initialization of objects, such as memory zones, rings, memory pools, lpm tables and hash tables, should be done as part of the overall application initialization on the master lcore. The creation and initialization functions for these objects are not multi-thread safe. However, once initialized, the objects themselves can safely be used in multiple threads simultaneously.
对象(如内存区域、环、内存池、lpm表和散列表)的初始化应该作为主lcore上的整个应用程序初始化的一部分来完成。这些对象的创建和初始化函数不是多线程安全的。然而,一旦初始化,对象本身就可以安全地同时在多个线程中使用。
3.1.2. Shutdown and Cleanup
During the initialization of EAL, resources such as hugepage-backed memory can be allocated by core components. The memory allocated during rte_eal_init() can be released by calling the rte_eal_cleanup() function. Refer to the API documentation for details.
在EAL资源初始化过程中,可以通过核心组件来分配巨页内存等资源。在rte_eal_init()期间分配的内存可以通过调用rte_eal_cleanup()函数来释放。有关详细信息,请参阅API文档。
3.1.3. Multi-process Support
The Linuxapp EAL allows a multi-process as well as a multi-threaded (pthread) deployment model. See chapter Multi-process Support for more details.
Linuxapp EAL允许多进程以及多线程(pthread)部署模型。有关更多细节,请参见“3.3章多进程支持”一章
3.1.4. Memory Mapping Discovery and Memory Reservation
内存映射发现和内存分配
The allocation of large contiguous physical memory is done using the hugetlbfs kernel filesystem. The EAL provides an API to reserve named memory zones in this contiguous memory. The physical address of the reserved memory for that memory zone is also returned to the user by the memory zone reservation API.
大型连续物理内存的分配是使用hugetlbfs内核文件系统来完成的。EAL提供了一个API来在这个连续内存中分配已经被命名的内存区域。通过内存区域预留API接口将内存区域即用户预留的内存的物理地址返回给客户
There are two modes in which DPDK memory subsystem can operate: dynamic mode, and legacy mode. Both modes are explained below.
DPDK内存子系统可以运行两种模式:动态模式和预分配内存模式。下面将解释这两种模式。
备注:Memory reservations done using the APIs provided by rte_malloc are also backed by pages from the hugetlbfs filesystem.
使用rte_malloc提供的api完成的内存保留也得到hugetlbfs文件系统页面的支持。
Translator's note: IOVA stands for IO virtual address; it can be understood as the virtual address a device uses for input/output.
备注:Preallocated virtual memory is not to be confused with preallocated hugepage memory! All DPDK processes preallocate virtual memory at startup. Hugepages can later be mapped into that preallocated VA space (if dynamic memory mode is enabled), and can optionally be mapped into it at startup.
预分配的虚拟内存不能与预分配的大内存混淆!所有DPDK进程在启动时预分配虚拟内存。DPDK进程启动,Hugepages会被映射到预先分配的VA空间(如果启用了动态内存模式),也可以在启动时将其可选地映射到其中。
3.1.5. Support for Externally Allocated Memory支持外部分配内存
It is possible to use externally allocated memory in DPDK. There are two ways in which using externally allocated memory can work: the malloc heap APIs, and manual memory management.
可以在DPDK中使用外部分配的内存。使用外部分配的内存有两种方式可以工作:malloc堆API和手动内存管理。
The expected workflow is as follows:(预期的工作流程如下)
Get a pointer to memory area
Create a named heap
Add memory area(s) to the heap
If IOVA table is not specified, IOVA addresses will be assumed to be unavailable, and DMA mappings will not be performed
如果没有指定IOVA表,则假定IOVA地址不可用,并且不执行DMA映射
Other processes must attach to the memory area before they can use it
Get socket ID used for the heap
Use normal DPDK allocation procedures, using supplied socket ID
If memory area is no longer needed, it can be removed from the heap
Other processes must detach from this memory area before it can be removed.
其他进程必须先从该内存区域分离,然后才能删除该内存区域
If heap is no longer needed, remove it
Socket ID will become invalid and will not be reused.
套接字ID将无效,不会被重用
For more information, please refer to rte_malloc API documentation, specifically the rte_malloc_heap_* family of function calls.
有关更多信息,请参考rte_malloc API文档,特别是rte_malloc_heap_*系列函数调用。
3.1.6. Per-lcore and Shared Variables(每个CPU核的共享变量)
备注:lcore refers to a logical execution unit of the processor, sometimes called a hardware thread.
lcore是处理器的逻辑执行单元,有时也称为硬件线程。
Translator's note: Software threads are threads of execution managed by the operating system.
Hardware threads are a feature of some processors that allow better utilisation of the processor under some circumstances. They may be exposed to/by the operating system as appearing to be additional cores (“hyperthreading”).
On processors, cores and hardware threads: https://software.intel.com/en-us/blogs/2008/10/29/on-processors-cores-and-hardware-threads
On hyper-threading: http://www.cs.virginia.edu/~mc2zk/cs451/vol6iss1_art01.pdf
Shared variables are the default behavior. Per-lcore variables are implemented using Thread Local Storage (TLS) to provide per-thread local storage.
共享变量是默认行为。Per-lcore(可以认为是每个CPU核)变量使用线程本地存储(TLS)实现,以提供每个线程的本地存储。
3.1.7. Logs
A logging API is provided by EAL. By default, in a Linux application, logs are sent to syslog and also to the console. However, the log function can be overridden by the user to use a different logging mechanism.
EAL提供了一个日志API。默认情况下,在Linux应用程序中,日志被发送到syslog和控制台。但是,用户可以重写日志函数以使用不同的日志记录机制。
3.1.7.1. Trace and Debug Functions
There are some debug functions to dump the stack in glibc. The rte_panic() function can voluntarily raise SIGABRT, which can trigger the generation of a core file, readable by gdb.
在glibc中有一些调试函数可以转储堆栈。rte_panic()函数可以自动触发SIG_ABORT,它可以触发生成一个gdb可读的核心文件。
3.1.8. CPU Feature Identification
The EAL can query the CPU at runtime (using the rte_cpu_get_features() function) to determine which CPU features are available.
EAL可以在运行时查询CPU(使用rte_cpu_get_features()函数),以确定哪些CPU特性可用。
3.1.9. User Space Interrupt Event
Note: In DPDK PMD, the only interrupts handled by the dedicated host thread are those for link status change (link up and link down notification) and for sudden device removal.
在DPDK PMD中,由专用主机线程处理的唯一中断是用于更改链接状态(链接上和链接下通知)和突然删除设备的中断。
Note: Per-queue RX interrupt events are only allowed in VFIO, which supports multiple MSI-X vectors. In UIO, the RX interrupt shares the same vector with the other interrupt causes. In this case, when the RX interrupt and the LSC (link status change) interrupt are both enabled (intr_conf.lsc == 1 && intr_conf.rxq == 1), only the former takes effect.
每个队列RX中断事件仅仅在VFIO中支持多个MSI-X向量。在UIO中,RX中断与其他中断共享相同的向量。在这种情况下,当RX中断和LSC(链接状态改变)中断都被启用(intr_conf.lsc == 1 && intr_conf.rxq == 1),只有前者生效。
Translator's note: MSI-X is an interrupt mechanism of the PCIe bus.
RX interrupts are controlled/enabled/disabled by the ethdev APIs ‘rte_eth_dev_rx_intr_*’. These functions return failure if the PMD does not support them yet. The intr_conf.rxq flag is used to turn on the RX interrupt capability per device.
RX中断由ethdev APIs - ’ rte_eth_dev_rx_intr_* '控制/启用/禁用。如果PMD还不支持他们,他们就会返回失败。intr_conf.rxq标志用于打开每个设备的RX中断功能。
3.1.10. Blacklisting(黑名单)
The EAL PCI device blacklist functionality can be used to mark certain NIC ports as blacklisted, so they are ignored by the DPDK. The ports to be blacklisted are identified using the PCIe* description (Domain:Bus:Device.Function).
EAL PCI设备黑名单功能可以将某些网卡端口标记为黑名单,因此被DPDK忽略。使用PCIe* description (Domain:Bus:Device.Function)标识要列入黑名单的端口。
3.1.11. Misc Functions(混合功能)
Locks and atomic operations are per-architecture (i686 and x86_64).
锁和原子操作是按体系结构进行的(i686和x86_64)。
3.1.12. IOVA Mode Configuration
Auto detection of the IOVA mode, based on probing the bus and IOMMU configuration, may not report the desired addressing mode when virtual devices that are not directly attached to the bus are present. To facilitate forcing the IOVA mode to a specific value the EAL command line option --iova-mode can be used to select either physical addressing(‘pa’) or virtual addressing(‘va’).
当没虚拟设备没有直接连接到当前的总线的时候,基于探测总线和IOMMU配置可以自动检测IOVA模式,可能不会报告所需的寻址模式时。
为了方便将IOVA模式强制设置为特定值,EAL命令行选项--iova-mode可用于选择物理寻址(“pa”)或虚拟寻址(“va”)。
3.2. Memory Segments and Memory Zones (memzone)(内存段和内存区)
The mapping of physical memory is provided by this feature in the EAL. As physical memory can have gaps, the memory is described in a table of descriptors, and each descriptor (called rte_memseg ) describes a physical page.
物理内存的映射是由EAL提供的。由于物理内存可能有间隙,内存在描述符表中进行描述,每个描述符(称为rte_memseg)描述一个物理页面。
On top of this, the memzone allocator’s role is to reserve contiguous portions of physical memory. These zones are identified by a unique name when the memory is reserved.
在此之上,memzone分配器的作用是预留物理内存的连续部分。当内存被预留时,这些区域由唯一的名称标识。
The rte_memzone descriptors are also located in the configuration structure. This structure is accessed using rte_eal_get_configuration(). The lookup (by name) of a memory zone returns a descriptor containing the physical address of the memory zone.
rte_memzone描述符也位于配置结构中。使用rte_eal_get_configuration()访问该结构。内存区域的查找(按名称)返回一个描述符,其中包含内存区域的物理地址。
Memory zones can be reserved with specific start address alignment by supplying the align parameter (by default, they are aligned to cache line size). The alignment value should be a power of two and not less than the cache line size (64 bytes). Memory zones can also be reserved from either 2 MB or 1 GB hugepages, provided that both are available on the system.
通过提供align参数,可以使用特定的起始地址对齐来保留内存区域(默认情况下,它们与缓存行大小对齐)。对齐值应为2的幂,且不小于缓存行大小(64字节)。内存区域也可以从2 MB或1 GB的Hugepages中保留,前提是这两个区域在系统上都可用。
Both memsegs and memzones are stored using rte_fbarray structures. Please refer to DPDK API Reference for more information.
Memsegs和MemZone都使用rte_FBarray结构存储。有关详细信息,请参阅DPDK API参考。
3.3. Multiple pthread
DPDK usually pins one pthread per core to avoid the overhead of task switching. This allows for significant performance gains, but lacks flexibility and is not always efficient.
DPDK通常为每个内核固定一个pthread,以避免任务切换的开销。这可以显著提高性能,但缺乏灵活性,而且并不总是有效的。
Power management helps to improve the CPU efficiency by limiting the CPU runtime frequency. Alternatively, it is possible to utilize the available idle cycles to take advantage of the full capability of the CPU.
电源管理通过限制CPU运行时频率有助于提高CPU效率。但是,也可以利用空闲周期来充分利用CPU的全部功能。
By taking advantage of cgroup, the CPU utilization quota can be simply assigned. This gives another way to improve CPU efficiency; however, there is a prerequisite: DPDK must handle the context switching between multiple pthreads per core.
利用cgroup,可以简单地分配CPU利用率配额。这提供了另一种提高CPU效率的方法,但是,这是一个先决条件;DPDK必须处理每个核心多个pthread之间的上下文切换。
For further flexibility, it is useful to set pthread affinity not only to a CPU but to a CPU set.
为了进一步的灵活性,设置pthread亲合性不仅对CPU有用,而且对CPU集也有用。
3.3.1. EAL pthread and lcore Affinity
The term “lcore” refers to an EAL thread, which is really a Linux/FreeBSD pthread. “EAL pthreads” are created and managed by EAL and execute the tasks issued by remote_launch. In each EAL pthread, there is a TLS (Thread Local Storage) called _lcore_id for unique identification. As EAL pthreads usually bind 1:1 to the physical CPU, the _lcore_id is typically equal to the CPU ID.
术语“lcore”指的是EAL线程,它实际上是一个Linux/FreeBSD pthread。“EAL pthreads”由EAL创建和管理,并执行remote_launch发出的任务。在每个EAL pthread中,都有一个名为_lcore_id的TLS(线程本地存储)用于惟一标识。由于EAL pthreads通常将1:1绑定到物理CPU,所以_lcore_id通常等于CPU ID。
When using multiple pthreads, however, the binding is no longer always 1:1 between an EAL pthread and a specified physical CPU. The EAL pthread may have affinity to a CPU set, and as such the _lcore_id will not be the same as the CPU ID. For this reason, there is an EAL long option ‘--lcores’ defined to assign the CPU affinity of lcores. For a specified lcore ID or ID group, the option allows setting the CPU set for that EAL pthread.
然而,在使用多个pthread时,EAL pthread和指定的物理CPU之间的绑定不再总是1:1。EAL pthread可能与一个CPU集具有关联性,因此,lcore id将与CPU id不同。因此,定义了一个EAL long选项“–lcores”,以分配lcores的CPU亲和力。对于指定的lcore id或id组,该选项允许为该eal pthread设置CPU集。
The format pattern:
–lcores=’
‘lcore_set’ and ‘cpu_set’ can be a single number, range or a group.
a number is a “digit([0-9]+)”;
a range is “-”;
a group is “(
If a ‘@cpu_set’ value is not supplied, the value of ‘cpu_set’ will default to the value of ‘lcore_set’.
Using this option, for each given lcore ID, the associated CPUs can be assigned. It’s also compatible with the pattern of corelist(‘-l’) option.
使用此选项,可以为每个给定的LCore ID分配相关的CPU。它还与corelist(“-l”)选项的模式兼容。
3.3.2. non-EAL pthread support
It is possible to use the DPDK execution context with any user pthread (a.k.a. non-EAL pthreads). In a non-EAL pthread, the _lcore_id is always LCORE_ID_ANY, which identifies that it is not an EAL thread with a valid, unique _lcore_id. Some libraries will use an alternative unique ID (e.g. TID), some will not be impacted at all, and some will work but with limitations (e.g. timer and mempool libraries).
可以将DPDK执行上下文与任何用户pthread(也称为非EAL pthread)一起使用。在非EAL pthread中,_lcore_id始终为LCORE_ID_ANY,它标识它不是具有有效唯一的_lcore_id的EAL线程。 有些库将使用备用唯一ID(例如TID),有些库根本不受影响,有些库可以使用但有限制(例如计时器和mempool库)。
All these impacts are mentioned in Known Issues section.
所有这些影响都在“已知问题”部分中提到。
3.3.3. Public Thread API
There are two public APIs rte_thread_set_affinity() and rte_thread_get_affinity() introduced for threads. When they’re used in any pthread context, the Thread Local Storage(TLS) will be set/get.
为线程引入了两个公共API rte_thread_set_affinity()和rte_thread_get_affinity()。当它们在任何pthread上下文中使用时,将设置/获取线程本地存储(TLS)。
Those TLS include _cpuset and _socket_id:
3.3.4. Known Issues
rte_mempool
The rte_mempool uses a per-lcore cache inside the mempool. For non-EAL pthreads, rte_lcore_id() will not return a valid number. So for now, when rte_mempool is used with non-EAL pthreads, the put/get operations will bypass the default mempool cache and there is a performance penalty because of this bypass. Only user-owned external caches can be used in a non-EAL context in conjunction with rte_mempool_generic_put() and rte_mempool_generic_get() that accept an explicit cache parameter.
rte_mempool在mempool中使用每个lcore缓存。对于非eal pthreads, rte_lcore_id()将不返回有效的数字。所以现在,当rte_mempool与非EAL pthreads一起使用时,put / get操作将绕过默认的mempool缓存,并且由于这种旁路而存在性能损失。只有用户拥有外部缓存时才能在非EAL上下文中与rte_mempool_generic_put()和rte_mempool_generic_get()一起使用,并接受显式缓存参数。
rte_ring
rte_ring supports multi-producer enqueue and multi-consumer dequeue. However, it is non-preemptive; this has the knock-on effect of making rte_mempool non-preemptable.
RTE-Ring支持多生产者排队和多消费者排队。但是,它是不可抢占的,这有一个使RTE内存池不可抢占的连锁反应。
备注:The “non-preemptive” constraint means:(“非抢占式”约束是指)
a pthread doing multi-producers enqueues on a given ring must not be preempted by another pthread doing a multi-producer enqueue on the same ring.
在给定环上执行多个生产者排队的pthread不能被在同一环上执行多个生产者排队的另一个pthread抢先。
a pthread doing multi-consumers dequeues on a given ring must not be preempted by another pthread doing a multi-consumer dequeue on the same ring.
在给定环上执行多使用者出列的pthread不能被在同一环上执行多使用者出列的另一个pthread抢先。
Bypassing this constraint may cause the 2nd pthread to spin until the 1st one is scheduled again. Moreover, if the 1st pthread is preempted by a context that has a higher priority, it may even cause a deadlock.
绕过此约束可能会导致第二个pthread旋转,直到第一个pthread再次被调度。此外,如果第一个pthread被具有更高优先级的上下文抢先,它甚至可能导致死锁。
This means, use cases involving preemptible pthreads should consider using rte_ring carefully.
这意味着,涉及可抢占pthreads的用例应该仔细考虑使用rte-ring。
1 It CAN be used for preemptible single-producer and single-consumer use case.
它可以用于可抢占的单生产者和单消费者用例。
2 It CAN be used for non-preemptible multi-producer and preemptible single-consumer use case.
它可以用于不可抢占的多生产者和可抢占的单消费者用例。
3 It CAN be used for preemptible single-producer and non-preemptible multi-consumer use case.
它可以用于可抢占的单生产者和不可抢占的多消费者用例。
4 It MAY be used by preemptible multi-producer and/or preemptible multi-consumer pthreads whose scheduling policies are all SCHED_OTHER(cfs), SCHED_IDLE or SCHED_BATCH. Users SHOULD be aware of the performance penalty before using it.
它可以由可抢占的多生产商和/或可抢占的多用户pthreads使用,其调度策略都是sched_other(cfs)、sched_idle或sched_batch。用户在使用前应注意性能惩罚。
5 It MUST NOT be used by multi-producer/consumer pthreads whose scheduling policies are SCHED_FIFO or SCHED_RR.
多生产商/消费者pthreads不得使用它,其调度策略为sched_fifo或sched_rr。
rte_timer
Running rte_timer_manage() on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
不允许在非eal pthread上运行rte_timer_manage()。但是,允许从非EAL pthread重置/停止计时器。
rte_log
In non-EAL pthreads, there is no per-thread loglevel and logtype; global loglevels are used.
在非eal pthreads中,每个线程不存在个自的日志级别和日志类型,使用全局日志级别。
misc
The debug statistics of rte_ring, rte_mempool and rte_timer are not supported in a non-EAL pthread.
非eal pthread不支持rte_ring、rte_mempool和rte_timer的调试统计信息。
3.3.5. cgroup control
The following is a simple example of cgroup control usage: two pthreads (t0 and t1) do packet I/O on the same core ($CPU), and we expect only 50% of the CPU to be spent on packet I/O.
下面是使用cgroup control的一个简单示例,有两个pthread (t0和t1)在同一个内核(CPU)上执行包I/O。我们期望在包IO上只花费50%的CPU。
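Translator's note: Cgroups is short for control groups, a mechanism provided by the Linux kernel to limit, account for, and isolate the physical resources (such as CPU, memory, and I/O) used by process groups.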
3.4. Malloc
The EAL provides a malloc API to allocate any-sized memory.
EAL提供了一个malloc API来分配任意大小的内存。
The objective of this API is to provide malloc-like functions to allow allocation from hugepage memory and to facilitate application porting. The DPDK API Reference manual describes the available functions.
此API的目标是提供类似malloc的函数,以允许从大页面内存进行分配并便于应用程序移植。DPDK API参考手册介绍了可用的功能。
Typically, these kinds of allocations should not be done in data plane processing because they are slower than pool-based allocation and make use of locks within the allocation and free paths. However, they can be used in configuration code.
通常,这些类型的分配不应该在数据平面处理中完成,因为它们比基于池的分配慢,并且在分配和自由路径中使用锁。但是,可以在配置代码中使用它们。
Refer to the rte_malloc() function description in the DPDK API Reference manual for more information.
有关更多信息,请参阅DPDK API参考手册中的rte_malloc()函数描述。
3.4.1. Cookies
When CONFIG_RTE_MALLOC_DEBUG is enabled, the allocated memory contains overwrite protection fields to help identify buffer overflows.
启用CONFIG_RTE_MALLOC_DEBUG时,分配的内存包含覆盖保护字段,以帮助识别缓冲区溢出。
3.4.2. Alignment and NUMA Constraints(对齐和NUMA约束)
The rte_malloc() takes an align argument that can be used to request a memory area that is aligned on a multiple of this value (which must be a power of two).
rte_malloc()接受一个align参数,该参数可用于请求与该值的倍数(必须是2的幂)对齐的内存区域
On systems with NUMA support, a call to the rte_malloc() function will return memory that has been allocated on the NUMA socket of the core which made the call. A set of APIs is also provided to allow memory to be explicitly allocated on a given NUMA socket, or on the NUMA socket where another core is located, for the case where the memory is to be used by a logical core other than the one doing the memory allocation.
在支持numa的系统上,对rte_malloc()函数的调用将返回在调用核心的numa套接字上分配的内存。还提供了一组API,允许直接在一个NUMA套接字上显式分配内存,或者通过在另一个内核所在的NUMA套接字上分配内存,在这种情况下,内存将由一个逻辑内核使用,而不是由一个进行内存分配的内核使用。
3.4.3. Use Cases
This API is meant to be used by an application that requires malloc-like functions at initialization time.
此API用于初始化时需要类似malloc函数的应用程序。
For allocating/freeing data at runtime, in the fast-path of an application, the memory pool library should be used instead.
为了在运行时分配/释放数据,应在应用程序的快速路径中使用内存池库。
3.4.4. Internal Implementation
3.4.4.1. Data Structures
There are two data structure types used internally in the malloc library:
3.4.4.1.1. Structure: malloc_heap
The malloc_heap structure is used to manage free space on a per-socket basis. Internally, there is one heap structure per NUMA node, which allows us to allocate memory to a thread based on the NUMA node on which this thread runs. While this does not guarantee that the memory will be used on that NUMA node, it is no worse than a scheme where the memory is always allocated on a fixed or random node.
malloc_heap结构用于在每个套接字的基础上管理空闲空间。在内部,每个numa节点有一个堆结构,它允许我们根据运行该线程的numa节点为线程分配内存。虽然这不能保证内存将在该NUMA节点上使用,但它并不比总是在固定或随机节点上分配内存的方案差。
The key fields of the heap structure and their function are described below (see also diagram above):
堆结构的关键字段及其功能如下(参见上图):
备注:If the usage of a particular field in one of the above three usages is not described, the field can be assumed to have an undefined value in that situation, for example, for padding headers only the “state” and “pad” fields have valid values.
如果在上述三种用法之一中没有描述特定字段的用法,则可以假定该字段在这种情况下具有未定义的值,例如,对于填充头,只有“state”和“pad”字段具有有效值。
3.4.4.2. Memory Allocation
On EAL initialization, all preallocated memory segments are set up as part of the malloc heap. This setup involves placing an element header with FREE at the start of each virtually contiguous segment of memory. The FREE element is then added to the free_list for the malloc heap.
在EAL初始化时,所有预分配的内存段都设置为malloc堆的一部分。这个设置包括在每个几乎连续的内存段的开头放置一个带有free的元素头。然后将free元素添加到malloc堆的free_列表中。
This setup also happens whenever memory is allocated at runtime (if supported), in which case newly allocated pages are also added to the heap, merging with any adjacent free segments if there are any.
每当在运行时分配内存(如果支持)时,也会发生这种设置,在这种情况下,新分配的页面也会添加到堆中,如果存在任何相邻的空闲段,则会与之合并。
When an application makes a call to a malloc-like function, the malloc function will first index the lcore_config structure for the calling thread, and determine the NUMA node of that thread. The NUMA node is used to index the array of malloc_heap structures which is passed as a parameter to the heap_alloc() function, along with the requested size, type, alignment and boundary parameters.
当应用程序调用类似malloc的函数时,malloc函数将首先索引调用线程的lcore-config结构,并确定该线程的numa节点。numa节点用于索引作为参数传递给heap_alloc()函数的malloc_堆结构数组,以及请求的大小、类型、对齐方式和边界参数。
The heap_alloc() function will scan the free_list of the heap, and attempt to find a free block suitable for storing data of the requested size, with the requested alignment and boundary constraints.
When a suitable free element has been identified, the pointer to be returned to the user is calculated. The cache-line of memory immediately preceding this pointer is filled with a struct malloc_elem header. Because of alignment and boundary constraints, there could be free space at the start and/or end of the element, resulting in the following behavior:
The advantage of allocating the memory from the end of the existing element is that no adjustment of the free list needs to take place - the existing element on the free list just has its size value adjusted, and the next/previous elements have their "prev"/"next" pointers redirected to the newly created element.
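A minimal sketch of the scan and of carving the allocation from the end of a free element is given below. It is an illustration under stated assumptions, not the DPDK implementation: the names free_elem, fit_from_end and alloc_from_list are hypothetical, the header is assumed to occupy one 64-byte cache line, and the boundary constraint, the prev/next relinking and the removal of fully consumed elements are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_SIZE 64u  /* assumption: the element header fills one cache line */

struct free_elem {
    uintptr_t start;         /* address of the element header */
    size_t size;             /* header plus data, in bytes */
    struct free_elem *next;  /* next element on the free list */
};

/* Highest aligned data address such that [data, data + size) lies inside
 * the element and a header still fits in front of it; 0 if nothing fits. */
static uintptr_t fit_from_end(const struct free_elem *e, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    if (size > e->size)
        return 0;
    uintptr_t end = e->start + e->size;
    uintptr_t data = (end - size) & ~(uintptr_t)(align - 1); /* round down */
    if (data < e->start + HDR_SIZE)
        return 0;
    return data;
}

/* First-fit scan of the free list. On success the free element is only
 * shrunk in place; its position on the free list does not change. */
static uintptr_t alloc_from_list(struct free_elem *list, size_t size, size_t align)
{
    for (struct free_elem *e = list; e != NULL; e = e->next) {
        uintptr_t data = fit_from_end(e, size, align);
        if (data == 0)
            continue;
        uintptr_t new_hdr = data - HDR_SIZE; /* header of the new element */
        e->size = new_hdr - e->start;        /* the front part stays free */
        return data;
    }
    return 0; /* no suitable element found */
}
```

Because the new element is cut from the tail, the only field of the free element that changes is its size, which matches the advantage described above.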
If there is not enough memory in the heap to satisfy an allocation request, EAL will attempt to allocate more memory from the system (if supported) and, following a successful allocation, will retry reserving the memory. In a multiprocessing scenario, all primary and secondary processes will synchronize their memory maps to ensure that any valid pointer to DPDK memory is guaranteed to be valid at all times in all currently running processes.
Failure to synchronize memory maps in one of the processes will cause the allocation to fail, even though some of the processes may have allocated the memory successfully. The memory is not added to the malloc heap unless the primary process has ensured that all other processes have mapped this memory successfully.
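The expand-and-retry flow, including the rule that new memory joins the heap only after every process has mapped it, can be sketched roughly as follows. This is a toy model, not DPDK code: toy_heap, sync_memory_maps, try_reserve and alloc_with_expand are hypothetical names, the heap is reduced to a byte counter, and the outcome of each process's mapping attempt is passed in as a flag array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPROC 3  /* hypothetical: one primary plus two secondary processes */

struct toy_heap { size_t free_bytes; };

/* map_ok[i] tells whether process i managed to map the new pages.
 * A single failure fails the whole expansion. */
static bool sync_memory_maps(const bool map_ok[NPROC])
{
    for (int i = 0; i < NPROC; i++)
        if (!map_ok[i])
            return false;
    return true;
}

/* Try to take len bytes out of the heap as it is now. */
static bool try_reserve(struct toy_heap *h, size_t len)
{
    if (h->free_bytes < len)
        return false;
    h->free_bytes -= len;
    return true;
}

/* Reserve len bytes; if the heap is too small, grow it by whole pages,
 * but only after all processes have mapped the new memory, then retry. */
static bool alloc_with_expand(struct toy_heap *h, size_t len, size_t page_sz,
                              const bool map_ok[NPROC])
{
    assert(page_sz != 0);
    if (try_reserve(h, len))
        return true;
    size_t new_bytes = (len + page_sz - 1) / page_sz * page_sz; /* round up */
    if (!sync_memory_maps(map_ok))
        return false;               /* heap is left untouched */
    h->free_bytes += new_bytes;     /* pages join the heap only now */
    return try_reserve(h, len);     /* retry the original request */
}
```

Note that in the failure case the heap counter is unchanged, mirroring the statement that memory is not added to the heap unless all processes have mapped it.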
Any successful allocation event will trigger a callback, for which user applications and other DPDK subsystems can register. Additionally, validation callbacks will be triggered before allocation if the newly allocated memory would exceed a threshold set by the user, giving the user a chance to allow or deny the allocation.
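The interplay between the validation callback and the event callback might look like the sketch below. It only models the ordering described above (validate before, notify after) and deliberately does not reproduce the real DPDK callback signatures; mem_hooks, validator_fn, event_fn, expand, deny_all and count_event are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook types: the validator returns 0 to allow, -1 to deny. */
typedef int  (*validator_fn)(size_t limit, size_t new_total);
typedef void (*event_fn)(size_t len, void *arg);

struct mem_hooks {
    size_t limit;          /* threshold set by the user */
    validator_fn validate; /* consulted only when the limit would be exceeded */
    event_fn on_alloc;     /* called after every successful allocation */
    void *arg;             /* opaque pointer passed to on_alloc */
};

/* Grow the amount of allocated memory by len bytes, honouring the hooks. */
static int expand(size_t *total, size_t len, const struct mem_hooks *hk)
{
    size_t new_total = *total + len;
    if (hk->validate != NULL && new_total > hk->limit &&
        hk->validate(hk->limit, new_total) < 0)
        return -1;                    /* denied before anything is allocated */
    *total = new_total;
    if (hk->on_alloc != NULL)
        hk->on_alloc(len, hk->arg);   /* notify after success */
    return 0;
}

/* Example validator that refuses every allocation above the limit. */
static int deny_all(size_t limit, size_t new_total)
{
    (void)limit;
    (void)new_total;
    return -1;
}

/* Example event callback that counts allocation events. */
static void count_event(size_t len, void *arg)
{
    (void)len;
    ++*(int *)arg;
}
```

With a limit of 1000 bytes and deny_all installed, a first 600-byte expansion succeeds and fires one event, while a second one is rejected without firing any.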
Note: Any allocation of new pages has to go through the primary process. If the primary process is not active, no memory will be allocated even if it was theoretically possible to do so. This is because the primary process's memory map acts as the authority on what should or should not be mapped, while each secondary process has its own, local memory map. Secondary processes do not update the shared memory map; they only copy its contents to their local memory map.
3.4.4.3. Freeing Memory
To free an area of memory, the pointer to the start of the data area is passed to the free function. The size of the malloc_elem structure is subtracted from this pointer to get the element header for the block. If this header is of type PAD then the pad length is further subtracted from the pointer to get the proper element header for the entire block.
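The pointer arithmetic described above can be sketched as follows. This is a simplified illustration rather than the DPDK source: elem_hdr, elem_state and elem_from_data are hypothetical names, and the header carries only the fields needed for the lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum elem_state { ELEM_FREE, ELEM_BUSY, ELEM_PAD };

/* Simplified element header placed immediately before the data area. */
struct elem_hdr {
    enum elem_state state;
    size_t pad;   /* for ELEM_PAD: bytes back to the real header */
    size_t size;  /* total element size, unused in this sketch */
};

/* Recover the element header from the pointer handed to the free function. */
static struct elem_hdr *elem_from_data(void *data)
{
    if (data == NULL)
        return NULL;
    /* Step back over one header to reach the header preceding the data. */
    struct elem_hdr *e = (struct elem_hdr *)((uint8_t *)data - sizeof(*e));
    /* A PAD header is only a marker: step back again by the pad length. */
    if (e->state == ELEM_PAD)
        e = (struct elem_hdr *)((uint8_t *)e - e->pad);
    return e;
}
```

The second subtraction is taken only when the header just before the data is a pad header, so an unpadded block resolves in a single step.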
From this element header, we get pointers to the heap from which the block was allocated and to where it must be freed, as well as the pointers to the previous and next elements. These next and previous elements are then checked to see if they are also free and are immediately adjacent to the current one, and if so, they are merged with the current element. This means that we can never have two free memory blocks adjacent to one another, as they are always merged into a single block.
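A rough sketch of the merge step follows. It assumes a doubly linked, address-ordered list of elements; the names elem, adjacent, join and free_and_merge are hypothetical and the free-list bookkeeping is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct elem {
    uintptr_t start;            /* address of the element */
    size_t size;                /* size of the element in bytes */
    bool is_free;
    struct elem *prev, *next;   /* address-ordered neighbours */
};

/* True if hi begins exactly where lo ends. */
static bool adjacent(const struct elem *lo, const struct elem *hi)
{
    return lo->start + lo->size == hi->start;
}

/* Absorb hi into lo and unlink hi from the neighbour chain. */
static void join(struct elem *lo, struct elem *hi)
{
    lo->size += hi->size;
    lo->next = hi->next;
    if (hi->next != NULL)
        hi->next->prev = lo;
}

/* Mark e as free and merge it with free, adjacent neighbours.
 * Returns the element that now covers the merged region. */
static struct elem *free_and_merge(struct elem *e)
{
    e->is_free = true;
    if (e->next != NULL && e->next->is_free && adjacent(e, e->next))
        join(e, e->next);
    if (e->prev != NULL && e->prev->is_free && adjacent(e->prev, e)) {
        struct elem *p = e->prev;
        join(p, e);
        e = p;
    }
    return e;
}
```

Since every free operation merges in both directions, two adjacent free elements can never remain side by side in this model.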
If deallocating pages at runtime is supported, and the free element encloses one or more pages, those pages can be deallocated and be removed from the heap. If DPDK was started with command-line parameters for preallocating memory (-m or --socket-mem), then those pages that were allocated at startup will not be deallocated.
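Whether a free element "encloses one or more pages" comes down to rounding its start up and its end down to page boundaries. The helper below is a hypothetical illustration of that arithmetic (enclosed_pages is not a DPDK function); it does not model the exception for pages preallocated with -m or --socket-mem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the whole pages that lie entirely inside [start, start + len).
 * If at least one page is enclosed, its address is stored in *first. */
static size_t enclosed_pages(uintptr_t start, size_t len, size_t page_sz,
                             uintptr_t *first)
{
    assert(page_sz != 0 && (page_sz & (page_sz - 1)) == 0); /* power of two */
    uintptr_t mask = ~(uintptr_t)(page_sz - 1);
    uintptr_t lo = (start + page_sz - 1) & mask; /* round start up */
    uintptr_t hi = (start + len) & mask;         /* round end down */
    if (hi <= lo)
        return 0;   /* the element does not cover a full page */
    *first = lo;
    return (hi - lo) / page_sz;
}
```

Only the pages counted here could be handed back to the system; the partial pages at either end of the element have to stay in the heap.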
Any successful deallocation event will trigger a callback, for which user applications and other DPDK subsystems can register.