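This post documents my work with the PYNQ development board; in practice it is an application of the AXI_CDMA core on the ZYNQ PL side. The experiment follows the official tutorial step by step, but because my hardware resource layout differs slightly from the official tutorial, I also modified the SDK source code accordingly.
AXI_CDMA features:
On ZYNQ 7-series chips (this probably applies to other Xilinx devices as well), the Xilinx AXI_CDMA core can be driven in two transfer modes: polling (poll) and interrupt (intr). The two modes differ in how the ARM core learns the status of a DMA transfer. In polling mode, the ARM core has to query whether the DMA is busy. In interrupt mode, the DMA raises an interrupt when the transfer completes, and a corresponding interrupt handler is needed to record the transfer status.
axi_cdma also has two kinds of transfer: Simple DMA transfer and Scatter gather (SG) DMA transfer. A simple transfer suits light workloads, since only one transfer can be submitted at a time. A scatter-gather transfer lets the application queue several transfers at once, which the hardware then works through without software intervention; each queued transfer must be described by an object of a specific data type, the buffer descriptor (BD).
Taken together, AXI_CDMA has at least four usage scenarios: Simple DMA transfer_poll, Simple DMA transfer_intr, SG DMA transfer_poll, and SG DMA transfer_intr.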
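To make the difference between polling and interrupt notification concrete, here is a minimal host-side sketch. It does not use the Xilinx driver at all: `FakeDma`, `fake_dma_start`, `fake_dma_tick`, `wait_by_polling`, and `mark_done` are hypothetical names that model an engine which finishes after a fixed number of simulated ticks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DMA engine; not part of the Xilinx driver. */
typedef void (*DoneCallback)(void *ref);

typedef struct {
    int ticks_left;        /* simulated time until the transfer completes */
    bool busy;             /* what a status-register read would report */
    DoneCallback on_done;  /* NULL in polling mode */
    void *cb_ref;          /* passed back to the callback unchanged */
} FakeDma;

static void fake_dma_start(FakeDma *dma, int ticks, DoneCallback cb, void *ref)
{
    assert(ticks >= 0);
    dma->ticks_left = ticks;
    dma->busy = (ticks > 0);
    dma->on_done = cb;
    dma->cb_ref = ref;
}

/* One step of simulated hardware progress. */
static void fake_dma_tick(FakeDma *dma)
{
    if (!dma->busy) {
        return;
    }
    if (--dma->ticks_left == 0) {
        dma->busy = false;
        if (dma->on_done != NULL) {
            dma->on_done(dma->cb_ref);  /* plays the role of the interrupt */
        }
    }
}

/* Polling mode: the CPU keeps asking "are you busy?".
 * Returns how many status reads were needed. */
static int wait_by_polling(FakeDma *dma)
{
    int status_reads = 0;
    while (dma->busy) {
        status_reads++;
        fake_dma_tick(dma);
    }
    return status_reads;
}

/* Interrupt mode: the handler only records completion in a flag. */
static void mark_done(void *ref)
{
    *(volatile bool *)ref = true;
}
```

In polling mode the CPU spends the whole transfer inside `wait_by_polling`; in interrupt mode it is free to do other work until the flag written by `mark_done` becomes true.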
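Below is the AXI_CDMA driver documentation. If the SDK is installed, a link to it can be found directly in the system.mss file of the board support package (BSP):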
axicdma_v4_5 Documentation
This is the driver API for the AXI CDMA engine. For a full description of the features of the AXI CDMA engine, please refer to the hardware specification. This driver supports the following features:
Simple DMA transfer
Scatter gather (SG) DMA transfer
Interrupt for error or completion of transfers
For SG DMA transfer:
Programmable interrupt coalescing
Programmable delay timer counter
Managing the buffer descriptors (BDs)
Two Hardware Building Modes
The hardware can be built in two modes:
Simple only mode, in this mode, only simple transfers are supported by the hardware. The functionality is similar to the XPS Central DMA, however, the driver API to do the transfer is slightly different.
Hybrid mode, in this mode, the hardware supports both the simple transfer and the SG transfer. However, only one kind of transfer can be active at a time. If an SG transfer is ongoing in the hardware, a submission of a simple transfer fails. If a simple transfer is ongoing in the hardware, a submission of an SG transfer is successful, however the SG transfer will not start until the simple transfer is done.
Transactions
The hardware supports two types of transfers, the simple DMA transfer and the scatter gather (SG) DMA transfer.
A simple DMA transfer only needs source buffer address, destination buffer address and transfer length to do a DMA transfer. Only one transfer can be submitted to the hardware at a time.
A SG DMA transfer requires setting up a buffer descriptor (BD), which keeps the transfer information, including source buffer address, destination buffer address, and transfer length. The hardware updates the BD for the completion status of the transfer. BDs that are connected to each other can be submitted to the hardware at once, therefore, the SG DMA transfer has better performance when the application is doing multiple transfers each time.
Callback Function
Each transfer whose completion the application cares about should provide the driver with a callback function. The signature of the callback function is as follows:
void XAxiCdma_CallBackFn(void *CallBackRef, u32 IrqMask, int *NumPtr);
Here CallBackRef is a reference pointer that the application passes to the driver along with the callback function. The driver passes IrqMask to the application when it calls this callback. NumPtr is only used in SG mode, to track how many BDs are still left for this callback function.
The callback function is set upon transfer submission:
Simple transfer callback function setup:
Only set the callback function if in interrupt mode.
For simple transfers, the callback function along with the callback reference pointer is passed to the driver through the submission of the simple transfer:
XAxiCdma_SimpleTransfer(...)
SG transfer callback function setup: For SG transfers, the callback function and the callback reference pointer are set through the transfer submission call:
XAxiCdma_BdRingToHw(...)
Simple Transfers
For an application that only does one DMA transfer at a time, and the DMA engine is exclusively used by this application, simple DMA transfer is sufficient.
Using the simple DMA transfer has the advantage of ease of use compared to SG DMA transfer. For an individual DMA transfer, simple DMA transfer is also faster because of simplicity in software and hardware.
Scatter Gather (SG) Transfers
For an application that has multiple DMA transfers sometimes, or the DMA engine is shared by multiple applications, using SG DMA transfer yields better performance over all applications.
The SG DMA transfer provides queuing of multiple transfers, therefore, it provides better performance because the hardware can continuously work on all submitted transfers without software intervention.
The down side of using the SG DMA transfer is that you have to manage the memory for the buffer descriptors (BD), and setup BDs for the transfers.
Interrupts
The driver handles the interrupts.
The completion of a transfer that has a callback function associated with it will trigger the driver to call the callback function. The IrqMask that is passed through the callback function notifies the application about the completion status of the transfer.
Interrupt Coalescing for SG Transfers
For SG transfers, the application can program the interrupt coalescing threshold to reduce the frequency of interrupts. If the number of transfers does not match well with the interrupt coalescing threshold, the completion of the last transfer will not trigger the completion interrupt. However, after the specified delay count time, the delay interrupt will fire.
By default, the interrupt threshold for the hardware is one, which is one interrupt per BD completion.
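The interaction between the number of submitted BDs and the coalescing threshold is simple arithmetic, sketched below. This is a simplified model under my own assumptions (every BD completes without error and the threshold counter reloads after each interrupt); `CoalesceResult` and `coalesce` are hypothetical names, not driver API.

```c
#include <assert.h>

typedef struct {
    unsigned completion_irqs;  /* interrupts raised by the threshold counter */
    unsigned leftover_bds;     /* completed BDs that never reached the threshold */
} CoalesceResult;

/* Split num_bds completions into full threshold groups plus a remainder.
 * A non-zero remainder is the case where only the delay interrupt
 * would report the last completions. */
static CoalesceResult coalesce(unsigned num_bds, unsigned threshold)
{
    CoalesceResult r;

    assert(threshold >= 1);
    r.completion_irqs = num_bds / threshold;
    r.leftover_bds = num_bds % threshold;
    return r;
}
```

For example, 10 BDs with a threshold of 4 give two completion interrupts and two leftover BDs, while the default threshold of 1 never leaves a remainder.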
Delay Interrupt for SG Transfers
Delay interrupt is to signal the application about inactivity of transfers. If the delay interrupt is enabled, the delay timer starts counting down once a transfer has started. If the interval between transfers is longer than the delay counter, the delay interrupt is fired.
By default, the delay counter is zero, which means the delay interrupt is disabled. To enable delay interrupt, the delay interrupt enable bit must be set and the delay counter must be set to a value between 1 to 255.
BD management for SG DMA Transfers
BD is shared by the software and the hardware. To use BD for SG DMA transfers, the application needs to use the driver API to do the following:
Setup the BD ring:
XAxiCdma_BdRingCreate(...)
Note that the memory for the BD ring is allocated and is later de-allocated by the application.
Request BD from the BD ring, more than one BDs can be requested at once:
XAxiCdma_BdRingAlloc(...)
Prepare BDs for the transfer, one BD at a time:
XAxiCdma_BdSetSrcBufAddr(...)
XAxiCdma_BdSetDstBufAddr(...)
XAxiCdma_BdSetLength(...)
Submit all prepared BDs to the hardware:
XAxiCdma_BdRingToHw(...)
Upon transfer completion, the application can request completed BDs from the hardware:
XAxiCdma_BdRingFromHw(...)
After the application has finished using the BDs, it should free the BDs back to the free pool:
XAxiCdma_BdRingFree(...)
The driver also provides API functions to get the status of a completed BD, along with get functions for other fields in the BD.
The following two diagrams show the correct flow of BDs:
The first diagram shows a complete cycle for BDs, starting from requesting the BDs to freeing the BDs.
         XAxiCdma_BdRingAlloc()                  XAxiCdma_BdRingToHw()
 Free ------------------------> Pre-process ----------------------> Hardware
                                                                       |
  /|\                                                                  |
   |   XAxiCdma_BdRingFree()                  XAxiCdma_BdRingFromHw()  |
   +--------------------------- Post-process <------------------------+
The second diagram shows when a DMA transfer is to be cancelled before enqueuing to the hardware, application can return the requested BDs to the free group using XAxiCdma_BdRingUnAlloc().
        XAxiCdma_BdRingUnAlloc()
 Free <----------------------- Pre-process
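The two diagrams can be read as a small state machine in which BDs only ever move between neighbouring groups. The sketch below models that bookkeeping with plain counters. It is a hypothetical host-side model (`BdRingModel` and the `ring_*` helpers are my own names), not the driver's data structure, and it ignores the ordering of BDs inside the ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping model of the four BD groups. */
typedef struct {
    int free_cnt;  /* Free: available for allocation */
    int pre_cnt;   /* Pre-process: allocated, being filled in by software */
    int hw_cnt;    /* Hardware: submitted to the engine */
    int post_cnt;  /* Post-process: completed, being examined by software */
} BdRingModel;

static void ring_init(BdRingModel *r, int total)
{
    assert(total > 0);
    r->free_cnt = total;
    r->pre_cnt = 0;
    r->hw_cnt = 0;
    r->post_cnt = 0;
}

/* Move n BDs between two groups; fails without side effects if n is invalid. */
static bool move_bds(int *from, int *to, int n)
{
    if (n <= 0 || n > *from) {
        return false;
    }
    *from -= n;
    *to += n;
    return true;
}

/* Each helper corresponds to one arrow in the diagrams. */
static bool ring_alloc(BdRingModel *r, int n)   { return move_bds(&r->free_cnt, &r->pre_cnt, n); }
static bool ring_unalloc(BdRingModel *r, int n) { return move_bds(&r->pre_cnt, &r->free_cnt, n); }
static bool ring_to_hw(BdRingModel *r, int n)   { return move_bds(&r->pre_cnt, &r->hw_cnt, n); }
static bool ring_from_hw(BdRingModel *r, int n) { return move_bds(&r->hw_cnt, &r->post_cnt, n); }
static bool ring_free(BdRingModel *r, int n)    { return move_bds(&r->post_cnt, &r->free_cnt, n); }
```

The point of the model is that a BD cannot skip a stage: for instance, `ring_free` fails until `ring_from_hw` has moved something into the post-process group.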
Physical/Virtual Addresses
Addresses for the transfer buffers are physical addresses.
For SG transfers, the next BD pointer in a BD is also a physical address.
However, application's reference to a BD and to the transfer buffers are through virtual addresses.
The application is responsible to translate the virtual addresses of its transfer buffers to physical addresses before handing them to the driver.
For systems where MMU is not used, or MMU is a direct mapping, then the physical address and the virtual address are the same.
Cache Coherency
To prevent cache and memory inconsistency:
Flush the transmit buffer range before the transfer
Invalidate the receive buffer range before passing it to the hardware and before passing it to the application
For SG transfers:
Flush the BDs once the preparation setup is done
Invalidate the memory region for BDs when BDs are retrieved from the hardware.
BD alignment for SG Transfers
The hardware has requirement for the minimum alignment of the BDs, XAXICDMA_BD_MINIMUM_ALIGNMENT. It is OK to have an alignment larger than the required minimum alignment, however, it must be multiple of the minimum alignment. The alignment is passed into XAxiCdma_BdRingCreate().
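A quick way to sanity-check an alignment value before creating the ring is sketched below. The constant `BD_MIN_ALIGNMENT` is an assumed value chosen for illustration; the real one should be taken from XAXICDMA_BD_MINIMUM_ALIGNMENT in the driver header. The helper names are hypothetical, and the check that the ring base address itself honours the alignment is my assumption about what a caller would want, not something the text above states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration only; use the driver's constant in real code. */
#define BD_MIN_ALIGNMENT 0x40u

/* The alignment must be at least the minimum and a multiple of it. */
static bool bd_alignment_ok(uint32_t alignment)
{
    return alignment >= BD_MIN_ALIGNMENT &&
           (alignment % BD_MIN_ALIGNMENT) == 0;
}

/* Assumption: the memory handed to the ring should start on that alignment. */
static bool bd_ring_base_ok(uintptr_t base, uint32_t alignment)
{
    assert(alignment != 0);
    return bd_alignment_ok(alignment) && (base % alignment) == 0;
}
```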
Error Handling
The hardware halts upon all error conditions. The driver will reset the hardware once the error occurs.
The IrqMask argument in the callback function notifies the application about error conditions for the transfer.
Mutual Exclusion
The driver does not provide mutual exclusion mechanisms, it is up to the upper layer to handle this.
Hardware Defaults & Exclusive Use
The hardware is in the following condition on start or after a reset:
All interrupts are disabled.
The engine is in simple mode.
Interrupt coalescing counter is one.
Delay counter is 0.
The driver has exclusive use of the hardware registers and BDs. Accessing the hardware registers or the BDs should always go through the driver API functions.
Hardware Features That User Should Be Aware of
For performance reasons, the driver does not check the submission of transfers during run time. It is the user's responsibility to submit appropriate transfers to the hardware. The following hardware features should be considered when submitting a transfer:
. Whether the hardware supports unaligned transfers, reflected through C_INCLUDE_DRE in system.mhs file. Submitting unaligned transfers while the hardware does not support it, causes errors upon transfer submission. Aligned transfer is in respect to word length, and word length is defined through the building parameter XPAR_AXI_CDMA_0_M_AXI_DATA_WIDTH.
. Memory range of the transfer addresses. Transfer data to executable memory can crash the system.
. Lite mode. To save hardware resources (drastically), you may select "lite" mode build of the hardware. However, with lite mode, the following features are not supported:
Cross page boundary transfer. Each transfer must lie strictly inside one page; otherwise, a slave error occurs.
Unaligned transfer.
Data width larger than 64 bit
Maximum transfer length each time is limited to data_width * burst_len
MODIFICATION HISTORY:
. Updated the debug print on type casting to avoid warnings on u32. Cast
  u32 to (unsigned int) to use the %x format.
Ver    Who  Date      Changes
1.00a  jz   07/08/10  First release
2.01a  rkv  01/25/11  Added TCL script to generate Test App code for peripheral
                      tests.
                      Replaced with "\r\n" in place on "\n\r" in printf
                      statements. Made some minor modifications for Doxygen
2.02a  srt  01/18/13  Added support for Key Hole feature (CR: 687217).
                      Updated DDR base address for IPI designs (CR 703656).
2.03a  srt  04/13/13  Removed Warnings (CR 705006).
                      Added logic to check if DDR is present in the test app
                      tcl file. (CR 700806)
3.0    adk  19/12/13  Updated as per the New Tcl API's
4.0    adk  27/07/15  Added support for 64-bit Addressing.
4.1    sk   11/10/15  Used UINTPTR instead of u32 for Baseaddress CR# 867425.
                      Changed the prototype of XAxiCdma_CfgInitialize API.
4.3    mi   09/21/16  Fixed compilation warnings.
       ms   01/22/17  Modified xil_printf statement in main function for all
                      examples to ensure that "Successfully ran" and "Failed" strings
                      are available in all examples. This is a fix for CR-965028.
       ms   03/17/17  Added readme.txt file in examples folder for doxygen
                      generation.
       ms   04/05/17  Modified Comment lines in functions of axicdma
                      examples to recognize it as documentation block
                      for doxygen generation of examples.
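Why data alignment matters in DMA applications:
First, a piece of the driver source code (the printf calls are debug traces I added):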
/*****************************************************************************/
/**
* This function does one simple transfer submission
*
* It checks in the following sequence:
* - if engine is busy, cannot submit
* - if software is still handling the completion of the previous simple
* transfer, cannot submit
* - if engine is in SG mode and cannot switch to simple mode, cannot submit
*
* @param InstancePtr is the pointer to the driver instance
* @param SrcAddr is the address of the source buffer
* @param DstAddr is the address of the destination buffer
* @param Length is the length of the transfer
* @param SimpleCallBack is the callback function for the simple transfer
* @param CallBackRef is the callback reference pointer
*
* @return
* - XST_SUCCESS for success of submission
* - XST_FAILURE for submission failure, maybe caused by:
* Another simple transfer is still going
* . Another SG transfer is still going
* - XST_INVALID_PARAM if:
* Length out of valid range [1:8M]
* Or, address not aligned when DRE is not built in
*
* @note Only set the callback function if using interrupt to signal
* the completion.If used in polling mode, please set the callback
* function to be NULL.
*
*****************************************************************************/
u32 XAxiCdma_SimpleTransfer(XAxiCdma *InstancePtr, UINTPTR SrcAddr, UINTPTR DstAddr,
    int Length, XAxiCdma_CallBackFn SimpleCallBack, void *CallBackRef)
{
    u32 WordBits;

    printf("***********");
    if ((Length < 1) || (Length > XAXICDMA_MAX_TRANSFER_LEN)) {
        return XST_INVALID_PARAM;
    }

    WordBits = (u32)(InstancePtr->WordLength - 1);

    /* The condition below is true when the source or destination address
     * was not allocated on a data-aligned boundary. */
    printf("wordbits %x ;srcadd %x ; dstadd %x", (unsigned int)WordBits,
        (unsigned int)SrcAddr, (unsigned int)DstAddr);
    if ((SrcAddr & WordBits) || (DstAddr & WordBits)) {
        printf("***********1");
        if (!InstancePtr->HasDRE) {
            printf("***********2");
            xdbg_printf(XDBG_DEBUG_ERROR,
                "Unaligned transfer without DRE %x/%x\r\n",
                (unsigned int)SrcAddr, (unsigned int)DstAddr);
            printf("***********3");
            return XST_INVALID_PARAM;
        }
    }
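The function continues a little further, where it programs the registers to start the transfer. The part shown here mainly validates the source and destination addresses, checking whether they satisfy the alignment requirement. In this project I tested data movement for DDR->DDR, BlockRam->DDR, and DDR->BlockRam; the base addresses are as follows: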
#define PROCESSOR_BRAM_MEMORY 0x40000000 // BRAM Port A mapped through 1st BRAM Controller accessed by CPU
#define CDMA_BRAM_MEMORY 0xC0000000 // BRAM Port B mapped through 2nd BRAM Controller accessed by CDMA
#define DDR_MEMORY 0x01000000
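As you can see, the CDMA and the PS access the BlockRam at different base addresses. The reason is that the BlockRam is a dual-port RAM: the PS and the CDMA reach it through two different ports, and the same offset on either port refers to the same storage cell. This project transfers data with a 64-bit width, i.e. 8 bytes, so to keep the data aligned the low 3 bits of base address + offset must all be 0.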
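Both points can be checked with a few lines of C. The sketch below reuses the two BRAM base addresses from the defines above; `is_aligned` and `cpu_to_cdma_bram` are hypothetical helper names of my own, and the BRAM size is not checked because it is not given here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PROCESSOR_BRAM_MEMORY 0x40000000u  /* BRAM port A, as seen by the CPU */
#define CDMA_BRAM_MEMORY      0xC0000000u  /* BRAM port B, as seen by the CDMA */

/* Same test as in the driver excerpt: any set bit below the word size
 * means the address is unaligned. word_bytes must be a power of two. */
static bool is_aligned(uint32_t addr, uint32_t word_bytes)
{
    assert(word_bytes != 0 && (word_bytes & (word_bytes - 1)) == 0);
    return (addr & (word_bytes - 1)) == 0;
}

/* Translate a CPU-side BRAM address into the address the CDMA must use
 * to reach the same storage cell (same offset, different base). */
static uint32_t cpu_to_cdma_bram(uint32_t cpu_addr)
{
    assert(cpu_addr >= PROCESSOR_BRAM_MEMORY);
    return CDMA_BRAM_MEMORY + (cpu_addr - PROCESSOR_BRAM_MEMORY);
}
```

With an 8-byte data width, 0x01000000 passes the check while 0x01000004 does not, and the CPU-side address 0x40000010 corresponds to 0xC0000010 on the CDMA side.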
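What the test program does:
1) Initializes all devices
2) Menu program: interacts with the user
3) Compares the efficiency of moving data directly with the processor against moving it with the DMA (measured by timer counts)
Part of the source code, to show the execution flow: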
// Initialize src memory
for (i=0; i
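Test results:
When transferring only 1 word (4 bytes on ZYNQ), the processor copy (913 clock cycles) is faster than a polled-mode simple DMA transfer (28300 cycles) but slower than an interrupt-mode simple DMA transfer (253 cycles):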
-- Simple DMA Design Example --
Above message printing took 4205 clock cycles
Central DMA Initialized
Setting up interrupt system
Enter number of words you want to transfer between 1 and 8192
1
Enter 1 for BRAM to DDR3 transfer
Enter 2 for DDR3 to BRAM transfer
Enter 3 for DDR3 to DDR3 transfer
Enter 4 to exit
3DDR to DDR transfer
Moving 4 bytes through processor took 913 clock cycles
Starting transfer through DMA in poll mode
Moving 4 bytes through DMA in poll mode took 28300 clock cycles
Transfer complete
Transfered data verified
Improvement using Polled DMA -2999 %
Moving 4 bytes through DMA in Interrupt mode took 253 clock cycles
Transfer complete
Transfered data verified
Improvement using Interrupt DMA 72 %
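When transferring 1000 words (4000 bytes on ZYNQ), the processor copy (653786 clock cycles) is clearly slower than the DMA, and the DMA in interrupt mode (4055 cycles) is more efficient than in polled mode (30751 cycles):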
-- Simple DMA Design Example --
Above message printing took 4215 clock cycles
Central DMA Initialized
Setting up interrupt system
Enter number of words you want to transfer between 1 and 8192
1000
Enter 1 for BRAM to DDR3 transfer
Enter 2 for DDR3 to BRAM transfer
Enter 3 for DDR3 to DDR3 transfer
Enter 4 to exit
3DDR to DDR transfer
Moving 4000 bytes through processor took 653786 clock cycles
Starting transfer through DMA in poll mode
Moving 4000 bytes through DMA in poll mode took 30751 clock cycles
Transfer complete
Transfered data verified
Improvement using Polled DMA 95 %
Moving 4000 bytes through DMA in Interrupt mode took 4055 clock cycles
Transfer complete
Transfered data verified
Improvement using Interrupt DMA 99 %
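The "Improvement" percentages in both logs are consistent with a plain integer calculation on the cycle counts. The example program's actual formula is not shown in this post, so the function below is only a reconstruction that reproduces the four logged values; `improvement_percent` is a hypothetical name.

```c
#include <assert.h>

/* Relative saving of the DMA over the processor copy, in percent.
 * Integer division truncates toward zero, which matches the logged values. */
static long long improvement_percent(long long cpu_cycles, long long dma_cycles)
{
    assert(cpu_cycles > 0);
    return (cpu_cycles - dma_cycles) * 100 / cpu_cycles;
}
```

A negative result, as in the 1-word polled case, simply means the DMA took more cycles than the processor copy.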