Deep Dive into Contiguous Memory Allocator

This is the first part of an extended version of an LWN article on CMA. It contains much more detail on how to use CMA, and a lot of boring code samples. Should you be more interested in an overview, consider reading the original instead.

The Contiguous Memory Allocator (or CMA) has been developed to allow big physically contiguous memory allocations. By initialising early at boot time and with some fairly intrusive changes to Linux memory management, it is able to allocate big memory chunks without a need to grab memory for exclusive use.

Simple in principle, it grew to be a quite complicated system which requires cooperation between the boot-time allocator, the buddy system, the DMA subsystem, and some architecture-specific code. Still, all that complexity is usually hidden away and normal users won't be exposed to it. CMA looks slightly different from each perspective, and each perspective comes with different things to do and to look for.

Using CMA in device drivers

From a device driver author's point of view, nothing should change. CMA is integrated with the DMA subsystem, so the usual calls to the DMA API (such as dma_alloc_coherent) should work as usual.

In fact, device drivers should never need to call the CMA API directly. Most importantly, device drivers operate on kernel mappings and bus addresses whereas CMA operates on pages and PFNs. Furthermore, CMA does not handle cache coherency, which the DMA API was designed to deal with. Lastly, the DMA API is more flexible: it allows allocations in atomic contexts (e.g. in an interrupt handler) and the creation of memory pools, which are well suited for small allocations.

For a quick example, this is how an allocation might look:

dma_addr_t dma_addr;
void *virt_addr =
	dma_alloc_coherent(dev, 100 << 20, &dma_addr, GFP_KERNEL);
if (!virt_addr)
	return -ENOMEM;

Provided that dev is a pointer to a valid struct device, the above code will allocate 100 MiB of memory. It may or may not be CMA memory, but it is a portable way to get buffers. The following can be used to free it:

dma_free_coherent(dev, 100 << 20, virt_addr, dma_addr);

Barry Song has posted a very simple test driver which uses those two calls to allocate DMA memory (its full source is included at the end of this article).

More information about the DMA API can be found in Documentation/DMA-API.txt and Documentation/DMA-API-HOWTO.txt. Those two documents describe the provided functions and give usage examples.

Integration with architecture code

Obviously, CMA has to be integrated with a given architecture's DMA subsystem beforehand. This is performed in a few, fairly easy steps. The CMA patchset integrates it with the x86 and ARM architectures. This section will refer to both patches as well as quote their relevant portions.

Reserving memory

CMA works by reserving memory early at boot time. This memory, called a CMA area or CMA context, is later returned to the buddy system so it can be used by regular applications. To make the reservation happen, one needs to call:

void dma_contiguous_reserve(
	phys_addr_t limit);

just after memblock is initialised but prior to the buddy allocator setup.

The limit argument, if not zero, specifies the physical address above which no memory will be prepared for CMA. The intention is to allow limiting CMA contexts to addresses that DMA can handle. The only real constraint that CMA imposes is that the reserved memory must belong to a single zone.

In the case of ARM, the limit is set to arm_dma_limit or arm_lowmem_limit, whichever is smaller:

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
@@ -364,6 +373,12 @@ void __init arm_memblock_init(struct meminfo *mi, struct machine_desc *mdesc)
    if (mdesc->reserve)
            mdesc->reserve();

+	/*
+	 * reserve memory for DMA contiguous allocations,
+	 * must come from DMA area inside low memory
+	 */
+	dma_contiguous_reserve(min(arm_dma_limit, arm_lowmem_limit));
+
 	arm_memblock_steal_permitted = false;
 	memblock_allow_resize();
 	memblock_dump_all();

On x86 it is called just after memblock is set up, in the setup_arch function, with no limit specified:

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
@@ -934,6 +935,7 @@ void __init setup_arch(char **cmdline_p)
 	}
 #endif
 	memblock.current_limit = get_max_mapped();
+	dma_contiguous_reserve(0);

 	/*
 	 * NOTE: On x86-32, only from this point on, fixmaps are ready for use.

The amount of reserved memory depends on a few Kconfig options and a cma kernel parameter, which will be described later on.

Architecture specific memory preparations

The dma_contiguous_reserve function will reserve memory and prepare it to be used with CMA. On some architectures, architecture-specific work may need to be performed as well. To allow that, CMA will call the following function:

void dma_contiguous_early_fixup(
	phys_addr_t base,
	unsigned long size);

It is the architecture's responsibility to provide it along with its declaration in the asm/dma-contiguous.h header file. The function will be called quite early, so some of the kernel subsystems – like kmalloc – will not be available. Furthermore, it may be called several times, but no more than MAX_CMA_AREAS times.

If an architecture does not need any special handling, the header file may just say:

#ifndef H_ARCH_ASM_DMA_CONTIGUOUS_H
#define H_ARCH_ASM_DMA_CONTIGUOUS_H
#ifdef __KERNEL__

#include <linux/types.h>
#include <asm-generic/dma-contiguous.h>

static inline void
dma_contiguous_early_fixup(phys_addr_t base, unsigned long size)
{ /* nop */ }

#endif
#endif

ARM requires some work modifying mappings, so it provides a full definition of this function:

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
[…]
+static struct dma_contig_early_reserve dma_mmu_remap[MAX_CMA_AREAS] __initdata;
+
+static int dma_mmu_remap_num __initdata;
+
+void __init dma_contiguous_early_fixup(phys_addr_t base, unsigned long size)
+{
+	dma_mmu_remap[dma_mmu_remap_num].base = base;
+	dma_mmu_remap[dma_mmu_remap_num].size = size;
+	dma_mmu_remap_num++;
+}
+
+void __init dma_contiguous_remap(void)
+{
+	int i;
+	for (i = 0; i < dma_mmu_remap_num; i++) {
		[…]
+	}
+}
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
@@ -1114,11 +1122,12 @@ void __init paging_init(struct machine_desc *mdesc)
 {
 	void *zero_page;

-	memblock_set_current_limit(lowmem_limit);
+	memblock_set_current_limit(arm_lowmem_limit);

 	build_mem_type_table();
 	prepare_page_table();
 	map_lowmem();
+	dma_contiguous_remap();
 	devicemaps_init(mdesc);
 	kmap_init();

DMA subsystem integration

The second thing to do is to change the architecture's DMA API to use the whole machinery. To allocate memory from CMA, one uses:

struct page *dma_alloc_from_contiguous(
	struct device *dev,
	int count,
	unsigned int align);

Its first argument is the device the allocation is performed on behalf of. The second one specifies the number of pages (not bytes or order) to allocate.

The third argument is the alignment expressed as a page order. It enables allocation of buffers which are aligned to at least 2^align pages. To avoid fragmentation, if at all possible, pass zero here. It is worth noting that there is a Kconfig option (CONFIG_CMA_ALIGNMENT) which specifies the maximal alignment accepted by the function. By default, its value is 8, meaning an alignment of 256 pages.

The return value is a pointer to the first page of a sequence of count allocated pages.
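Since the x86 diff below elides the computation of count, here is a minimal sketch of a hypothetical helper (not part of the patchset) deriving both arguments from a byte size:

#include <linux/dma-contiguous.h>
#include <linux/mm.h>

/*
 * Hypothetical helper: allocate "size" bytes from CMA on behalf of "dev".
 * count is the buffer size in pages; align is a page order, so the buffer
 * will be aligned to at least 2^align pages (the function itself caps the
 * value at CONFIG_CMA_ALIGNMENT).
 */
static struct page *cma_alloc_bytes(struct device *dev, size_t size)
{
	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
	unsigned int align = get_order(size);

	return dma_alloc_from_contiguous(dev, count, align);
}

Passing zero as align would be friendlier to the allocator, as noted above; get_order(size) merely mirrors what the x86 code below does.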

Here's how allocation looks on x86:

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
@@ -99,14 +99,18 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
 				 dma_addr_t *dma_addr, gfp_t flag)
 {
	[…]
 again:
-	page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));
+	if (!(flag & GFP_ATOMIC))
+		page = dma_alloc_from_contiguous(dev, count, get_order(size));
+	if (!page)
+		page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));
 	if (!page)
 		return NULL;

To free the allocated buffer, one needs to call:

bool dma_release_from_contiguous(
	struct device *dev,
	struct page *pages,
	int count);

The dev and count arguments are the same as before, whereas pages is what dma_alloc_from_contiguous has returned.

If the region passed to the function did not come from CMA, the function will return false. Otherwise, it will return true. This removes the need for higher-level functions to track which allocations were made with CMA and which were made using some other method.

Again, here's how it is used on x86:

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
@@ -126,6 +130,16 @@ again:
 	return page_address(page);
 }

+void dma_generic_free_coherent(struct device *dev, size_t size, void *vaddr,
+			       dma_addr_t dma_addr)
+{
+	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	struct page *page = virt_to_page(vaddr);
+
+	if (!dma_release_from_contiguous(dev, page, count))
+		free_pages((unsigned long)vaddr, get_order(size));
+}
+
 /*
  * See <Documentation/x86/x86_64/boot-options.txt> for the iommu kernel
  * parameter documentation.

Atomic allocations

Beware that dma_alloc_from_contiguous may not be called from atomic context (e.g. while a spin lock is held or in an interrupt handler). It performs some “heavy” operations such as page migration, direct reclaim, etc., which may take a while. Because of that, to make dma_alloc_coherent and co. work as advertised, the architecture needs to have a different method of allocating memory in atomic context.

The simplest solution is to put aside a bit of memory at boot time and perform atomic allocations from it. This is in fact what ARM is doing. Existing architectures most likely have a special path for atomic allocations already.
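For illustration only, here is a sketch of how such a boot-time pool might be built on top of the kernel's genalloc API. The pool size and the function names are made up for this example, and ARM's real implementation differs in important details (it remaps its pool as uncached and wires it into the DMA ops):

#include <linux/dma-mapping.h>
#include <linux/genalloc.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/sizes.h>

#define ATOMIC_POOL_SIZE SZ_256K	/* hypothetical pool size */

static struct gen_pool *atomic_pool;

/* Runs early, in process context, where sleeping allocations are fine. */
static int __init atomic_pool_init(void)
{
	struct page *page;

	atomic_pool = gen_pool_create(PAGE_SHIFT, -1);
	if (!atomic_pool)
		return -ENOMEM;

	page = alloc_pages(GFP_KERNEL, get_order(ATOMIC_POOL_SIZE));
	if (!page)
		return -ENOMEM;

	return gen_pool_add_virt(atomic_pool,
				 (unsigned long)page_address(page),
				 page_to_phys(page), ATOMIC_POOL_SIZE, -1);
}
postcore_initcall(atomic_pool_init);

/* The DMA backend can then serve GFP_ATOMIC requests from the pool. */
static void *atomic_pool_alloc(size_t size, dma_addr_t *dma)
{
	unsigned long vaddr = gen_pool_alloc(atomic_pool, size);

	if (!vaddr)
		return NULL;
	*dma = gen_pool_virt_to_phys(atomic_pool, vaddr);
	return (void *)vaddr;
}

A matching free path would call gen_pool_free; the point is merely that allocations from a pre-populated gen_pool do not sleep, so they are safe in atomic context.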

Special memory requirements

At this point, most of the drivers should “just work”. They use the DMA API, which calls into CMA. Life is beautiful. Except that some devices may have special memory requirements. For instance, Samsung's Multi-format codec (MFC) requires different types of buffers to be located in different memory banks (which allows reading them through two memory channels, thus increasing memory bandwidth). Furthermore, one may want to separate some device's allocations from others so as to limit fragmentation within CMA areas.

As mentioned earlier, CMA operates on contexts describing a portion of system memory to allocate buffers from. One global area is created to be used by devices by default, but if a device needs to use a different area, that can easily be done.

There is a many-to-one mapping between struct device and struct cma (i.e. CMA context). This means that if a single device driver needs to use more than one CMA area, it has to have separate struct device objects. At the same time, several struct device objects may point to the same CMA context.

Assigning CMA area to a single device

To assign a CMA area to a device, all one needs to do is call:

int dma_declare_contiguous(
	struct device *dev,
	unsigned long size,
	phys_addr_t base,
	phys_addr_t limit);

As with dma_contiguous_reserve, this needs to be called after memblock initialises but before too much memory gets grabbed from it. For ARM platforms, a convenient place to invoke this function is the machine's reserve callback.

The first argument of the function is the device that the new context is to be assigned to. The second is its size in bytes (not in pages). The third is the physical address of the area or zero. The last one has the same meaning as the limit argument to dma_contiguous_reserve. The return value is either zero (on success) or a negative error code.

For an example, one can take a look at the code called from Samsung's S5P platform's reserve callback. It creates two CMA contexts for the MFC driver:

diff --git a/arch/arm/plat-s5p/dev-mfc.c b/arch/arm/plat-s5p/dev-mfc.c
@@ -22,52 +23,14 @@
 #include <plat/irqs.h>
 #include <plat/mfc.h>

[…]
 void __init s5p_mfc_reserve_mem(phys_addr_t rbase, unsigned int rsize,
 				phys_addr_t lbase, unsigned int lsize)
 {
	[…]
+	if (dma_declare_contiguous(&s5p_device_mfc_r.dev, rsize, rbase, 0))
+		printk(KERN_ERR "Failed to reserve memory for MFC device (%u bytes at 0x%08lx)\n",
+		       rsize, (unsigned long) rbase);
	[…]
+	if (dma_declare_contiguous(&s5p_device_mfc_l.dev, lsize, lbase, 0))
+		printk(KERN_ERR "Failed to reserve memory for MFC device (%u bytes at 0x%08lx)\n",
+		       lsize, (unsigned long) lbase);
 }

There is a limit to how many “private” areas can be declared, namely CONFIG_CMA_AREAS. Its default value is seven, but it can be safely increased if the need arises. If called more times than that, the dma_declare_contiguous function will print an error message and return -ENOSPC.

Assigning CMA area to multiple devices

Things get a bit more complicated if the same (non-default) CMA context needs to be used by two or more devices. The current API does not provide a trivial way to do that. What can be done is to use dev_get_cma_area to figure out the CMA area one device is using, and dev_set_cma_area to set the same context for another device. This sequence must be called no sooner than in postcore_initcall. Here is how it could look:

static int __init foo_set_up_cma_areas(void)
{
	struct cma *cma;

	cma = dev_get_cma_area(device1);
	dev_set_cma_area(device2, cma);
	return 0;
}
postcore_initcall(foo_set_up_cma_areas);

Of course, device1's area must be set up with dma_declare_contiguous as described in the previous subsection.

A device's CMA context may be changed at any time as long as the device holds no CMA memory; it would be rather tricky to release any allocation after the area changes.

No default context

As a matter of fact, there is nothing special about the default context created by the dma_contiguous_reserve function. It is in no way required, and the system may work without it.

If there is no default context, dma_alloc_from_contiguous will return NULL for devices without an assigned area. dev_get_cma_area can be used to distinguish this situation from an allocation failure.

Of course, if there is no default area, the architecture should provide other means to allocate memory for devices without an assigned CMA context.
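A minimal sketch of what such fallback logic might look like in an architecture's DMA backend (the helper name and the buddy-allocator fallback are illustrative, not taken from the patchset):

#include <linux/dma-contiguous.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *alloc_dma_pages(struct device *dev, size_t size, gfp_t gfp)
{
	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
	struct page *page = dma_alloc_from_contiguous(dev, count, 0);

	if (page)
		return page;

	if (!dev_get_cma_area(dev))
		/* No CMA context assigned: fall back to the buddy allocator. */
		return alloc_pages(gfp, get_order(size));

	/* A context exists, so this was a genuine allocation failure. */
	return NULL;
}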

Size of the default context

dma_contiguous_reserve does not take a size as an argument, which raises the question of how it knows how much memory should be reserved. This information comes from two sources.

First of all, there is a set of Kconfig options which specify the default size of the reservation. All of those options are located under “Device Drivers” » “Generic Driver Options” » “Contiguous Memory Allocator” in the Kconfig menu. They allow specifying one of four possible ways of calculating the size: it can be an absolute size in megabytes, a percentage of total memory, the lower of the two, or the larger of the two. The default is to reserve 16 MiB of memory.

Second of all, there is a cma kernel command line option. It lets one specify the size of the area at boot time without the need to recompile the kernel. This option specifies the size in bytes and accepts the usual suffixes; for instance, cma=64M reserves 64 MiB for the default context.


Barry Song's CMA test module, mentioned earlier, is as follows. Writing a size in KiB to /dev/cma_test allocates a coherent buffer, and reading from the file frees the oldest outstanding allocation:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/sizes.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>


/* One outstanding coherent allocation, linked on cma_allocations. */
struct cma_allocation {
	struct list_head list;
	size_t size;
	dma_addr_t dma;
	void *virt;
};

static struct device *cma_dev;
static LIST_HEAD(cma_allocations);
static DEFINE_SPINLOCK(cma_lock);

/*
 * Any read request will free the first allocated coherent buffer, e.g.:
 *   cat /dev/cma_test
 */
static ssize_t
cma_test_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
{
	struct cma_allocation *alloc = NULL;

	spin_lock(&cma_lock);
	if (!list_empty(&cma_allocations)) {
		alloc = list_first_entry(&cma_allocations,
					 struct cma_allocation, list);
		list_del(&alloc->list);
	}
	spin_unlock(&cma_lock);

	if (!alloc)
		return -EIDRM;

	dma_free_coherent(cma_dev, alloc->size, alloc->virt, alloc->dma);
	_dev_info(cma_dev, "free: CM virt: %p dma: %p size: %zuK\n",
		  alloc->virt, (void *)alloc->dma, alloc->size / SZ_1K);
	kfree(alloc);
	return 0;
}


/*
 * Any write request will allocate a new coherent buffer, e.g.:
 *   echo 1024 > /dev/cma_test
 * will request 1024 KiB from CMA.
 */
static ssize_t
cma_test_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
{
	struct cma_allocation *alloc;
	unsigned long size;
	int ret;

	ret = kstrtoul_from_user(buf, count, 0, &size);
	if (ret)
		return ret;

	if (!size)
		return -EINVAL;

	/* reject sizes that would overflow once converted to bytes */
	if (size > ~(size_t)0 / SZ_1K)
		return -EOVERFLOW;

	alloc = kmalloc(sizeof(*alloc), GFP_KERNEL);
	if (!alloc)
		return -ENOMEM;

	alloc->size = size * SZ_1K;
	alloc->virt = dma_alloc_coherent(cma_dev, alloc->size,
					 &alloc->dma, GFP_KERNEL);
	if (alloc->virt) {
		_dev_info(cma_dev, "alloc: virt: %p dma: %p size: %zuK\n",
			  alloc->virt, (void *)alloc->dma, alloc->size / SZ_1K);

		spin_lock(&cma_lock);
		list_add_tail(&alloc->list, &cma_allocations);
		spin_unlock(&cma_lock);

		return count;
	}

	dev_err(cma_dev, "no mem in CMA area\n");
	kfree(alloc);
	return -ENOSPC;
}

static const struct file_operations cma_test_fops = {
	.owner = THIS_MODULE,
	.read  = cma_test_read,
	.write = cma_test_write,
};

static struct miscdevice cma_test_misc = {
	.minor = MISC_DYNAMIC_MINOR,
	.name = "cma_test",
	.fops = &cma_test_fops,
};


static int __init cma_test_init(void)
{
	int ret = misc_register(&cma_test_misc);

	if (unlikely(ret)) {
		pr_err("failed to register cma test misc device!\n");
		return ret;
	}

	cma_dev = cma_test_misc.this_device;
	cma_dev->coherent_dma_mask = ~0;	/* no DMA addressing restriction */
	_dev_info(cma_dev, "registered.\n");
	return 0;
}

module_init(cma_test_init);

static void __exit cma_test_exit(void)
{
	misc_deregister(&cma_test_misc);
}

module_exit(cma_test_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Barry Song <Baohua.Song@csr.com>");
MODULE_DESCRIPTION("kernel module to help the test of CMA");
MODULE_ALIAS("cma_test");

http://mina86.com/2012/06/10/deep-dive-into-contiguous-memory-allocator-part-i/
