Peter's patch set works by assigning a "home node" to each process, then trying to concentrate the process and its memory on that node. Andrea's patch lacks the concept of home nodes; he thinks it is an idea that will not work well for programs that don't fit into a single node unless developers add code to use Peter's new system calls. Instead, Andrea would like NUMA scheduling to "just work" in the same way that transparent huge pages do. So his patch set seems to assume that resources will be spread out across the system; it then focuses on cleaning things up afterward. The key to the cleanup task is a bunch of statistics and a couple of new kernel threads.
The first of these threads is called knuma_scand. Its primary job is to scan through each process's address space, marking its in-RAM anonymous pages with a special set of bits that makes the pages look, to the hardware, like they are not present. If the process tries to access such a page, a page fault will result; the kernel will respond by marking the page "present" again so that the process can go about its business. But the kernel also tracks the node that the page lives on and the node the accessing process was running on, noting any mismatches. For each process, the kernel maintains an array of counters to track which node each of its recently-accessed pages was located on. For pages, the information tracked is necessarily coarser; the kernel simply remembers the last node to access each page.
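In rough outline, that fault-time accounting might look something like the following simplified userspace model; the structure and function names here are hypothetical, and the real patch keeps considerably more state:

    #define MAX_NODES 8

    /* Hypothetical per-process NUMA statistics (greatly simplified). */
    struct task_numa_stats {
        unsigned long faults_on_node[MAX_NODES];  /* where recently-faulted pages live */
    };

    /* Hypothetical per-page tracking; coarser, as described above. */
    struct page_numa_info {
        int last_nid;   /* last node to access this page, -1 if unknown */
    };

    /*
     * Invoked from a (hypothetical) NUMA-hinting fault handler after the
     * page has been made "present" again: record which node the page lives
     * on for the faulting process, and which node just accessed the page.
     * A mismatch between the two is what later placement decisions key on.
     */
    static void account_numa_fault(struct task_numa_stats *stats,
                                   struct page_numa_info *info,
                                   int page_nid, int accessing_nid)
    {
        stats->faults_on_node[page_nid]++;  /* per-process counter array */
        info->last_nid = accessing_nid;     /* per-page: last accessor only */
    }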
When the time comes for the scheduler to make a decision, it passes over the per-process statistics to determine whether the target process would be better off if it were moved to another node. If the process seems to be accessing most of its pages remotely, and it is better suited to the remote node than the processes already running there, it will be migrated over. This code drew a strenuous objection from Peter, who does not like the idea of putting a big for-each-CPU loop into the middle of the scheduler's hot path. After some light resistance, Andrea agreed that this logic eventually needs to find a different home where it would run less often. For testing, though, he likes things the way they are, since running the check so often causes the scheduler to converge more quickly on its chosen solution.
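A much-simplified sketch of that decision, reusing the hypothetical task_numa_stats counters from the sketch above, might be:

    /*
     * Hypothetical placement check: find the node holding the largest share
     * of this process's recently-faulted pages.  If that is not the node
     * the process is running on, the process is a candidate for migration;
     * the real code also weighs it against whatever is already running on
     * the candidate node.
     */
    static int preferred_node(const struct task_numa_stats *stats, int nr_nodes)
    {
        int best = 0;

        for (int nid = 1; nid < nr_nodes; nid++)
            if (stats->faults_on_node[nid] > stats->faults_on_node[best])
                best = nid;
        return best;
    }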
Moving processes around will only help so much, though, if their memory is spread across multiple NUMA nodes. Getting the best performance out of the system clearly requires a mechanism to gather pages of memory onto the same node as well. In the AutoNUMA patch, the first non-local fault (in response to the special page marking described above) will cause that page's "last node ID" value to be set to the accessing node; the page will also be queued to be migrated to that node. A subsequent fault from a different node will cancel that migration, though; after the first fault, two faults in a row from the same node are required to cause the page to be queued for migration.
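That queueing policy is easier to see in code; what follows is a simplified, hypothetical model of the fault-side logic, not the patch's actual implementation:

    #include <stdbool.h>

    /* Hypothetical per-page migration state for this sketch. */
    struct page_migrate_state {
        int  last_nid;   /* last node to fault on this page, -1 initially */
        bool queued;     /* currently queued for migration? */
    };

    /*
     * First non-local fault: queue the page toward the accessing node.
     * A later fault from a different node cancels any pending migration;
     * only two consecutive faults from the same node queue it again.
     */
    static void numa_hinting_fault(struct page_migrate_state *page, int accessing_nid)
    {
        if (page->last_nid < 0)
            page->queued = true;            /* first fault: migrate eagerly */
        else if (page->last_nid != accessing_nid)
            page->queued = false;           /* bouncing between nodes: cancel */
        else
            page->queued = true;            /* two in a row: queue again */
        page->last_nid = accessing_nid;     /* also the migration target, if queued */
    }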
Every NUMA node gets a new kernel thread (knuma_migrated) that is charged with passing over the lists of pages queued for migration and actually moving them to the target node. Migration is not unconditional - it depends, for example, on there being sufficient memory available on the destination node. But, most of the time, these migration threads should manage to pull pages toward the nodes where they are actually used.
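In outline, each of these threads might run a loop along the following lines; this is a simplified userspace model with stand-in helpers, not the actual kernel code:

    #include <stdbool.h>

    /* Simplified model: the real knuma_migrated walks lists of struct page. */
    struct queued_page {
        struct queued_page *next;
    };

    /* Stand-ins for the real memory-availability check and page migration. */
    static bool node_has_free_memory(int nid) { (void)nid; return true; }
    static void migrate_page_to(struct queued_page *p, int nid) { (void)p; (void)nid; }

    /*
     * One pass of a hypothetical knuma_migrated thread for node "nid":
     * drain the node's queue, pulling each page toward this node, but give
     * up for now if the destination runs short of memory.
     */
    static void knuma_migrated_pass(struct queued_page **queue, int nid)
    {
        while (*queue) {
            struct queued_page *p = *queue;

            if (!node_has_free_memory(nid))
                break;
            *queue = p->next;
            migrate_page_to(p, nid);
        }
    }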
Beyond the above-mentioned complaint about putting heavy computation into schedule(), Peter has found a number of things to dislike about this patch set. He doesn't like the worker threads, to begin with.
Andrea responds that the cost of these threads is small to the point that it cannot really be measured. It is a little harder to shrug off Peter's other complaint, though: that this patch set consumes a large amount of memory. The kernel maintains one struct page for every page of memory in the system. Since a typical system can have millions of pages, this structure must be kept as small as possible. But the AutoNUMA patch adds a list_head structure (for the migration queue) and two counters to each page structure. The end result can be a lot of memory lost to the AutoNUMA machinery.
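The exact numbers depend on the kernel configuration, but a back-of-the-envelope calculation (assuming 64-bit pointers and two four-byte counters per page, which may not match the patch exactly) shows why the overhead draws complaints:

    #include <stdio.h>

    int main(void)
    {
        /* Assumed sizes, for illustration only: a two-pointer list_head
         * plus two four-byte counters added to every struct page. */
        const unsigned long per_page  = 16 + 2 * 4;   /* 24 bytes */
        const unsigned long page_size = 4096;
        const unsigned long ram_bytes = 16UL << 30;   /* a 16GB machine */
        const unsigned long pages     = ram_bytes / page_size;

        printf("AutoNUMA overhead: %lu MB (%.2f%% of RAM)\n",
               (pages * per_page) >> 20,
               100.0 * per_page / page_size);
        return 0;
    }

Under those assumptions the cost works out to roughly 96MB on a 16GB machine, or about 0.6% of total memory; larger systems lose proportionally more.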
The plan is to eventually move this information out of struct page; then, among other things, the kernel could avoid allocating it at all when AutoNUMA is not actually in use. On NUMA systems, though, that memory will still be consumed regardless of where it lives, and some users are unlikely to be happy about that, even if others, as Andrea asserts, would gladly give up a big chunk of memory in exchange for a 20% performance improvement. This looks like an argument that will not be settled in the near future; chances are, the memory impact of AutoNUMA will need to be reduced somehow. Perhaps, your editor naively suggests, knuma_migrated and its per-page list_head structure could be replaced by the "lazy migration" scheme used in Peter's patch.
NUMA scheduling is hard, and doing it right requires significant expertise in both scheduling and memory management. So it seems like a good thing that the problem is receiving attention from some of the community's top scheduler and memory-management developers. It may be that one (or both) of these solutions will prove unworkable for some workloads and simply have to be dropped. What may be more likely, though, is that these developers will eventually stop poking holes in each other's patches and, together, figure out how to combine the best aspects of each into a working solution that all can live with. What does seem certain is that getting to that point will take some time.