A user running CentOS 7 with kernel 3.10.0-123.13.1 reported the following bug:
[3550602.146301] kernel BUG at mm/memory.c:3765!
/* THP should already have been handled */
BUG_ON(pmd_numa(*pmd));
After checking the git log of mm/memory.c, we found exactly the fix for this problem:
commit 107437febd495a50e2cd09c81bbaa84d30e57b07
Author: Rik van Riel
Date: Tue Apr 29 15:36:15 2014 -0400
mm/numa: Remove BUG_ON() in __handle_mm_fault()
Changing PTEs and PMDs to pte_numa & pmd_numa is done with the
mmap_sem held for reading, which means a pmd can be instantiated
and turned into a numa one while __handle_mm_fault() is examining
the value of old_pmd.
If that happens, __handle_mm_fault() should just return and let
the page fault retry, instead of throwing an oops. This is
handled by the test for pmd_trans_huge(*pmd) below.
In other words, a pmd can be instantiated and turned into a numa pmd while handle_mm_fault is
checking the value of old_pmd, so by the time we fetch the current pmd value through *pmd it may
already be a numa one, and the BUG_ON is triggered. The key to this problem is that both handle_mm_fault
and the numa pmd update run under down_read(mmap_sem), so there is no
mutual exclusion between them and neither is aware of the other.
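To make the interleaving concrete, here is a minimal single-threaded sketch of the race. Everything in it (toy_pmd, recheck_pmd, simulate_race) is a made-up model and not kernel code. It only replays one fixed ordering of the two mmap_sem readers and shows why the old BUG_ON fires while the fixed code simply lets the fault retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy pmd: only the property that matters for this race. */
struct toy_pmd {
    bool numa;
};

enum fault_result { FAULT_HANDLED, FAULT_RETRY, FAULT_BUG };

/* Models the part of the fault path that re-reads *pmd after old_pmd was sampled. */
static enum fault_result recheck_pmd(const struct toy_pmd *pmd, bool has_bug_on)
{
    assert(pmd != NULL);
    if (pmd->numa)
        return has_bug_on ? FAULT_BUG : FAULT_RETRY;
    return FAULT_HANDLED;
}

/* Replays one fixed interleaving; both sides only hold the lock for reading. */
static enum fault_result simulate_race(bool has_bug_on)
{
    struct toy_pmd pmd = { .numa = false };
    struct toy_pmd old_pmd = pmd;  /* fault handler samples the pmd */

    pmd.numa = true;               /* NUMA updater runs in between */
    assert(!old_pmd.numa);         /* the sampled copy is already stale */
    return recheck_pmd(&pmd, has_bug_on);
}
```

In this model simulate_race(true) ends in FAULT_BUG and simulate_race(false) ends in FAULT_RETRY, which is the behaviour change the commit message describes.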
See the comment on lock_pte_protection:
/*
* For a prot_numa update we only hold mmap_sem for read so there is a
* potential race with faulting where a pmd was temporarily none. This
* function checks for a transhuge pmd under the appropriate lock. It
* returns a pte if it was successfully locked or NULL if it raced with
* a transhuge insertion.
*/
For example, via the following path:
do_migrate_pages->check_range->change_prot_numa->change_pte_range->ptep_set_numa
Other bugs were then reported, such as:
[3557075.805696] kvm: zapping shadow pages for mmio generation wraparound
[3625477.198897] Adding 4194300k swap on /dev/mapper/os-swap. Priority:-1 extents:1 across:4194300k FS
[3626233.759633] traps: ceilometer-agen[3417] general protection ip:7f184c3f039e sp:7fff42d5fbc0 error:0 in libc-2.17.so[7f184c372000+1b6000]
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5
[3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5
[3645969.174217] systemd-journald[1082]: Vacuuming done, freed 0 bytes
[3801900.193751] swap_free: Unused swap offset entry 00053980
[3801900.193773] BUG: Bad page map in process qemu-kvm pte:0a730000 pmd:4d6635067
[3801900.193791] addr:00007f4d20481000 vm_flags:a0120073 anon_vma:ffff88204d1d5840 mapping: (null) index:7f4d20481
[3801900.193820] CPU: 7 PID: 45513 Comm: qemu-kvm Not tainted 3.10.0-123.13.1.el7.x86_64 #1
[3801900.193822] Hardware name: To be filled by O.E.M. Tecal RH5885 V3/BC61BLCA, BIOS BLISV301 07/25/2014
[3801900.193824] 00007f4d20481000 00000000934bf4bb ffff88084c071998 ffffffff815e232e
[3801900.193834] ffff88084c0719e0 ffffffff81167b28 000000000a730000 00000007f4d20481
[3801900.193839] ffff8804d6635408 000000000a730000 00007f4d20481000 00007f4d20600000
[3801900.193844] Call Trace:
[3801900.193855] [] dump_stack+0x19/0x1b
[3801900.193862] [] print_bad_pte+0x1a8/0x240
[3801900.193866] [] unmap_page_range+0x5b0/0x860
[3801900.193870] [] unmap_single_vma+0x81/0xf0
[3801900.193873] [] unmap_vmas+0x49/0x90
[3801900.193877] [] exit_mmap+0xac/0x1a0
[3801900.193884] [] mmput+0x67/0xf0
[3801900.193903] [] vhost_dev_cleanup+0x231/0x260 [vhost_net]
[3801900.193908] [] vhost_net_release+0x4e/0xc0 [vhost_net]
[3801900.193915] [] __fput+0xe9/0x270
[3801900.193919] [] ____fput+0xe/0x10
[3801900.193925] [] task_work_run+0xc4/0xe0
[3801900.193928] [] do_exit+0x2cb/0xa60
[3801900.193934] [] ? __sigqueue_free.part.11+0x33/0x40
[3801900.193937] [] ? __dequeue_signal+0x13c/0x220
[3801900.193941] [] do_group_exit+0x3f/0xa0
[3801900.193955] [] get_signal_to_deliver+0x1d0/0x6e0
[3801900.193961] [] do_signal+0x57/0x600
[3801900.193965] [] ? do_send_sig_info+0x6c/0xa0
[3801900.193969] [] do_notify_resume+0x69/0xb0
[3801900.193974] [] int_signal+0x12/0x17
[3801900.193976] Disabling lock debugging due to kernel taint
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5
[3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5
The corresponding code is:
__mmdrop->check_mm
static void check_mm(struct mm_struct *mm)
{
int i;
for (i = 0; i < NR_MM_COUNTERS; i++) {
long x = atomic_long_read(&mm->rss_stat.count[i]);
if (unlikely(x))
printk(KERN_ALERT "BUG: Bad rss-counter state "
"mm:%p idx:%d val:%ld\n", mm, i, x);
}
}
This code checks that all resident pages accounted to the mm have been
released: when a process exits, the MM_FILEPAGES, MM_ANONPAGES and
MM_SWAPENTS counters must all be back to zero. According to the log,
MM_ANONPAGES (idx:1) is 5 while MM_SWAPENTS (idx:2) is -5, which
should never happen.
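The matching +5 and -5 suggest that five entries were accounted in one counter and released from the other. The sketch below is only a hypothetical toy model of that idea (toy_mm, run_teardown and bad_counters are made-up names). It does not prove that this is what happened on the crashed host.

```c
#include <assert.h>

enum { MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, NR_MM_COUNTERS };

struct toy_mm {
    long count[NR_MM_COUNTERS];
};

/*
 * Hypothetical scenario: `pages` anon pages are mapped, but `misread` of
 * their ptes are torn down as if they were swap entries.
 */
static struct toy_mm run_teardown(int pages, int misread)
{
    struct toy_mm mm = { { 0 } };
    int i;

    assert(pages >= 0 && misread >= 0 && misread <= pages);
    for (i = 0; i < pages; i++)
        mm.count[MM_ANONPAGES]++;        /* fault-in accounting */
    for (i = 0; i < pages; i++) {
        if (i < misread)
            mm.count[MM_SWAPENTS]--;     /* wrong counter is decremented */
        else
            mm.count[MM_ANONPAGES]--;    /* normal teardown */
    }
    return mm;
}

/* Same test as check_mm(): how many counters are left non-zero. */
static int bad_counters(struct toy_mm mm)
{
    int i, bad = 0;

    for (i = 0; i < NR_MM_COUNTERS; i++)
        if (mm.count[i])
            bad++;
    return bad;
}
```

With run_teardown(8, 5) the model ends with MM_ANONPAGES at 5 and MM_SWAPENTS at -5, the same pair of values as in the log, while run_teardown(8, 0) leaves every counter at zero.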
Then we see:
[3801900.193751] swap_free: Unused swap offset entry 00053980
According to the backtrace, this happens in zap_pte_range:
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
int rss[NR_MM_COUNTERS];
do {
pte_t ptent = *pte;
//the page is in memory, not in swap
if (pte_present(ptent)) {
//normal page: clear the pte and drop the mapping
continue;
}
//not present: a file pte can only belong to a nonlinear mapping,
//anything else is treated as a swap entry of an anon page.
if (pte_file(ptent)) {
if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
print_bad_pte(vma, addr, ptent, NULL);
} else {
//entry is composed of (swap_type (which swap area), swap_offset (page offset inside that area))
swp_entry_t entry = pte_to_swp_entry(ptent);
if (!non_swap_entry(entry))
rss[MM_SWAPENTS]--;
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
}
} while (pte++, addr += PAGE_SIZE, addr != end);
}
The failure comes from swap_info_get(entry), called from free_swap_and_cache(entry):
static struct swap_info_struct *swap_info_get(swp_entry_t entry)
{
struct swap_info_struct *p;
unsigned long offset, type;
if (!entry.val)
goto out;
type = swp_type(entry);
if (type >= nr_swapfiles)
goto bad_nofile;
p = swap_info[type];
if (!(p->flags & SWP_USED))
goto bad_device;
offset = swp_offset(entry);
if (offset >= p->max)
goto bad_offset;
if (!p->swap_map[offset])
goto bad_free;
spin_lock(&p->lock);
return p;
bad_free:
printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
goto out;
bad_offset:
printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
goto out;
bad_device:
printk(KERN_ERR "swap_free: %s%08lx\n", Unused_file, entry.val);
goto out;
bad_nofile:
printk(KERN_ERR "swap_free: %s%08lx\n", Bad_file, entry.val);
out:
return NULL;
}
The bad_free branch was taken: the entry value is 00053980, which is composed of the swap type and the swap offset, and swap_map[offset] is zero for it.
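To see what 00053980 means, it helps to split it by hand. The sketch below assumes the usual 3.10 layout: the arch-independent swp_entry_t keeps the type in its top 5 bits, and a non-present x86_64 pte keeps the type in bits 1-5 and the offset from bit 9 upward. These constants are assumptions and were not read from the vmcore, but they are consistent with the logged pair pte:0a730000 and entry 00053980. The helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants for this 3.10 x86_64 build; not read from the vmcore. */
#define MAX_SWAPFILES_SHIFT   5
#define SWP_TYPE_SHIFT        (64 - MAX_SWAPFILES_SHIFT)
#define PTE_SWP_TYPE_SHIFT    1   /* just above the present bit */
#define PTE_SWP_TYPE_MASK     0x1fu
#define PTE_SWP_OFFSET_SHIFT  9

struct swp_decoded {
    unsigned int type;
    uint64_t offset;
};

/* Split the arch-independent swp_entry_t value (what swap_free prints). */
static struct swp_decoded decode_swp_entry(uint64_t val)
{
    struct swp_decoded d;

    d.type = (unsigned int)(val >> SWP_TYPE_SHIFT);
    d.offset = val & ((UINT64_C(1) << SWP_TYPE_SHIFT) - 1);
    return d;
}

/* Split a non-present x86_64 pte (what print_bad_pte prints as pte:). */
static struct swp_decoded decode_swap_pte(uint64_t pte)
{
    struct swp_decoded d;

    assert((pte & 1) == 0);   /* a swap pte never has the present bit set */
    d.type = (unsigned int)((pte >> PTE_SWP_TYPE_SHIFT) & PTE_SWP_TYPE_MASK);
    d.offset = pte >> PTE_SWP_OFFSET_SHIFT;
    return d;
}
```

Under these assumptions both 00053980 and pte:0a730000 decode to type 0, offset 0x53980 (342400). That is below the 1048575 pages of the 4194300k swap area, which fits the "Unused swap offset" (bad_free) message rather than bad_offset.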
So let's start the crash utility on the kdump vmcore.
crash /usr/lib/debug/lib/modules/3.10.0-123.13.1.el7.x86_64/vmlinux vmcore
According to dmesg in crash, the problem first shows up after the swap partition is enabled:
[3625477.198897] Adding 4194300k swap on /dev/mapper/os-swap. Priority:-1 extents:1 across:4194300k FS
[3626233.759633] traps: ceilometer-agen[3417] general protection ip:7f184c3f039e sp:7fff42d5fbc0 error:0 in libc-2.17.so[7f184c372000+1b6000]
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5
[3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5
[3645969.174217] systemd-journald[1082]: Vacuuming done, freed 0 bytes
[3801900.193751] swap_free: Unused swap offset entry 00053980
[3801900.193773] BUG: Bad page map in process qemu-kvm pte:0a730000 pmd:4d6635067
I've seen similar bugs reported at:
http://linux-kernel.2935.n7.nabble.com/unused-swap-offset-bad-page-map-td698180.html
However, per Hugh's analysis in that thread, the problem there was caused by mremap setting a pte incorrectly,
and that bug had not yet been introduced in our 3.10 kernel, so we have to look for other clues. From the thread:
I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
a line in mremap which worries me. That set_pte_at() is operating
on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
prone to corrupt a swap entry.
> That said, google does find "swap_free: Unused swap offset entry"
> reports from over the years. Most of them seem to be single-bit
> errors, though (ie when the entry is 00000100 or similar I'm more
> inclined to blame a bit error
Yes, historically they have usually represented either single-bit
errors, or corruption of page tables by other kernel data. The
swap subsystem discovers it, but it's rarely an error of swap.
Back in crash, we first check the process running on each CPU, and we find that
the task that panicked is kswapd:
crash> bt -a | grep kswap
PID: 476 TASK: ffff88085345c440 CPU: 27 COMMAND: "kswapd3"
#13 [ffff880852ca9e28] kswapd at ffffffff811572db
crash> set 476
PID: 476
COMMAND: "kswapd3"
TASK: ffff88085345c440 [THREAD_INFO: ffff880852ca8000]
CPU: 27
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 476 TASK: ffff88085345c440 CPU: 27 COMMAND: "kswapd3"
#0 [ffff880852ca96f0] machine_kexec at ffffffff810411c1
#1 [ffff880852ca9748] crash_kexec at ffffffff810cf2b2
#2 [ffff880852ca9818] oops_end at ffffffff815eae48
#3 [ffff880852ca9840] die at ffffffff8101632b
#4 [ffff880852ca9870] do_general_protection at ffffffff815ea74e
#5 [ffff880852ca98a0] general_protection at ffffffff815ea068
[exception RIP: down_read_trylock+9]
RIP: ffffffff8108a8e9 RSP: ffff880852ca9950 RFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff88204d1d5840 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0600380100804d3c
RBP: ffff880852ca9950 R8: ffffffff81944ae0 R9: ffff880852ca9ba0
R10: 00000000000000f0 R11: ffffea006b203a40 R12: ffff88204d1d5841
R13: ffffea0078f117c0 R14: 0600380100804d3c R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#6 [ffff880852ca9958] page_lock_anon_vma_read at ffffffff811779c5
#7 [ffff880852ca9988] page_referenced at ffffffff81177c47
#8 [ffff880852ca9a00] shrink_page_list at ffffffff811541c4
#9 [ffff880852ca9b38] shrink_inactive_list at ffffffff81154e6a
#10 [ffff880852ca9c00] shrink_lruvec at ffffffff81155965
#11 [ffff880852ca9cf8] shrink_zone at ffffffff81155d76
#12 [ffff880852ca9d50] balance_pgdat at ffffffff8115701c
#13 [ffff880852ca9e28] kswapd at ffffffff811572db
#14 [ffff880852ca9ec8] kthread at ffffffff81085aff
#15 [ffff880852ca9f50] ret_from_fork at ffffffff815f296c
crash>
According to dmesg, this is a 'general protection fault':
[4401153.261568] device tap292007ca-7d entered promiscuous mode
[4401153.285726] qbr292007ca-7d: port 2(tap292007ca-7d) entered forwarding state
[4401153.285754] qbr292007ca-7d: port 2(tap292007ca-7d) entered forwarding state
[4401161.284223] general protection fault: 0000 [#1] SMP
[4401161.284675] Modules linked in: vhost_net macvtap macvlan tun nf_conntrack_ipv6 nf_defrag_ipv6 xt_mac xt_physdev xt_multiport xt_set xt_conntrack iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw ip_set_hash_ip ip_set nfnetlink veth ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables openvswitch vxlan ip_tunnel gre 8021q garp mrp bridge stp llc bonding sg iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ixgbe i2c_i801 ses enclosure i2c_core ptp pps_core mdio mei_me lpc_ich dca mei mfd_core shpchp wmi acpi_cpufreq acpi_pad mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod lpfc crc32c_intel crc_t10dif
[4401161.289521] crct10dif_common scsi_transport_fc ahci libahci scsi_tgt libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[4401161.290475] CPU: 27 PID: 476 Comm: kswapd3 Tainted: G B -------------- 3.10.0-123.13.1.el7.x86_64 #1
[4401161.291334] Hardware name: To be filled by O.E.M. Tecal RH5885 V3/BC61BLCA, BIOS BLISV301 07/25/2014
[4401161.292217] task: ffff88085345c440 ti: ffff880852ca8000 task.ti: ffff880852ca8000
[4401161.293088] RIP: 0010:[] [] down_read_trylock+0x9/0x30
[4401161.294006] RSP: 0018:ffff880852ca9950 EFLAGS: 00010206
[4401161.294462] RAX: 0000000000000000 RBX: ffff88204d1d5840 RCX: 0000000000000000
[4401161.295298] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0600380100804d3c
[4401161.296172] RBP: ffff880852ca9950 R08: ffffffff81944ae0 R09: ffff880852ca9ba0
[4401161.297061] R10: 00000000000000f0 R11: ffffea006b203a40 R12: ffff88204d1d5841
[4401161.297949] R13: ffffea0078f117c0 R14: 0600380100804d3c R15: 0000000000000001
[4401161.298841] FS: 0000000000000000(0000) GS:ffff88207fc60000(0000) knlGS:0000000000000000
[4401161.299725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4401161.300173] CR2: 000000000154e578 CR3: 00000000018d0000 CR4: 00000000001427e0
[4401161.301057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[4401161.301944] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[4401161.302828] Stack:
[4401161.303276] ffff880852ca9980 ffffffff811779c5 ffffea0078f117c0 0000000000000000
[4401161.304165] ffff880852ca9bb8 ffff88204d1d5841 ffff880852ca99f8 ffffffff81177c47
[4401161.305046] 0000000000000297 ffff880852ca9da8 ffff880852ca9ac8 0000000000000000
[4401161.305936] Call Trace:
[4401161.306380] [] page_lock_anon_vma_read+0x55/0x110
[4401161.306832] [] page_referenced+0x1c7/0x350
[4401161.307278] [] shrink_page_list+0x4b4/0xab0
[4401161.307724] [] shrink_inactive_list+0x1ea/0x580
[4401161.308159] [] shrink_lruvec+0x375/0x710
[4401161.308598] [] shrink_zone+0x76/0x1a0
[4401161.309039] [] balance_pgdat+0x48c/0x5e0
[4401161.309483] [] kswapd+0x16b/0x430
[4401161.309938] [] ? wake_up_bit+0x30/0x30
[4401161.310382] [] ? balance_pgdat+0x5e0/0x5e0
[4401161.310849] [] kthread+0xcf/0xe0
[4401161.311281] [] ? kthread_create_on_node+0x140/0x140
[4401161.311765] [] ret_from_fork+0x7c/0xb0
[4401161.312256] [] ? kthread_create_on_node+0x140/0x140
[4401161.312728] Code: 00 00 00 ba 01 00 00 00 48 89 de e8 12 fe ff ff eb ce 48 c7 c0 f2 ff ff ff eb c5 e8 02 34 fd ff 66 90 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7
[4401161.314361] RIP [] down_read_trylock+0x9/0x30
[4401161.314836] RSP
Let's disassemble page_lock_anon_vma_read with dis:
0xffffffff81177970 : nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff81177975 : push %rbp
0xffffffff81177976 : mov %rsp,%rbp
0xffffffff81177979 : push %r14
0xffffffff8117797b : push %r13
0xffffffff8117797d : mov %rdi,%r13
0xffffffff81177980 : push %r12
0xffffffff81177982 : push %rbx
0xffffffff81177983 : mov 0x8(%rdi),%r12
0xffffffff81177987 : mov %r12,%rax
0xffffffff8117798a : and $0x3,%eax
0xffffffff8117798d : cmp $0x1,%rax
0xffffffff81177991 : je 0xffffffff811779a8
0xffffffff81177993 : xor %ebx,%ebx
0xffffffff81177995 : mov %rbx,%rax
0xffffffff81177998 : pop %rbx
0xffffffff81177999 : pop %r12
0xffffffff8117799b : pop %r13
0xffffffff8117799d : pop %r14
0xffffffff8117799f : pop %rbp
0xffffffff811779a0 : retq
0xffffffff811779a1 : nopl 0x0(%rax)
0xffffffff811779a8 : mov 0x18(%rdi),%eax
0xffffffff811779ab : test %eax,%eax
0xffffffff811779ad : js 0xffffffff81177993
0xffffffff811779af : mov -0x1(%r12),%r14
0xffffffff811779b4 : lea -0x1(%r12),%rbx
0xffffffff811779b9 : add $0x8,%r14
0xffffffff811779bd : mov %r14,%rdi
0xffffffff811779c0 : callq 0xffffffff8108a8e0
0xffffffff811779c5 : test %eax,%eax
Let's walk back through page_lock_anon_vma_read:
0xffffffff81177983 : mov 0x8(%rdi),%r12
0xffffffff81177987 : mov %r12,%rax
anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
struct anon_vma *page_lock_anon_vma_read(struct page *page)
struct page {
/* First double word block */
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
struct address_space *mapping; /* If low bit clear, points to
So r12 comes from the unsigned long anon_mapping (page->mapping, at offset 0x8 of struct page), and r12 - 1 is the value of anon_vma.
0xffffffff811779af : mov -0x1(%r12),%r14
which corresponds to
root_anon_vma = ACCESS_ONCE(anon_vma->root);
because
1. anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
and PAGE_MAPPING_ANON equals 1, so r12 - 1 is anon_vma.
2. root is the first member of struct anon_vma, so the load from -0x1(%r12) reads anon_vma->root.
So r14 holds the value of root_anon_vma. Then:
0xffffffff811779b9 : add $0x8,%r14
0xffffffff811779bd : mov %r14,%rdi
struct anon_vma {
struct anon_vma *root; /* Root of this anon_vma tree */
struct rw_semaphore rwsem; /* W: modification, R: walking the list */
So r14 and rdi both hold the address of the rwsem embedded in root_anon_vma.
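The pointer arithmetic done by these few instructions can be written out in plain C. This is a sketch with a mock struct, not the kernel definitions: mock_anon_vma only reproduces the position of rwsem right after root, and the helper names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON   1u
#define PAGE_MAPPING_FLAGS  3u   /* the low two bits tested by `and $0x3` */

/* Mock with just the leading members; the rwsem size is irrelevant here. */
struct mock_anon_vma {
    uint64_t root;               /* offset 0: struct anon_vma *root */
    unsigned char rwsem[32];     /* offset 8: struct rw_semaphore   */
};

/* and $0x3,%eax ; cmp $0x1,%rax */
static int mapping_is_anon(uint64_t mapping)
{
    return (mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON;
}

/* lea -0x1(%r12),%rbx */
static uint64_t mapping_to_anon_vma(uint64_t mapping)
{
    assert(mapping_is_anon(mapping));
    return mapping - PAGE_MAPPING_ANON;
}

/* add $0x8,%r14 */
static uint64_t rwsem_address(uint64_t root_anon_vma)
{
    return root_anon_vma + offsetof(struct mock_anon_vma, rwsem);
}
```

For the crashed task, r12 = ffff88204d1d5841 passes the anon test and gives anon_vma = ffff88204d1d5840, which is exactly the value left in rbx.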
The faulting RIP is:
RIP: 0010:[] [] down_read_trylock+0x9/0x30
down_read_trylock(&root_anon_vma->rwsem)
which dereferences rdi:
0xffffffff8108a8e9 : mov (%rdi),%rax
static inline int __down_read_trylock(struct rw_semaphore *sem)
{
long result, tmp;
asm volatile("# beginning __down_read_trylock\n\t"
" mov %0,%1\n\t"
"1:\n\t"
" mov %1,%2\n\t"
" add %3,%2\n\t"
" jle 2f\n\t"
LOCK_PREFIX " cmpxchg %2,%0\n\t"
" jnz 1b\n\t"
"2:\n\t"
"# ending __down_read_trylock\n\t"
: "+m" (sem->count), "=&a" (result), "=&r" (tmp)
: "i" (RWSEM_ACTIVE_READ_BIAS)
: "memory", "cc");
return result >= 0 ? 1 : 0;
}
So the faulting instruction is 'mov %0,%1', which loads sem->count into result, but
sem is invalid. From the register context we know that
rdi is 0600380100804d3c, so sem is 0600380100804d3c, which is not a valid kernel address.
Since r14 was derived from r12, let's read the memory at r12 - 1, which holds
struct anon_vma *root_anon_vma; r12 is ffff88204d1d5841:
crash> rd ffff88204d1d5840
ffff88204d1d5840: 0600380100804d34 4M...8..
Adding 8 to it gives exactly the bad address in r14: 0600380100804d3c
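This also explains why the oops is a general protection fault rather than a page fault: on x86-64, a load from a non-canonical address raises #GP. Below is a small sketch of that check, assuming 48-bit virtual addresses (4-level paging, which is an assumption about this machine); is_canonical is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Canonical means bits 63..(va_bits-1) are all copies of bit (va_bits-1). */
static int is_canonical(uint64_t addr, unsigned int va_bits)
{
    uint64_t top, all_ones;

    assert(va_bits == 48 || va_bits == 57);   /* x86-64 widths only */
    top = addr >> (va_bits - 1);
    all_ones = (UINT64_C(1) << (64 - va_bits + 1)) - 1;
    return top == 0 || top == all_ones;
}
```

With va_bits set to 48, the anon_vma address ffff88204d1d5840 is canonical, while 0600380100804d3c is not, so the CPU faults on the load before any page table is consulted.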
Obviously the memory around ffff88204d1d5840 has been overwritten. Let's look at this struct:
struct anon_vma {
root = 0x600380100804d34,
rwsem = {
count = 1099511654660,
wait_lock = {
raw_lock = {
{
head_tail = 0,
tickets = {
head = 0,
tail = 0
}
}
}
},
wait_list = {
next = 0x0,
prev = 0x996e000000000020
}
},
refcount = {
counter = 680270649
},
rb_root = {
rb_node = 0x84b9568a801bb81c
}
}
This is quite confusing, so let's dump the raw memory in this range with rd:
crash> struct anon_vma -o
struct anon_vma {
[0] struct anon_vma *root;
[8] struct rw_semaphore rwsem;
[40] atomic_t refcount;
[48] struct rb_root rb_root;
}
SIZE: 56
crash> rd ffff88204d1d5840 38
ffff88204d1d5840: 0600380100804d34 0000010000006904 4M...8...i......
ffff88204d1d5850: 0000000000000000 0000000000000000 ................
ffff88204d1d5860: 996e000000000020 84b92543288c1b39 .....n.9..(C%..
ffff88204d1d5870: 84b9568a801bb81c 0000568a801bb81c .....V.......V..
So the anon_vma is corrupted.
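As a cross-check, the raw words from rd can be mapped back onto the field offsets by hand. This is a sketch: the offsets of root, rwsem, refcount and rb_root come from the `struct anon_vma -o` output, while the layout inside rw_semaphore (count, wait_lock, wait_list) is an assumption, and anon_vma_view is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The first seven 8-byte words from `rd ffff88204d1d5840`. */
static const uint64_t dump_words[7] = {
    UINT64_C(0x0600380100804d34), UINT64_C(0x0000010000006904),
    UINT64_C(0x0000000000000000), UINT64_C(0x0000000000000000),
    UINT64_C(0x996e000000000020), UINT64_C(0x84b92543288c1b39),
    UINT64_C(0x84b9568a801bb81c),
};

/* Field view; offsets inside rw_semaphore are assumed, the rest is from -o. */
struct anon_vma_view {
    uint64_t root;            /* [0]  */
    int64_t  rwsem_count;     /* [8]  */
    uint32_t wait_lock;       /* [16] */
    uint64_t wait_list_next;  /* [24] */
    uint64_t wait_list_prev;  /* [32] */
    int32_t  refcount;        /* [40] */
    uint64_t rb_node;         /* [48] */
};

static struct anon_vma_view decode_anon_vma(const uint64_t *w)
{
    struct anon_vma_view v;

    assert(w != NULL);
    v.root = w[0];
    v.rwsem_count = (int64_t)w[1];
    v.wait_lock = (uint32_t)w[2];
    v.wait_list_next = w[3];
    v.wait_list_prev = w[4];
    v.refcount = (int32_t)(uint32_t)w[5];   /* low half on little-endian x86 */
    v.rb_node = w[6];
    return v;
}
```

The decoded values agree with what crash printed for the struct (count 1099511654660, refcount 680270649), so crash is interpreting the bytes faithfully and the content itself is garbage: root is not a kernel pointer, and wait_list.next is NULL while wait_list.prev is not.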
Since anon_vma is closely tied to swap, we look at the swap_map counters again
and figure out how a swap_map entry gets incremented.
The initial value of a swap_map entry is zero. When a process's pages are swapped out
by shrink_page_list, a free swap slot must be found to store the page content.
static unsigned long shrink_page_list(struct list_head *page_list)
{
spin_lock(&swap_lock);
while (!list_empty(page_list)) {
//an anon page that needs to be swapped out
if (PageAnon(page) && !PageSwapCache(page)) {
//try to find a free swap entry (add_to_swap -> get_swap_page, inlined here)
//check each swap area
for (type = swap_list.next) {
//try to find a free swap entry inside the current swap area
si = swap_info[type];
spin_lock(&si->lock);
spin_unlock(&swap_lock);
//the slot search itself is quite intricate; the slot found is marked SWAP_HAS_CACHE
offset = scan_swap_map(si, SWAP_HAS_CACHE);
spin_unlock(&si->lock);
}
SetPageDirty(page);
}
//replace the pte with the swap entry
try_to_unmap(page);
//the page is dirty (set in the previous step), so write it out
pageout(page);
pageout(page);
}
}
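The life cycle of one swap_map byte can be mimicked with a toy model. The sketch below is not the kernel implementation: toy_swap and the toy_* helpers are made up, the first-fit scan is a stand-in for scan_swap_map, and only the SWAP_HAS_CACHE value (0x40) is meant to match the 3.10 source. It shows how freeing an entry whose byte is already zero ends in the "Unused swap offset entry" case.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOTS       8
#define SWAP_HAS_CACHE  0x40   /* flag bit stored in the same byte as the count */

struct toy_swap {
    unsigned char map[TOY_SLOTS];   /* stand-in for si->swap_map */
};

/* First-fit stand-in for scan_swap_map(si, SWAP_HAS_CACHE); slot 0 is the header. */
static int toy_alloc(struct toy_swap *s)
{
    int off;

    for (off = 1; off < TOY_SLOTS; off++) {
        if (s->map[off] == 0) {
            s->map[off] = SWAP_HAS_CACHE;
            return off;
        }
    }
    return -1;
}

/* A pte now stores the swap entry: one more user reference. */
static void toy_dup(struct toy_swap *s, int off)
{
    assert(off > 0 && off < TOY_SLOTS);
    s->map[off]++;
}

/* The page left the swap cache. */
static void toy_drop_cache(struct toy_swap *s, int off)
{
    assert(off > 0 && off < TOY_SLOTS);
    s->map[off] &= (unsigned char)~SWAP_HAS_CACHE;
}

/* Returns 0 on success, -1 for the "Unused swap offset entry" case. */
static int toy_free(struct toy_swap *s, int off)
{
    assert(off > 0 && off < TOY_SLOTS);
    if (s->map[off] == 0)
        return -1;
    if (s->map[off] & (unsigned char)~SWAP_HAS_CACHE)
        s->map[off]--;
    return 0;
}

/* Swap one page out, free it, then free the same entry again. */
static int toy_second_free_result(void)
{
    struct toy_swap s;
    int off;

    memset(&s, 0, sizeof(s));
    off = toy_alloc(&s);
    toy_dup(&s, off);
    toy_drop_cache(&s, off);
    if (toy_free(&s, off) != 0)
        return 1;
    return toy_free(&s, off);
}
```

In this model the second toy_free returns -1. A stale pte, or a corrupted pte that merely looks like a swap entry, would hit the same branch when its slot count is zero.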
Besides the swap error, there is also a suspicious warning about the mm rss counters:
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5
[3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5
It is too hard for us to track down the root cause, because the problem may be a chain reaction
from other bugs. According to search results, here is a possible fix for this issue:
http://www.serverphorums.com/read.php?12,950567
which says:
I saw a bunch of similar errors on 3.14.x up to and including 3.14.4,
running Java (Tomcat) and Postgres on Xen PV. Have not seen it since
"mm: use paravirt friendly ops for NUMA hinting ptes" landed in 3.14.5.
402e194dfc5b38d99f9c65b86e2666b29adebf8c in stable,
29c7787075c92ca8af353acd5301481e6f37082f upstream
commit 29c7787075c92ca8af353acd5301481e6f37082f
Author: Mel Gorman
Date: Fri Apr 18 15:07:21 2014 -0700
mm: use paravirt friendly ops for NUMA hinting ptes
David Vrabel identified a regression when using automatic NUMA balancing
under Xen whereby page table entries were getting corrupted due to the
use of native PTE operations. Quoting him
Xen PV guest page tables require that their entries use machine
addresses if the preset bit (_PAGE_PRESENT) is set, and (for
successful migration) non-present PTEs must use pseudo-physical
addresses. This is because on migration MFNs in present PTEs are
translated to PFNs (canonicalised) so they may be translated back
to the new MFN in the destination domain (uncanonicalised).
pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
set and clear the _PAGE_PRESENT bit using pte_set_flags(),
pte_clear_flags(), etc.
In a Xen PV guest, these functions must translate MFNs to PFNs
when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
_PAGE_PRESENT.
His suggested fix converted p[te|md]_[set|clear]_flags to using
paravirt-friendly ops but this is overkill. He suggested an alternative
of using p[te|md]_modify in the NUMA page table operations but this is
does more work than necessary and would require looking up a VMA for
protections.
This patch modifies the NUMA page table operations to use paravirt
friendly operations to set/clear the flags of interest. Unfortunately
this will take a performance hit when updating the PTEs on
CONFIG_PARAVIRT but I do not see a way around it that does not break
Xen.