[知其然不知其所以然-18] handle_mm_fault, bad page等错误现场分析

Someone using centos 7 with 3.10.123.13.1 says there is a bug report:

[3550602.146301] kernel BUG at mm/memory.c:3765!

/* THP should already have been handled */
BUG_ON(pmd_numa(*pmd));

After checking the git log of mm/memory.c, we found exactly the fix for this problem:

commit 107437febd495a50e2cd09c81bbaa84d30e57b07
Author: Rik van Riel 
Date:   Tue Apr 29 15:36:15 2014 -0400

    mm/numa: Remove BUG_ON() in __handle_mm_fault()

    Changing PTEs and PMDs to pte_numa & pmd_numa is done with the
    mmap_sem held for reading, which means a pmd can be instantiated
    and turned into a numa one while __handle_mm_fault() is examining
    the value of old_pmd.

    If that happens, __handle_mm_fault() should just return and let
    the page fault retry, instead of throwing an oops. This is
    handled by the test for pmd_trans_huge(*pmd) below.

The main scenario here is that, someone may modify the attribute of a normal pmd

to a numa pmd when handle_mm_fault is checking the value of old_pmd,

so when we comes to fetch current pmd value by *pmd, it might be changed to a

numa one, so it is triggered. The key to this problem is that, either handle_mm_fault

or modification of numa_pmd is protected by down_read(mmap_sem), so there is no

exclusive lock between them, they are not aware of each others.

see comments of lock_pte_protection:

/*
 * For a prot_numa update we only hold mmap_sem for read so there is a
 * potential race with faulting where a pmd was temporarily none. This
 * function checks for a transhuge pmd under the appropriate lock. It
 * returns a pte if it was successfully locked or NULL if it raced with
 * a transhuge insertion.
 */
for example, the following path:

do_migrate_pages->check_range->change_prot_numa->change_pte_range->ptep_set_numa

Then there is other bug reported such as:

[3557075.805696] kvm: zapping shadow pages for mmio generation wraparound
[3625477.198897] Adding 4194300k swap on /dev/mapper/os-swap.  Priority:-1 extents:1 across:4194300k FS
[3626233.759633] traps: ceilometer-agen[3417] general protection ip:7f184c3f039e sp:7fff42d5fbc0 error:0 in libc-2.17.so[7f184c372000+1b6000]
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5
[3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5
[3645969.174217] systemd-journald[1082]: Vacuuming done, freed 0 bytes
[3801900.193751] swap_free: Unused swap offset entry 00053980
[3801900.193773] BUG: Bad page map in process qemu-kvm  pte:0a730000 pmd:4d6635067
[3801900.193791] addr:00007f4d20481000 vm_flags:a0120073 anon_vma:ffff88204d1d5840 mapping:          (null) index:7f4d20481
[3801900.193820] CPU: 7 PID: 45513 Comm: qemu-kvm Not tainted 3.10.0-123.13.1.el7.x86_64 #1
[3801900.193822] Hardware name: To be filled by O.E.M. Tecal RH5885 V3/BC61BLCA, BIOS BLISV301 07/25/2014
[3801900.193824]  00007f4d20481000 00000000934bf4bb ffff88084c071998 ffffffff815e232e
[3801900.193834]  ffff88084c0719e0 ffffffff81167b28 000000000a730000 00000007f4d20481
[3801900.193839]  ffff8804d6635408 000000000a730000 00007f4d20481000 00007f4d20600000
[3801900.193844] Call Trace:
[3801900.193855]  [] dump_stack+0x19/0x1b
[3801900.193862]  [] print_bad_pte+0x1a8/0x240
[3801900.193866]  [] unmap_page_range+0x5b0/0x860
[3801900.193870]  [] unmap_single_vma+0x81/0xf0
[3801900.193873]  [] unmap_vmas+0x49/0x90
[3801900.193877]  [] exit_mmap+0xac/0x1a0
[3801900.193884]  [] mmput+0x67/0xf0
[3801900.193903]  [] vhost_dev_cleanup+0x231/0x260 [vhost_net]
[3801900.193908]  [] vhost_net_release+0x4e/0xc0 [vhost_net]
[3801900.193915]  [] __fput+0xe9/0x270
[3801900.193919]  [] ____fput+0xe/0x10
[3801900.193925]  [] task_work_run+0xc4/0xe0
[3801900.193928]  [] do_exit+0x2cb/0xa60
[3801900.193934]  [] ? __sigqueue_free.part.11+0x33/0x40
[3801900.193937]  [] ? __dequeue_signal+0x13c/0x220
[3801900.193941]  [] do_group_exit+0x3f/0xa0
[3801900.193955]  [] get_signal_to_deliver+0x1d0/0x6e0
[3801900.193961]  [] do_signal+0x57/0x600
[3801900.193965]  [] ? do_send_sig_info+0x6c/0xa0
[3801900.193969]  [] do_notify_resume+0x69/0xb0
[3801900.193974]  [] int_signal+0x12/0x17
[3801900.193976] Disabling lock debugging due to kernel taint

First , let;s check

[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5
[3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5
corresponding codes are:

__mmdrop->check_mm

static void check_mm(struct mm_struct *mm)
{
	int i;

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		long x = atomic_long_read(&mm->rss_stat.count[i]);

		if (unlikely(x))
			printk(KERN_ALERT "BUG: Bad rss-counter state "
					  "mm:%p idx:%d val:%ld\n", mm, i, x);
	}
}
The code here is to check if all the resident page frames for mm have been

reclaimed, say, if a process need to exit, he must make sure all

MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS have been

reclaimed. According to our test, the number of MM_ANONPAGES

is 5, whilc the number of MM_SWAPENTS  is -5, this is ridiculous.
then we see:

[3801900.193751] swap_free: Unused swap offset entry 00053980
According to backtrace, this should happen when:

static unsigned long zap_pte_range(struct mmu_gather *tlb,
				struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr, unsigned long end,
				struct zap_details *details)
{
	int rss[NR_MM_COUNTERS];

	do {
		//in the memory, not swap
		if (pte_present(ptent)) {
			//normal pages pte_set(0)
		}
		//swap, if it is file page, it could only be a nonlinear page.
		//else it is an anon page.
		if (pte_file(ptent)) {
			if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
				print_bad_pte(vma, addr, ptent, NULL);
		} else {
			//entry is composed of (swap_type(swap area), swap_offset(pgoff inside swap_area))
			swp_entry_t entry = pte_to_swp_entry(ptent);
			if (unlikely(!free_swap_and_cache(entry)))
				print_bad_pte(vma, addr, ptent, NULL);
		}
	}
}
The problem is that, swap_info_get(entry) in free_swap_and_cache(entry)
finds the counter of swap_map for this swap page is zero, which means,
this swap slot has already been released by others:

static struct swap_info_struct *swap_info_get(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned long offset, type;

	if (!entry.val)
		goto out;
	type = swp_type(entry);
	if (type >= nr_swapfiles)
		goto bad_nofile;
	p = swap_info[type];
	if (!(p->flags & SWP_USED))
		goto bad_device;
	offset = swp_offset(entry);
	if (offset >= p->max)
		goto bad_offset;
	if (!p->swap_map[offset])
		goto bad_free;
	spin_lock(&p->lock);
	return p;

bad_free:
	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
	goto out;
bad_offset:
	printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
	goto out;
bad_device:
	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_file, entry.val);
	goto out;
bad_nofile:
	printk(KERN_ERR "swap_free: %s%08lx\n", Bad_file, entry.val);
out:
	return NULL;
}

since its value is 00053980, composed of swap type and swap offset.

So let's start the kdump parser.

crash /usr/lib/debug/lib/module/3.10.0-123.13.1.e7.x86_64/vmlinux vmcore

according to crash dmesg, the problem is firstly triggered when enable the swap partition:

[3625477.198897] Adding 4194300k swap on /dev/mapper/os-swap.  Priority:-1 extents:1 across:4194300k FS
[3626233.759633] traps: ceilometer-agen[3417] general protection ip:7f184c3f039e sp:7fff42d5fbc0 error:0 in libc-2.17.so[7f184c372000+1b6000]
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5
[3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5
[3645969.174217] systemd-journald[1082]: Vacuuming done, freed 0 bytes
[3801900.193751] swap_free: Unused swap offset entry 00053980
[3801900.193773] BUG: Bad page map in process qemu-kvm  pte:0a730000 pmd:4d6635067

let' check swap_info array, which contains the swap areas in the system.


I've seen similar bugs reported at:

http://linux-kernel.2935.n7.nabble.com/unused-swap-offset-bad-page-map-td698180.html

however per Hugh's suggestion, this problem described above is due to incorrect setting of pte entry in meremap,

but in our 3.10  this bug has not been introduced yet, so we have to check other clues.

I just did a quick diff of 3.11-rc7/mm against 3.10, and here's 
a line in mremap which worries me.  That set_pte_at() is operating 
on anything that isn't pte_none(), so the pte_mksoft_dirty() looks 
prone to corrupt a swap entry. 

and also per Linus and Hugh's experience:

> That said, google does find "swap_free: Unused swap offset entry" 
> reports from over the years. Most of them seem to be single-bit 
> errors, though (ie when the entry is 00000100 or similar I'm more 
> inclined to blame a bit error 

Yes, historically they have usually represented either single-bit 
errors, or corruption of page tables by other kernel data.  The 
swap subsystem discovers it, but it's rarely an error of swap. 

Well, it looks like we have some trouble on finding out the crime culprit.


Let's launch the crash kdump, and we first check current running process

on each cpu, and we found the last panic process is kswapd, so:

crash> bt -a | grep kswap
PID: 476    TASK: ffff88085345c440  CPU: 27  COMMAND: "kswapd3"
#13 [ffff880852ca9e28] kswapd at ffffffff811572db
crash> set 476
    PID: 476
COMMAND: "kswapd3"
   TASK: ffff88085345c440  [THREAD_INFO: ffff880852ca8000]
    CPU: 27
  STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 476    TASK: ffff88085345c440  CPU: 27  COMMAND: "kswapd3"
 #0 [ffff880852ca96f0] machine_kexec at ffffffff810411c1
 #1 [ffff880852ca9748] crash_kexec at ffffffff810cf2b2
 #2 [ffff880852ca9818] oops_end at ffffffff815eae48
 #3 [ffff880852ca9840] die at ffffffff8101632b
 #4 [ffff880852ca9870] do_general_protection at ffffffff815ea74e
 #5 [ffff880852ca98a0] general_protection at ffffffff815ea068
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff8108a8e9  RSP: ffff880852ca9950  RFLAGS: 00010206
    RAX: 0000000000000000  RBX: ffff88204d1d5840  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: 0600380100804d3c
    RBP: ffff880852ca9950   R8: ffffffff81944ae0   R9: ffff880852ca9ba0
    R10: 00000000000000f0  R11: ffffea006b203a40  R12: ffff88204d1d5841
    R13: ffffea0078f117c0  R14: 0600380100804d3c  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff880852ca9958] page_lock_anon_vma_read at ffffffff811779c5
 #7 [ffff880852ca9988] page_referenced at ffffffff81177c47
 #8 [ffff880852ca9a00] shrink_page_list at ffffffff811541c4
 #9 [ffff880852ca9b38] shrink_inactive_list at ffffffff81154e6a
#10 [ffff880852ca9c00] shrink_lruvec at ffffffff81155965
#11 [ffff880852ca9cf8] shrink_zone at ffffffff81155d76
#12 [ffff880852ca9d50] balance_pgdat at ffffffff8115701c
#13 [ffff880852ca9e28] kswapd at ffffffff811572db
#14 [ffff880852ca9ec8] kthread at ffffffff81085aff
#15 [ffff880852ca9f50] ret_from_fork at ffffffff815f296c
crash>
according to dmesg this is a 'general protection fault',

[4401153.261568] device tap292007ca-7d entered promiscuous mode
[4401153.285726] qbr292007ca-7d: port 2(tap292007ca-7d) entered forwarding state
[4401153.285754] qbr292007ca-7d: port 2(tap292007ca-7d) entered forwarding state
[4401161.284223] general protection fault: 0000 [#1] SMP 
[4401161.284675] Modules linked in: vhost_net macvtap macvlan tun nf_conntrack_ipv6 nf_defrag_ipv6 xt_mac xt_physdev xt_multiport xt_set xt_conntrack iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw ip_set_hash_ip ip_set nfnetlink veth ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables openvswitch vxlan ip_tunnel gre 8021q garp mrp bridge stp llc bonding sg iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ixgbe i2c_i801 ses enclosure i2c_core ptp pps_core mdio mei_me lpc_ich dca mei mfd_core shpchp wmi acpi_cpufreq acpi_pad mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod lpfc crc32c_intel crc_t10dif
[4401161.289521]  crct10dif_common scsi_transport_fc ahci libahci scsi_tgt libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[4401161.290475] CPU: 27 PID: 476 Comm: kswapd3 Tainted: G    B       --------------   3.10.0-123.13.1.el7.x86_64 #1
[4401161.291334] Hardware name: To be filled by O.E.M. Tecal RH5885 V3/BC61BLCA, BIOS BLISV301 07/25/2014
[4401161.292217] task: ffff88085345c440 ti: ffff880852ca8000 task.ti: ffff880852ca8000
[4401161.293088] RIP: 0010:[]  [] down_read_trylock+0x9/0x30
[4401161.294006] RSP: 0018:ffff880852ca9950  EFLAGS: 00010206
[4401161.294462] RAX: 0000000000000000 RBX: ffff88204d1d5840 RCX: 0000000000000000
[4401161.295298] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0600380100804d3c
[4401161.296172] RBP: ffff880852ca9950 R08: ffffffff81944ae0 R09: ffff880852ca9ba0
[4401161.297061] R10: 00000000000000f0 R11: ffffea006b203a40 R12: ffff88204d1d5841
[4401161.297949] R13: ffffea0078f117c0 R14: 0600380100804d3c R15: 0000000000000001
[4401161.298841] FS:  0000000000000000(0000) GS:ffff88207fc60000(0000) knlGS:0000000000000000
[4401161.299725] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4401161.300173] CR2: 000000000154e578 CR3: 00000000018d0000 CR4: 00000000001427e0
[4401161.301057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[4401161.301944] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[4401161.302828] Stack:
[4401161.303276]  ffff880852ca9980 ffffffff811779c5 ffffea0078f117c0 0000000000000000
[4401161.304165]  ffff880852ca9bb8 ffff88204d1d5841 ffff880852ca99f8 ffffffff81177c47
[4401161.305046]  0000000000000297 ffff880852ca9da8 ffff880852ca9ac8 0000000000000000
[4401161.305936] Call Trace:
[4401161.306380]  [] page_lock_anon_vma_read+0x55/0x110
[4401161.306832]  [] page_referenced+0x1c7/0x350
[4401161.307278]  [] shrink_page_list+0x4b4/0xab0
[4401161.307724]  [] shrink_inactive_list+0x1ea/0x580
[4401161.308159]  [] shrink_lruvec+0x375/0x710
[4401161.308598]  [] shrink_zone+0x76/0x1a0
[4401161.309039]  [] balance_pgdat+0x48c/0x5e0
[4401161.309483]  [] kswapd+0x16b/0x430
[4401161.309938]  [] ? wake_up_bit+0x30/0x30
[4401161.310382]  [] ? balance_pgdat+0x5e0/0x5e0
[4401161.310849]  [] kthread+0xcf/0xe0
[4401161.311281]  [] ? kthread_create_on_node+0x140/0x140
[4401161.311765]  [] ret_from_fork+0x7c/0xb0
[4401161.312256]  [] ? kthread_create_on_node+0x140/0x140
[4401161.312728] Code: 00 00 00 ba 01 00 00 00 48 89 de e8 12 fe ff ff eb ce 48 c7 c0 f2 ff ff ff eb c5 e8 02 34 fd ff 66 90 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 
[4401161.314361] RIP  [] down_read_trylock+0x9/0x30
[4401161.314836]  RSP 


let's dis the function of page_lock_anon_vma_read and we get:

0xffffffff81177970 :   nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff81177975 : push   %rbp
0xffffffff81177976 : mov    %rsp,%rbp
0xffffffff81177979 : push   %r14
0xffffffff8117797b :        push   %r13
0xffffffff8117797d :        mov    %rdi,%r13
0xffffffff81177980 :        push   %r12
0xffffffff81177982 :        push   %rbx
0xffffffff81177983 :        mov    0x8(%rdi),%r12
0xffffffff81177987 :        mov    %r12,%rax
0xffffffff8117798a :        and    $0x3,%eax
0xffffffff8117798d :        cmp    $0x1,%rax
0xffffffff81177991 :        je     0xffffffff811779a8 
0xffffffff81177993 :        xor    %ebx,%ebx
0xffffffff81177995 :        mov    %rbx,%rax
0xffffffff81177998 :        pop    %rbx
0xffffffff81177999 :        pop    %r12
0xffffffff8117799b :        pop    %r13
0xffffffff8117799d :        pop    %r14
0xffffffff8117799f :        pop    %rbp
0xffffffff811779a0 :        retq
0xffffffff811779a1 :        nopl   0x0(%rax)
0xffffffff811779a8 :        mov    0x18(%rdi),%eax
0xffffffff811779ab :        test   %eax,%eax
0xffffffff811779ad :        js     0xffffffff81177993 
0xffffffff811779af :        mov    -0x1(%r12),%r14
0xffffffff811779b4 :        lea    -0x1(%r12),%rbx
0xffffffff811779b9 :        add    $0x8,%r14
0xffffffff811779bd :        mov    %r14,%rdi
0xffffffff811779c0 :        callq  0xffffffff8108a8e0 
0xffffffff811779c5 :        test   %eax,%eax


 let's backtrace page_lock_anon_vma_read:

0xffffffff81177983 :        mov    0x8(%rdi),%r12

 
  

0xffffffff81177987 :        mov    %r12,%rax
anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
struct anon_vma *page_lock_anon_vma_read(struct page *page)
struct page {
	/* First double word block */
	unsigned long flags;		/* Atomic flags, some possibly
					 * updated asynchronously */
	struct address_space *mapping;	/* If low bit clear, points to
so r12 comes from unsigned long anon_mapping, thus r12-1 is value of anon_vma.

then 

0xffffffff811779af :        mov    -0x1(%r12),%r14
which is 

root_anon_vma = ACCESS_ONCE(anon_vma->root);
because

1.anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
   PAGE_MAPPING_ANON equals 1.  

2. we acuires value from r12 pointer.

the r14 is the value of root_anon_vma


then

0xffffffff811779b9 :        add    $0x8,%r14

0xffffffff811779bd :        mov    %r14,%rdi
struct anon_vma {
	struct anon_vma *root;		/* Root of this anon_vma tree */
	struct rw_semaphore rwsem;	/* W: modification, R: walking the list */
so r14 and rdi is the address of rwsem embedded in root_anon_vma.


and the panic place for EIP:

RIP: 0010:[]  [] down_read_trylock+0x9/0x30

down_read_trylock(&root_anon_vma->rwsem)

it is trying to access edi, 

0xffffffff8108a8e9 :       mov    (%rdi),%rax

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
	long result, tmp;
	asm volatile("# beginning __down_read_trylock\n\t"
		     "  mov          %0,%1\n\t"
		     "1:\n\t"
		     "  mov          %1,%2\n\t"
		     "  add          %3,%2\n\t"
		     "  jle	     2f\n\t"
		     LOCK_PREFIX "  cmpxchg  %2,%0\n\t"
		     "  jnz	     1b\n\t"
		     "2:\n\t"
		     "# ending __down_read_trylock\n\t"
		     : "+m" (sem->count), "=&a" (result), "=&r" (tmp)
		     : "i" (RWSEM_ACTIVE_READ_BIAS)
		     : "memory", "cc");
	return result >= 0 ? 1 : 0;
}
so the panic IP is 'move %0,%1'  thus trying to put sem->count to result, but

sem is invalid, according to register context,we know

edi is 0600380100804d3c,  so it means sem is 0600380100804d3c, this is absolutely wrong.


since the r14 comes from r12, let's check r12-1 value , which is the value of 

struct anon_vma *root_anon_vma, and r12 is ffff88204d1d5841:

crash> rd ffff88204d1d5840
ffff88204d1d5840:  0600380100804d34                    4M...8..
After plus 8 it just right equals the bad addr of r14: 0600380100804d3c 

WTF? obviously there is memory overlap around ffff88204d1d5841, let's check this struct:

struct anon_vma {
  root = 0x600380100804d34,
  rwsem = {
    count = 1099511654660,
    wait_lock = {
      raw_lock = {
        {
          head_tail = 0,
          tickets = {
            head = 0,
            tail = 0
          }
        }
      }
    },
    wait_list = {
      next = 0x0,
      prev = 0x996e000000000020
    }
  },
  refcount = {
    counter = 680270649
  },
  rb_root = {
    rb_node = 0x84b9568a801bb81c
  }
}
which is quite confusing, let's rd this address range:

crash> struct anon_vma -o
struct anon_vma {
   [0] struct anon_vma *root;
   [8] struct rw_semaphore rwsem;
  [40] atomic_t refcount;
  [48] struct rb_root rb_root;
}
SIZE: 56

crash> rd ffff88204d1d5840 38
ffff88204d1d5840:  0600380100804d34 0000010000006904   4M...8...i......
ffff88204d1d5850:  0000000000000000 0000000000000000   ................
ffff88204d1d5860:  996e000000000020 84b92543288c1b39    .....n.9..(C%..
ffff88204d1d5870:  84b9568a801bb81c 0000568a801bb81c   .....V.......V..
so anon_vma is corrupted.


Since anon_vma is highly connected with swap, so we check the counter of swap_map again,

let's figure out how swap_map increases.

initial value for swap_map entry is zero, when a process's pages are swaped out

by shrink_list, a spare swap slot must be found out to store the page content.

static unsigned long shrink_page_list(struct list_head *page_list)
{
	spin_lock(&swap_lock);
	while (!list_empty(page_list)) {
		//it is a anon page needed to be swaped out
		if (PageAnon(page) && !PageSwapCache(page)) {
			//try to find a free swap entry
			//check each swap area
			for (type = swap_list.next) {
				//try to find a free swap entry inside current swap area
				si = swap_info[type];
				spin_lock(&si->lock);

				spin_unlock(&swap_lock);
				//this is extremely obscure to find a free page slot.
				offset = scan_swap_map(si, SWAP_HAS_CACHE);
				spin_unlock(&si->lock);
			}
			SetPageDirty(page);
		}
		//update pte with swap_entry
		try_to_unmap(page);
		//it is dirty set by previous step
		pageout(page);
	}
}


Besides swap error, there is also a suspected warning on mm->rss_conter:

[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5
[3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5
It is too hard for us to track the root cause because the problem maybe a chain reaction for other

bugs. And according to search result, here is an possible fix for this issue:

http://www.serverphorums.com/read.php?12,950567 
which says:

I saw a bunch of similar errors on 3.14.x up to and including 3.14.4,
running Java (Tomcat) and Postgres on Xen PV. Have not seen it since
"mm: use paravirt friendly ops for NUMA hinting ptes" landed in 3.14.5.

402e194dfc5b38d99f9c65b86e2666b29adebf8c in stable,
29c7787075c92ca8af353acd5301481e6f37082f upstream
commit 29c7787075c92ca8af353acd5301481e6f37082f
Author: Mel Gorman 
Date:   Fri Apr 18 15:07:21 2014 -0700

    mm: use paravirt friendly ops for NUMA hinting ptes

    David Vrabel identified a regression when using automatic NUMA balancing
    under Xen whereby page table entries were getting corrupted due to the
    use of native PTE operations.  Quoting him

        Xen PV guest page tables require that their entries use machine
        addresses if the preset bit (_PAGE_PRESENT) is set, and (for
        successful migration) non-present PTEs must use pseudo-physical
        addresses.  This is because on migration MFNs in present PTEs are
        translated to PFNs (canonicalised) so they may be translated back
        to the new MFN in the destination domain (uncanonicalised).

        pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
        set and clear the _PAGE_PRESENT bit using pte_set_flags(),
        pte_clear_flags(), etc.

        In a Xen PV guest, these functions must translate MFNs to PFNs
        when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
        _PAGE_PRESENT.

    His suggested fix converted p[te|md]_[set|clear]_flags to using
    paravirt-friendly ops but this is overkill.  He suggested an alternative
    of using p[te|md]_modify in the NUMA page table operations but this is
    does more work than necessary and would require looking up a VMA for
    protections.

    This patch modifies the NUMA page table operations to use paravirt
    friendly operations to set/clear the flags of interest.  Unfortunately
    this will take a performance hit when updating the PTEs on
    CONFIG_PARAVIRT but I do not see a way around it that does not break
    Xen.

This bug can break XEN on numa, so it can also break KVM, since it is a general bug.

你可能感兴趣的:([知其然不知其所以然-18] handle_mm_fault, bad page等错误现场分析)