[CPU Series] Context Switch

Article structure

  • Concepts
  • Commands
  • Symptoms
  • References

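Contents

  1. Context switch
    1. The Linux context_switch comment
    2. Two kinds of context switch: CSWCH and NVCSWCH
  2. Fundamentals
    1. Commands
    2. Symptoms
    3. context_switch case by case
      1. Processes
      2. Threads
      3. Interrupts
      4. System calls
    4. References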

 

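I. Context Switch

A context switch mainly switches two things: the mm (the address space, via switch_mm) and the CPU register state (via switch_to, which covers the registers and the stack).

The comment on the Linux context_switch function says as much: "context_switch - switch to the new MM and the new thread's register state."

  1. Source: https://elixir.bootlin.com/linux/latest/source/kernel/sched/core.c#L3324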
/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);

	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * kernel -> kernel   lazy + transfer active
	 *   user -> kernel   lazy + mmgrab() active
	 *
	 * kernel ->   user   switch + mmdrop() active
	 *   user ->   user   switch
	 */
	if (!next->mm) {                                // to kernel
		enter_lazy_tlb(prev->active_mm, next);

		next->active_mm = prev->active_mm;
		if (prev->mm)                           // from user
			mmgrab(prev->active_mm);
		else
			prev->active_mm = NULL;
	} else {                                        // to user
		membarrier_switch_mm(rq, prev->active_mm, next->mm);
		/*
		 * sys_membarrier() requires an smp_mb() between setting
		 * rq->curr / membarrier_switch_mm() and returning to userspace.
		 *
		 * The below provides this either through switch_mm(), or in
		 * case 'prev->active_mm == next->mm' through
		 * finish_task_switch()'s mmdrop().
		 */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

		if (!prev->mm) {                        // from kernel
			/* will mmdrop() in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

	prepare_lock_switch(rq, next, rf);

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);
	barrier();

	return finish_task_switch(prev);
}
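
The four-row comment table inside context_switch (kernel/user to kernel/user) is easier to follow when replayed outside the kernel. The following is a toy Python model, not kernel code: the Task class and the switch_address_space function are hypothetical names, an mm is represented by a plain string, and reference counting (mmgrab/mmdrop) is reduced to a label in the returned action.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Toy stand-in for task_struct; only the two mm pointers are modelled."""
    name: str
    mm: Optional[str]          # None for a kernel thread (it owns no address space)
    active_mm: Optional[str]   # address space actually loaded while the task runs


def switch_address_space(prev: Task, next_: Task) -> str:
    """Replay the mm handling of context_switch() and name the case that was hit."""
    if next_.mm is None:                      # to kernel
        # A kernel thread borrows the previous address space (lazy TLB mode),
        # so no page-table switch happens.
        next_.active_mm = prev.active_mm
        if prev.mm is not None:               # from user
            return "lazy + mmgrab"
        prev.active_mm = None                 # from kernel: hand the borrowed mm over
        return "lazy + transfer"
    # to user: the page tables really are switched
    if prev.mm is None:                       # from kernel
        prev.active_mm = None                 # borrowed mm is dropped after the switch
        return "switch + mmdrop"
    return "switch"                           # user -> user


if __name__ == "__main__":
    user_a = Task("a", mm="mm_a", active_mm="mm_a")
    kthread = Task("kworker", mm=None, active_mm=None)
    print(switch_address_space(user_a, kthread), kthread.active_mm)
```

The model only mirrors the branch structure shown in the source above; what switch_mm_irqs_off actually does to the hardware is left out.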

Depending on what triggers it, a context switch falls into one of two categories: CSWCH (voluntary context switch) and NVCSWCH (non-voluntary context switch). pidstat -w reports them as cswch/s and nvcswch/s, and man pidstat describes them as follows:

  1. CSWCH (voluntary context switch): "Total number of voluntary context switches the task made per second. A voluntary context switch occurs when a task blocks because it requires a resource that is unavailable." In other words, a resource the process asked for (I/O, memory, CPU, etc.) was not available at that moment.
  2. NVCSWCH (non-voluntary context switch): "Total number of non voluntary context switches the task made per second. A involuntary context switch takes place when a task executes for the duration of its time slice and then is forced to relinquish the processor." In other words, the process was forced to stop while it was still running, for example when a higher-priority process becomes runnable while normal-priority ones are running.
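
To see the two counters move for a single process, the short Python sketch below reads them through getrusage: ru_nvcsw is the voluntary count and ru_nivcsw the involuntary one. The helper names (context_switch_counts, measure, sleepy, busy) are made up for this illustration, and the actual numbers depend on the machine and its load, so none are shown.

```python
import resource
import time


def context_switch_counts():
    """Return (voluntary, involuntary) context switches of this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def measure(work):
    """Run work() and return how many switches of each kind it caused."""
    before = context_switch_counts()
    work()
    after = context_switch_counts()
    return after[0] - before[0], after[1] - before[1]


def sleepy():
    # Each sleep blocks the task, so it gives up the CPU voluntarily.
    for _ in range(20):
        time.sleep(0.001)


def busy():
    # Pure computation never blocks; any switch here is forced by the scheduler.
    total = 0
    for i in range(2_000_000):
        total += i
    return total


if __name__ == "__main__":
    print("sleepy (voluntary, involuntary):", measure(sleepy))
    print("busy   (voluntary, involuntary):", measure(busy))
```

On a typical Linux system the sleepy workload should mostly raise the voluntary counter, while the busy loop raises the involuntary counter only if something else competes for the same CPU.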

 

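II. Fundamentals

  1. Commands: vmstat and pidstat (vmstat shows context switches for the whole system; pidstat -w shows them per process)
  2. Symptoms: too many context switches drive up sys (kernel-mode) CPU time; in other words, the CPU spends its cycles in the kernel rather than on application work
    1. If CSWCH is high, find out which resource shortage is making tasks block so often and hurting performance
    2. If NVCSWCH is high, find out why so many runnable (R state) tasks are waiting to be scheduled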
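
The numbers behind both tools come from procfs: the system-wide total is the ctxt line of /proc/stat, and the per-task totals are the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields of /proc/<pid>/status. A minimal Python sketch of turning two samples into a per-second rate follows; the function names are invented for this example, and the one-second sampling interval is an arbitrary choice.

```python
import os
import re
import time


def parse_system_ctxt(proc_stat_text):
    """Extract the cumulative context-switch count from /proc/stat contents."""
    for line in proc_stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def parse_task_ctxt(status_text):
    """Extract (voluntary, nonvoluntary) from /proc/<pid>/status contents."""
    fields = dict(re.findall(r"^(\w+):[ \t]+(\d+)$", status_text, flags=re.M))
    return (int(fields["voluntary_ctxt_switches"]),
            int(fields["nonvoluntary_ctxt_switches"]))


def per_second(before, after, interval_s):
    """Turn two cumulative samples into a rate, as the cs and cswch/s columns do."""
    return (after - before) / interval_s


def read(path):
    with open(path) as f:
        return f.read()


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = parse_system_ctxt(read("/proc/stat"))
    time.sleep(1.0)
    second = parse_system_ctxt(read("/proc/stat"))
    print("system cs/s:", per_second(first, second, 1.0))
    print("this process (vol, nonvol):", parse_task_ctxt(read("/proc/self/status")))
```

Keeping the parsers separate from the file reads makes it possible to check them against captured text on a machine without procfs.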
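  3. context_switch case by case
    1. Processes: after the scheduler's schedule function picks the next task, it enters context_switch (switch_mm and switch_to)
      1. Background: the process-switching scheme Intel designed for x86 stores each process's hardware context in a TSS (every process has its own TSS), so switching processes means switching TSSs. Each process has its own TR value (kept in the PCB); TR is a 16-bit selector that points to a descriptor in the GDT, and that descriptor locates the TSS. This scheme is complex and slow, so Linux does not use it. Instead, Linux keeps one per-CPU TSS and a TR that always points to it, and a process switch fills the incoming process's hardware context into that TSS.
        1. x86 hardware scheme: there is one TR register, and each process has its own TR value and TSS. A process switch loads the process's TR value into TR and then loads the registers saved in the TSS it points to into the CPU;
        2. Linux: the TR register points to a single TSS, and each process has its own thread_struct. A process switch leaves TR alone and only loads the registers saved in the process's thread_struct into the TSS (just a few registers, not the full register set that the x86 hardware scheme switches)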
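      2. switch_mm: loads task_struct->mm->pgd into CR3, the page directory base register (PDBR). Writing CR3 flushes the TLB (the non-global entries, when PCID is not in use); see ULK chapter 9.
      3. switch_to: mainly switches the registers and the stack. As the background above explains, apart from switch_mm the Linux scheme amounts to filling the incoming process's hardware context into the TSS, and task_struct->thread (a struct thread_struct) holds the register values that go there. So once the new process's thread_struct has been applied to the TSS, the process switch can be considered done (see ULK chapter 3).
      4. Some additional notes: Linux makes little use of the SS register (SS for every process and thread simply points to user_ds, the user-mode data segment). The user stack is described by SP (the stack pointer) and BP (the base pointer). The kernel stack is task_struct->stack; on 32-bit and 64-bit machines it is allocated 8 KB and 16 KB by default. At the top of this region sits pt_regs, which holds the common, general-purpose user-mode CPU registers. The CPU has more registers than struct pt_regs defines; pt_regs covers the ones that, for example, a function called during a system call is likely to use. That is why Linux defines the SAVE_ALL macro: it saves them all in one go, so each entry path does not have to hand-write the register saves before entering its assembly routine. The remaining registers, such as the floating-point registers and some control registers, are saved in full only at scheduling time, into task_struct->thread. At the bottom of the region sits thread_info, which holds architecture-specific data, and the kernel stack proper lies in between. So when switch_to saves the register state, it essentially locates task_struct->stack and saves the registers into the pt_regs on that stack; the kernel stack is task_struct->stack, and the user stack is described by the SP and BP registers;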
        1. #ifdef __i386__
          struct pt_regs {
          	unsigned long bx;
          	unsigned long cx;
          	unsigned long dx;
          	unsigned long si;
          	unsigned long di;
          	unsigned long bp;
          	unsigned long ax;
          	unsigned long ds;
          	unsigned long es;
          	unsigned long fs;
          	unsigned long gs;
          	unsigned long orig_ax;
          	unsigned long ip;
          	unsigned long cs;
          	unsigned long flags;
          	unsigned long sp;
          	unsigned long ss;
          };
          #else 
          struct pt_regs {
          	unsigned long r15;
          	unsigned long r14;
          	unsigned long r13;
          	unsigned long r12;
          	unsigned long bp;
          	unsigned long bx;
          	unsigned long r11;
          	unsigned long r10;
          	unsigned long r9;
          	unsigned long r8;
          	unsigned long ax;
          	unsigned long cx;
          	unsigned long dx;
          	unsigned long si;
          	unsigned long di;
          	unsigned long orig_ax;
          	unsigned long ip;
          	unsigned long cs;
          	unsigned long flags;
          	unsigned long sp;
          	unsigned long ss;
          /* top of stack page */
          };
          #endif 
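
The 64-bit branch of struct pt_regs above is just 21 consecutive machine words, which makes field offsets easy to compute. The Python sketch below mirrors it with ctypes purely to inspect the layout. It assumes unsigned long is 8 bytes, as on x86-64 Linux, and therefore uses c_uint64 explicitly; PtRegs64 and offset_of are names chosen for this example.

```python
import ctypes

# Field order copied from the x86-64 struct pt_regs shown above.
X86_64_PT_REGS_FIELDS = [
    "r15", "r14", "r13", "r12", "bp", "bx",
    "r11", "r10", "r9", "r8",
    "ax", "cx", "dx", "si", "di",
    "orig_ax", "ip", "cs", "flags", "sp", "ss",
]


class PtRegs64(ctypes.Structure):
    """Layout-only mirror of the 64-bit pt_regs (assumes 8-byte unsigned long)."""
    _fields_ = [(name, ctypes.c_uint64) for name in X86_64_PT_REGS_FIELDS]


def offset_of(field):
    """Byte offset of a field from the start of the structure."""
    return getattr(PtRegs64, field).offset


if __name__ == "__main__":
    print("sizeof(pt_regs) =", ctypes.sizeof(PtRegs64))
    for name in ("r15", "ax", "orig_ax", "ip", "ss"):
        print(f"{name:8s} at offset {offset_of(name)}")
```

The last five fields (ip, cs, flags, sp, ss) occupy the highest addresses, which is the same order in which the hardware pushes its frame on an interrupt.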

           

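    2. Threads: after scheduling, if the next task is a thread of the same process, the two share virtual memory, so only switch_to is needed; if it is a thread of a different process, the switch is the same as a process context_switch
    3. Interrupts: textbooks divide real-mode interrupts into internal and external ones; in protected mode they are usually divided into exceptions, software interrupts, and hardware interrupts. Intel divides them into exceptions (faults, traps, aborts, and programmed int instructions) and interrupts (external or hardware interrupts, whose handling consists of the interrupt handler plus softirqs). See part 3 of the [CPU Series], "Interrupts".
    4. System calls: 32-bit uses the software interrupt int 0x80; 64-bit uses the syscall instruction. The SAVE_ALL macro used on the system call path only saves the common user registers into pt_regs on the kernel stack (task_struct->stack). That is enough because the kernel function runs in the same task context as the caller, so there is no need to replace the hardware context as a process or thread switch does (that is, applying task_struct->thread to the per-CPU TSS that TR points to)
      1. Overall flow on a 32-bit machine: glibc wraps each system call in a function callable from a high-level language (for 32-bit the wrappers are in sysdep.h). The call number (an index into the system call table, which is essentially an array of function pointers) goes in eax, and the wrapper ends up at the ENTER_KERNEL macro, which is just the software interrupt int 0x80. The CPU then goes through the IDT and GDT to find the interrupt service routine (ISR), which is entry_INT80_32 in Source 3 below. The ISR invokes SAVE_ALL to save the user registers that the system call function may need into pt_regs on the kernel stack, then calls do_syscall_32_irqs_on (which looks up the target function in the system call table), and finally returns from the interrupt.
      2. SAVE_ALL: https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/entry_32.S#L309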
        1. .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 skip_gs=0 unwind_espfix=0
          	cld
          .if \skip_gs == 0
          	PUSH_GS
          .endif
          	pushl	%fs
          
          	pushl	%eax
          	movl	$(__KERNEL_PERCPU), %eax
          	movl	%eax, %fs
          .if \unwind_espfix > 0
          	UNWIND_ESPFIX_STACK
          .endif
          	popl	%eax
          
          	FIXUP_FRAME
          	pushl	%es
          	pushl	%ds
          	pushl	\pt_regs_ax
          	pushl	%ebp
          	pushl	%edi
          	pushl	%esi
          	pushl	%edx
          	pushl	%ecx
          	pushl	%ebx
          	movl	$(__USER_DS), %edx
          	movl	%edx, %ds
          	movl	%edx, %es
          .if \skip_gs == 0
          	SET_KERNEL_GS %edx
          .endif
          	/* Switch to kernel stack if necessary */
          .if \switch_stacks > 0
          	SWITCH_TO_KERNEL_STACK
          .endif
          .endm

           

      3. 32-bit system call source
        1. Source 1: ENTER_KERNEL
          /* Linux takes system call arguments in registers:
          	syscall number	%eax	     call-clobbered
          	arg 1		%ebx	     call-saved
          	arg 2		%ecx	     call-clobbered
          	arg 3		%edx	     call-clobbered
          	arg 4		%esi	     call-saved
          	arg 5		%edi	     call-saved
          	arg 6		%ebp	     call-saved
          ......
          */
          #define DO_CALL(syscall_name, args)                           \
              PUSHARGS_##args                               \
              DOARGS_##args                                 \
              movl $SYS_ify (syscall_name), %eax;                          \
              ENTER_KERNEL                                  \
              POPARGS_##args

           

        2. Source 2: ENTER_KERNEL is simply the software interrupt int 0x80
          # define ENTER_KERNEL int $0x80

           

        3. Source 3: pushl and SAVE_ALL save the user-mode registers into pt_regs
          ENTRY(entry_INT80_32)
                  ASM_CLAC
                  pushl   %eax                    /* pt_regs->orig_ax */
                  SAVE_ALL pt_regs_ax=$-ENOSYS    /* save rest */
                  movl    %esp, %eax
                  call    do_syscall_32_irqs_on
          .Lsyscall_32_done:
          ......
          .Lirq_return:
          	INTERRUPT_RETURN

           

        4. Source 4: the system call table
          #define ia32_sys_call_table sys_call_table
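
The 32-bit path described above (call number saved as orig_ax, ax preset to -ENOSYS, then a lookup in an array of function pointers) can be mimicked in a few lines of Python. This is a toy model only: the table size, the call numbers 1 and 3, and the handlers sys_write and sys_getpid with their return values are all hypothetical, and empty slots are modelled as None rather than as a real "not implemented" handler.

```python
import errno


# Toy handlers; every handler receives the six argument registers.
def sys_getpid(*_unused):
    return 4242                      # made-up pid


def sys_write(fd, buf, count, *_unused):
    return min(len(buf), count)      # pretend everything was written


# The "system call table": an array of function pointers indexed by call number.
SYS_CALL_TABLE = [None] * 8
SYS_CALL_TABLE[1] = sys_write
SYS_CALL_TABLE[3] = sys_getpid

# 32-bit argument registers for arg 1..6, as listed in the DO_CALL comment.
ARG_REGS = ("bx", "cx", "dx", "si", "di", "bp")


def make_regs(nr, *args):
    """Build a toy pt_regs as entry_INT80_32 does: orig_ax = number, ax = -ENOSYS."""
    regs = {"orig_ax": nr, "ax": -errno.ENOSYS}
    padded = list(args) + [0] * (len(ARG_REGS) - len(args))
    regs.update(zip(ARG_REGS, padded))
    return regs


def do_syscall(regs):
    """Look up the handler by number; leave ax at -ENOSYS if there is none."""
    nr = regs["orig_ax"]
    handler = SYS_CALL_TABLE[nr] if 0 <= nr < len(SYS_CALL_TABLE) else None
    if handler is not None:
        regs["ax"] = handler(*(regs[r] for r in ARG_REGS))
    return regs


if __name__ == "__main__":
    print(do_syscall(make_regs(1, 1, b"hello", 5))["ax"])
    print(do_syscall(make_regs(99))["ax"])
```

Presetting ax to -ENOSYS before the lookup means an out-of-range number needs no special error path: the return value is already in place.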
          

           

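      4. 64-bit system call source: the 64-bit system call wrappers are also in sysdep.h; the call number goes in rax, and the registers are likewise saved once the syscall instruction enters the kernel
        1. Source 1: syscall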
          /* The Linux/x86-64 kernel expects the system call parameters in
             registers according to the following table:
              syscall number	rax
              arg 1		rdi
              arg 2		rsi
              arg 3		rdx
              arg 4		r10
              arg 5		r8
              arg 6		r9
          ......
          */
          #define DO_CALL(syscall_name, args)					      \
            lea SYS_ify (syscall_name), %rax;				  \
            syscall

           

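        2. Source 2 (arch/x86/entry/entry_64.S): the entry code runs in kernel mode, so the pushq instructions operate on the kernel stack; it pushes the user-mode registers right at the start, laying them out as a pt_regs structure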
          ENTRY(entry_SYSCALL_64)
                  /* Construct struct pt_regs on stack */
                  pushq   $__USER_DS                      /* pt_regs->ss */
                  pushq   PER_CPU_VAR(rsp_scratch)        /* pt_regs->sp */
                  pushq   %r11                            /* pt_regs->flags */
                  pushq   $__USER_CS                      /* pt_regs->cs */
                  pushq   %rcx                            /* pt_regs->ip */
                  pushq   %rax                            /* pt_regs->orig_ax */
                  pushq   %rdi                            /* pt_regs->di */
                  pushq   %rsi                            /* pt_regs->si */
                  pushq   %rdx                            /* pt_regs->dx */
                  pushq   %rcx                            /* pt_regs->cx */
                  pushq   $-ENOSYS                        /* pt_regs->ax */
                  pushq   %r8                             /* pt_regs->r8 */
                  pushq   %r9                             /* pt_regs->r9 */
                  pushq   %r10                            /* pt_regs->r10 */
                  pushq   %r11                            /* pt_regs->r11 */
                  sub     $(6*8), %rsp                    /* pt_regs->bp, bx, r12-15 not saved */
                  movq    PER_CPU_VAR(current_task), %r11
                  testl   $_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
                  jnz     entry_SYSCALL64_slow_path
          ......
          entry_SYSCALL64_slow_path:
                  /* IRQs are off. */
                  SAVE_EXTRA_REGS
                  movq    %rsp, %rdi
                  call    do_syscall_64           /* returns with IRQs disabled */
          return_from_SYSCALL_64:
          	RESTORE_EXTRA_REGS
          	TRACE_IRQS_IRETQ
          	movq	RCX(%rsp), %rcx
          	movq	RIP(%rsp), %r11
          	movq	R11(%rsp), %r11
          ......
          syscall_return_via_sysret:
          	/* rcx and r11 are already restored (see code above) */
          	RESTORE_C_REGS_EXCEPT_RCX_R11
          	movq	RSP(%rsp), %rsp
          	USERGS_SYSRET64
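
The pushq sequence in entry_SYSCALL_64 looks like it runs backwards compared with struct pt_regs. That is expected: the stack grows downward, so the first value pushed ends up at the highest address, which is the last field of the struct. The Python sketch below replays the pushes from the listing above and checks the resulting layout; build_frame is a hypothetical helper, and the six slots reserved by sub $(6*8), %rsp are represented by the names of the registers that are stored there later.

```python
# Order of the pushq instructions in entry_SYSCALL_64, first push first.
PUSH_ORDER = [
    "ss", "sp", "flags", "cs", "ip", "orig_ax",
    "di", "si", "dx", "cx", "ax",
    "r8", "r9", "r10", "r11",
]

# Slots reserved by `sub $(6*8), %rsp`, listed from highest to lowest address.
RESERVED = ["bx", "bp", "r12", "r13", "r14", "r15"]

# struct pt_regs field order (lowest address first), as in the x86-64 definition.
PT_REGS_FIELDS = [
    "r15", "r14", "r13", "r12", "bp", "bx",
    "r11", "r10", "r9", "r8",
    "ax", "cx", "dx", "si", "di",
    "orig_ax", "ip", "cs", "flags", "sp", "ss",
]


def build_frame(push_order, reserved):
    """Return slot names from lowest to highest address after all pushes."""
    stack = []                       # index 0 = first push = highest address
    for name in push_order:
        stack.append(name)           # each push lands one word lower
    stack.extend(reserved)           # reserved space sits below everything pushed
    return list(reversed(stack))     # flip to ascending-address order


if __name__ == "__main__":
    frame = build_frame(PUSH_ORDER, RESERVED)
    print("matches struct pt_regs:", frame == PT_REGS_FIELDS)
    print("orig_ax offset:", frame.index("orig_ax") * 8)
```

Because the frame matches the struct exactly, the kernel can pass the stack pointer to do_syscall_64 as a struct pt_regs pointer, which is what movq %rsp, %rdi does in the slow path shown above.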

           

      5. The syscall instruction
        1. https://stackoverflow.com/questions/10583891/is-syscall-an-instruction-on-x86-64
        2. https://man7.org/linux/man-pages/man2/syscall.2.html
        3. https://stackoverflow.com/questions/2535989/what-are-the-calling-conventions-for-unix-linux-system-calls-on-i386-and-x86-6
  4. References
    1. Linux man page: https://www.man7.org/linux/man-pages/man1/pidstat.1.html
    2. Understanding the Linux Kernel, 3rd Edition (ULK)
    3. Linux内核源码情景分析 (Linux kernel source code scenario analysis)
    4. 趣谈Linux, lessons 9, 14, 16, and 22
    5. https://blog.csdn.net/gatieme/article/details/51872659
    6. Linux性能优化实战, CPU performance part

 
