Linux Kernel Journey / Zhang Kaijie: System Call Analysis (2)


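    Part 1 of this series, System Call Analysis (1), introduced the concept of a system call, analyzed the early mechanism of entering the kernel through a software interrupt (int 0x80), and then covered two mechanisms that make system calls respond faster: vsyscall and vDSO.

    This article looks at the instruction-level optimization of system calls, the fast system call instructions: sysenter/sysexit on 32-bit and syscall/sysret on 64-bit, together with what the Linux kernel does to support them. It also traces a user-space program that makes system calls, written and run on a linux-4.20 kernel with glibc-2.23.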



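03 sysenter / sysexit

The vsyscall and vDSO mechanisms described earlier speed up system calls at the mechanism level, but they leave the underlying cost untouched: a system call made through a software interrupt still pays for a full interrupt-gate privilege-level switch (IDT lookup, descriptor loads, and privilege checks).

To address this, Intel x86 CPUs starting with the Pentium II (Family 6, Model 3, Stepping 3) support the fast system call instructions sysenter/sysexit.

The Intel® 64 and IA-32 Architectures Software Developer's Manual (combined volumes) describes the sysenter instruction as follows: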

Executes a fast call to a level 0 system procedure or routine. SYSENTER is a companion instruction to SYSEXIT. The instruction is optimized to provide the maximum performance for system calls from user code running at privilege level 3 to operating system or executive procedures running at privilege level 0.

When executed in IA-32e mode, the SYSENTER instruction transitions the logical processor to 64-bit mode; otherwise, the logical processor remains in protected mode.

Prior to executing the SYSENTER instruction, software must specify the privilege level 0 code segment and code entry point, and the privilege level 0 stack segment and stack pointer by writing values to the following MSRs:

• IA32_SYSENTER_CS (MSR address 174H) — The lower 16 bits of this MSR are the segment selector for the privilege level 0 code segment. This value is also used to determine the segment selector of the privilege level 0 stack segment (see the Operation section). This value cannot indicate a null selector.

• IA32_SYSENTER_EIP (MSR address 176H) — The value of this MSR is loaded into RIP (thus, this value references the first instruction of the selected operating procedure or routine). In protected mode, only bits 31:0 are loaded.

• IA32_SYSENTER_ESP (MSR address 175H) — The value of this MSR is loaded into RSP (thus, this value contains the stack pointer for the privilege level 0 stack). This value cannot represent a non-canonical address. In protected mode, only bits 31:0 are loaded.

These MSRs can be read from and written to using RDMSR/WRMSR. The WRMSR instruction ensures that the IA32_SYSENTER_EIP and IA32_SYSENTER_ESP MSRs always contain canonical addresses.


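The key points are:

  (1) sysenter is paired with sysexit; together they give user-mode code (Ring3) a fast path into system procedures that run in kernel mode (Ring0).

  (2) When executed in IA-32e mode, sysenter switches the logical processor to 64-bit mode; otherwise the logical processor stays in protected mode.

  (3) Before sysenter is executed, software must write the following MSRs (Model Specific Registers) to specify the Ring0 code segment, the code entry point, the Ring0 stack segment, and the stack pointer:

    - IA32_SYSENTER_CS (174H): the segment selector of the Ring0 code to run; the selector of the Ring0 stack segment is derived from it as well

    - IA32_SYSENTER_EIP (176H): the entry address of the Ring0 code to run

    - IA32_SYSENTER_ESP (175H): the stack pointer used by the Ring0 code

  (4) These MSRs are read and written with rdmsr/wrmsr.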


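The analysis below is based on the linux-2.6.39 kernel.


3.1 System call initialization


  Starting from the kernel boot path: start_kernel() -> check_bugs() -> identify_boot_cpu() -> sysenter_setup() & enable_sep_cpu()


3.1.1 Page initialization and mapping


sysenter_setup() runs first, in support of the vDSO mechanism mentioned earlier: it loads the vdso32-sysenter.so shared object into the vsyscall page. The function is in arch/x86/vdso/vdso32-setup.c: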

    

 
   

int __init sysenter_setup(void)
{
    void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
    const void *vsyscall;
    size_t vsyscall_len;

    vdso32_pages[0] = virt_to_page(syscall_page);

#ifdef CONFIG_X86_32
    gate_vma_init();
#endif

    if (vdso32_syscall()) {
        vsyscall = &vdso32_syscall_start;
        vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
    } else if (vdso32_sysenter()) {
        vsyscall = &vdso32_sysenter_start;
        vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
    } else {
        vsyscall = &vdso32_int80_start;
        vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
    }

    memcpy(syscall_page, vsyscall, vsyscall_len);
    relocate_vdso(syscall_page);

    return 0;
}


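What sysenter_setup() does:

  (1) Calls get_zeroed_page() to obtain a zero-filled physical page; the return value is that page's linear address in the kernel address space.

  (2) Uses the virt_to_page macro to get the struct page that manages the syscall_page address and stores it in vdso32_pages[0].

  (3) Checks which instructions are supported and picks the image accordingly; the priority order is syscall > sysenter > int80.

  (4) In the sysenter case, assigns the address of vdso32_sysenter_start to vsyscall, copies the image into the page with memcpy(), and finally applies relocations with relocate_vdso().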
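The selection order in step (3) can be sketched in plain C. This is only an illustration of the if / else-if chain, not kernel code: the enum and the function name demo_pick_vdso are hypothetical, and the two booleans stand in for the kernel's vdso32_syscall() and vdso32_sysenter() checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three vDSO images embedded in the kernel. */
enum demo_vdso_image {
    DEMO_VDSO_INT80,
    DEMO_VDSO_SYSENTER,
    DEMO_VDSO_SYSCALL
};

/* Mirrors the priority in sysenter_setup(): syscall > sysenter > int80. */
static enum demo_vdso_image demo_pick_vdso(bool has_syscall, bool has_sysenter)
{
    if (has_syscall)
        return DEMO_VDSO_SYSCALL;
    if (has_sysenter)
        return DEMO_VDSO_SYSENTER;
    return DEMO_VDSO_INT80;      /* always available as the fallback */
}

/* Small self-check showing the three outcomes. */
void demo_pick_vdso_check(void)
{
    assert(demo_pick_vdso(true, true) == DEMO_VDSO_SYSCALL);
    assert(demo_pick_vdso(false, true) == DEMO_VDSO_SYSENTER);
    assert(demo_pick_vdso(false, false) == DEMO_VDSO_INT80);
}
```

Whichever image is picked, the rest of the function treats it the same way: copy it into the page, then relocate it.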

arch/x86/vdso/vdso32.S shows that vdso32_sysenter_start is simply the contents of vdso32-sysenter.so:

 
   

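vdso32_sysenter_start:
    .incbin "arch/x86/vdso/vdso32-sysenter.so"

  In other words, vdso32-sysenter.so is copied into the page. The arch_setup_additional_pages function mentioned in the vDSO part of System Call Analysis (1) is what maps the contents of that page into user space.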


3.1.2 Initializing the relevant MSRs


  The enable_sep_cpu() function in arch/x86/vdso/vdso32-setup.c initializes the relevant MSRs:

 
   

void enable_sep_cpu(void)
{
    int cpu = get_cpu();
    struct tss_struct *tss = &per_cpu(init_tss, cpu);

    if (!boot_cpu_has(X86_FEATURE_SEP)) {
        put_cpu();
        return;
    }

    tss->x86_tss.ss1 = __KERNEL_CS;
    tss->x86_tss.sp1 = sizeof(struct tss_struct) + (unsigned long) tss;
    wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
    wrmsr(MSR_IA32_SYSENTER_ESP, tss->x86_tss.sp1, 0);
    wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) ia32_sysenter_target, 0);
    put_cpu();
}

Key points:

 (1) The MSR_IA32_SYSENTER_* macros are defined in arch/x86/include/asm/msr-index.h, where the corresponding MSR addresses can be seen:

#define MSR_IA32_SYSENTER_CS     0x00000174
#define MSR_IA32_SYSENTER_ESP    0x00000175
#define MSR_IA32_SYSENTER_EIP    0x00000176

 (2) __KERNEL_CS is written to MSR_IA32_SYSENTER_CS.

 (3) The stack address tss->x86_tss.sp1 is written to MSR_IA32_SYSENTER_ESP.

 (4) ia32_sysenter_target (the entry routine for the sysenter instruction) is written to MSR_IA32_SYSENTER_EIP.


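3.2 What the sysenter and sysexit instructions do


When Ring3 code executes sysenter, the CPU does the following:

  1. Loads the value of SYSENTER_CS_MSR into the cs register.

  2. Loads the value of SYSENTER_EIP_MSR into the eip register.

  3. Loads the value of SYSENTER_CS_MSR plus 8 (the Ring0 stack segment descriptor) into the ss register.

  4. Loads the value of SYSENTER_ESP_MSR into the esp register.

  5. Switches the privilege level to Ring0.

  6. Clears the VM flag in EFLAGS if it is set.

  7. Starts executing the specified Ring0 code.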


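When the Ring0 code has finished and executes sysexit to return to Ring3, the CPU does the following:

  1. Loads the value of SYSENTER_CS_MSR plus 16 (the Ring3 code segment descriptor) into the cs register.

  2. Loads the value of edx into the eip register.

  3. Loads the value of SYSENTER_CS_MSR plus 24 (the Ring3 stack segment descriptor) into the ss register.

  4. Loads the value of ecx into the esp register.

  5. Switches the privilege level to Ring3.

  6. Resumes executing the Ring3 code.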
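The selector arithmetic in the two lists can be worked through in a few lines of C. This is a sketch of the arithmetic only; the demo_* names are hypothetical, and the value 0x60 for __KERNEL_CS is an assumption taken from the usual 32-bit x86 Linux GDT layout. The manual additionally clears the low two bits on entry and forces the requested privilege level to 3 on exit, which the helpers reproduce.

```c
#include <assert.h>
#include <stdint.h>

/* Selectors derived from IA32_SYSENTER_CS when sysenter enters Ring0. */
static uint16_t demo_sysenter_cs(uint16_t msr_cs)
{
    return msr_cs & 0xFFFC;                        /* RPL bits cleared */
}

static uint16_t demo_sysenter_ss(uint16_t msr_cs)
{
    return (uint16_t)(demo_sysenter_cs(msr_cs) + 8);
}

/* Selectors derived when a 32-bit sysexit returns to Ring3 (RPL forced to 3). */
static uint16_t demo_sysexit_cs(uint16_t msr_cs)
{
    return (uint16_t)((msr_cs + 16) | 3);
}

static uint16_t demo_sysexit_ss(uint16_t msr_cs)
{
    return (uint16_t)((msr_cs + 24) | 3);
}

/* Worked example with an assumed __KERNEL_CS of 0x60 (32-bit x86 Linux). */
void demo_selector_check(void)
{
    const uint16_t kernel_cs = 0x60;

    assert(demo_sysenter_cs(kernel_cs) == 0x60);   /* kernel code segment  */
    assert(demo_sysenter_ss(kernel_cs) == 0x68);   /* kernel stack segment */
    assert(demo_sysexit_cs(kernel_cs) == 0x73);    /* user code segment    */
    assert(demo_sysexit_ss(kernel_cs) == 0x7b);    /* user stack segment   */
}
```

Because all four selectors come from one MSR value at fixed offsets, the kernel code, kernel data, user code, and user data descriptors have to sit in four consecutive GDT slots for sysenter/sysexit to work.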


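3.3 System call handling with sysenter


3.3.1 sysenter system calls in the linux-2.6.39 kernel


As shown above, IA32_SYSENTER_EIP is loaded with ia32_sysenter_target, the entry address of the sysenter system call handler.

  The code of ia32_sysenter_target, the handler that sysenter jumps to, is in arch/x86/ia32/ia32entry.S. The part that actually dispatches the system call is: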


 
   

sysenter_dispatch:
    call    *ia32_sys_call_table(,%rax,8)

...

ia32_sys_call_table:
    .quad sys_restart_syscall
    .quad sys_exit
    .quad stub32_fork
    .quad sys_read
    .quad sys_write
    .quad compat_sys_open       /* 5 */


The handler indexes the system call table with the system call number in rax (8 bytes per entry) and calls the matching system call routine directly.


3.3.2 sysenter system calls in the linux-4.20 kernel


In the linux-4.20 kernel, IA32_SYSENTER_EIP is loaded with the address of entry_SYSENTER_32.

  entry_SYSENTER_32 is in arch/x86/entry/entry_32.S:

ENTRY(entry_SYSENTER_32)
    pushfl
    pushl   %eax
    BUG_IF_WRONG_CR3 no_user_check=1
    SWITCH_TO_KERNEL_CR3 scratch_reg=%eax
    popl    %eax
    popfl
    movl    TSS_entry2task_stack(%esp), %esp
    
.Lsysenter_past_esp:
    pushl   $__USER_DS      /* pt_regs->ss */
    pushl   %ebp            /* pt_regs->sp (stashed in bp) */
    pushfl              /* pt_regs->flags (except IF = 0) */
    orl $X86_EFLAGS_IF, (%esp)  /* Fix IF */
    pushl   $__USER_CS      /* pt_regs->cs */
    pushl   $0          /* pt_regs->ip = 0 (placeholder) */
    pushl   %eax            /* pt_regs->orig_ax */
    SAVE_ALL pt_regs_ax=$-ENOSYS    /* save rest, stack already switched */
    testl   $X86_EFLAGS_NT|X86_EFLAGS_AC|X86_EFLAGS_TF, PT_EFLAGS(%esp)
    jnz .Lsysenter_fix_flags
    
.Lsysenter_flags_fixed:
    TRACE_IRQS_OFF
    movl    %esp, %eax
    call    do_fast_syscall_32
...
    sysexit
...

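What entry_SYSENTER_32() does:

  (1) As noted above, sysenter loads esp from SYSENTER_ESP_MSR, but that value is the address of the sysenter_stack rather than the task's own stack. The instruction movl TSS_entry2task_stack(%esp), %esp fixes this up so that esp points to the current task's kernel stack.

  (2) SAVE_ALL and the pushl instructions push the relevant registers onto the stack to save the user context.

  (3) Calls do_fast_syscall_32 -> do_syscall_32_irqs_on(), which looks up the handler in the system call table and calls it.

  (4) Finally pops the saved registers to restore the context and returns with sysexit.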





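04 syscall / sysret

  On 32-bit, Intel introduced the fast system call instructions sysenter/sysexit while AMD introduced syscall/sysret; in 64-bit mode, syscall is the instruction used across the board.

  The Intel® 64 and IA-32 Architectures Software Developer's Manual (combined volumes) documents the syscall instruction:


Figure 4-1: the syscall instruction (image not available)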

SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX). (The WRMSR instruction ensures that the IA32_LSTAR MSR always contain a canonical address.)

SYSCALL also saves RFLAGS into R11 and then masks RFLAGS using the IA32_FMASK MSR (MSR address C0000084H); specifically, the processor clears in RFLAGS every bit corresponding to a bit that is set in the IA32_FMASK MSR.

SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the IA32_STAR MSR. However, the CS and SS descriptor caches are not loaded from the descriptors (in GDT or LDT) referenced by those selectors. Instead, the descriptor caches are loaded with fixed values. See the Operation section for details. It is the responsibility of OS software to ensure that the descriptors (in GDT or LDT) referenced by those selector values correspond to the fixed values loaded into the descriptor caches; the SYSCALL instruction does not ensure this correspondence.

The SYSCALL instruction does not save the stack pointer (RSP). If the OS system-call handler will change the stack pointer, it is the responsibility of software to save the previous value of the stack pointer. This might be done prior to executing SYSCALL, with software restoring the stack pointer with the instruction following SYSCALL (which will be executed after SYSRET). Alternatively, the OS system-call handler may save the stack pointer and restore it before executing SYSRET.


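4.1 Tracing a system call


  On a linux-4.20 kernel with glibc-2.23, we write a user-space program that makes system calls and follow the call path with gdb. The steps are:


  (1) Write a program that makes system calls: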

 
   

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    pid_t ret;    /* holds the file descriptor returned by open() */

    ret = open("./1.txt", O_RDWR);
    close(ret);
}

  (2) Compile it into an executable:

$ gcc -o open -g -static open.c
$ file open
open: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, BuildID[sha1]=c7781966fa4acbecf2b489a3eea145912f3f81d0, not stripped

  (3) Trace it with gdb:

(gdb) disas main
Dump of assembler code for function main:
    0x00000000004009ae <+0>:     push   %rbp
    0x00000000004009af <+1>:     mov    %rsp,%rbp
    0x00000000004009b2 <+4>:     sub    $0x10,%rsp
    0x00000000004009b6 <+8>:     mov    $0x2,%esi
    0x00000000004009bb <+13>:    mov    $0x4a0ec4,%edi
    0x00000000004009c0 <+18>:    mov    $0x0,%eax
    0x00000000004009c5 <+23>:    callq  0x43f130 <open64>
    0x00000000004009ca <+28>:    mov    %eax,-0x4(%rbp)
    0x00000000004009cd <+31>:    mov    -0x4(%rbp),%eax
    0x00000000004009d0 <+34>:    mov    %eax,%edi
    0x00000000004009d2 <+36>:    callq  0x43f3e0
    0x00000000004009d7 <+41>:    mov    $0x0,%eax
    0x00000000004009dc <+46>:    leaveq
    0x00000000004009dd <+47>:    retq
End of assembler dump.

(gdb) disas 0x43f130
Dump of assembler code for function open64:
    0x000000000043f130 <+0>:     cmpl   $0x0,0x28e085(%rip)        # 0x6cd1bc <__libc_multiple_threads>
    0x000000000043f137 <+7>:     jne    0x43f14d <open64+29>
    0x000000000043f139 <+0>:     mov    $0x2,%eax
    0x000000000043f13e <+5>:     syscall
    0x000000000043f140 <+7>:     cmp    $0xfffffffffffff001,%rax
    0x000000000043f146 <+13>:    jae    0x4441c0 <__syscall_error>
    0x000000000043f14c <+19>:    retq
    0x000000000043f14d <+29>:    sub    $0x8,%rsp
    0x000000000043f151 <+33>:    callq  0x442690 <__libc_enable_asynccancel>
    0x000000000043f156 <+38>:    mov    %rax,(%rsp)
    0x000000000043f15a <+42>:    mov    $0x2,%eax
    0x000000000043f15f <+47>:    syscall
    0x000000000043f161 <+49>:    mov    (%rsp),%rdi
    0x000000000043f165 <+53>:    mov    %rax,%rdx
    0x000000000043f168 <+56>:    callq  0x4426f0 <__libc_disable_asynccancel>
    0x000000000043f16d <+61>:    mov    %rdx,%rax
    0x000000000043f170 <+64>:    add    $0x8,%rsp
    0x000000000043f174 <+68>:    cmp    $0xfffffffffffff001,%rax
    0x000000000043f17a <+74>:    jae    0x4441c0 <__syscall_error>
    0x000000000043f180 <+80>:    retq
End of assembler dump.

  (4) The open call in main lands in <open64>, which loads 0x2 into eax and executes the syscall instruction. glibc defines open64 in sysdeps/posix/open64.c:

 
   

#include <fcntl.h>
#include <stdarg.h>
#include <sysdep-cancel.h>

/* Open FILE with access OFLAG.  If O_CREAT or O_TMPFILE is in OFLAG,
   a third argument is the file protection.  */
int
__libc_open64 (const char *file, int oflag, ...)
{
    int mode = 0;

    if (__OPEN_NEEDS_MODE (oflag))
    {
        va_list arg;
        va_start (arg, oflag);
        mode = va_arg (arg, int);
        va_end (arg);
    }

    if (SINGLE_THREAD_P)
        return __libc_open (file, oflag | O_LARGEFILE, mode);

    int oldtype = LIBC_CANCEL_ASYNC ();

    int result = __libc_open (file, oflag | O_LARGEFILE, mode);

    LIBC_CANCEL_RESET (oldtype);

    return result;
}
weak_alias (__libc_open64, __open64)
libc_hidden_weak (__open64)
weak_alias (__libc_open64, open64)

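  (5) This in turn calls __libc_open, which glibc defines in sysdeps/unix/sysv/linux/generic/open.c. Note that this generic version goes through openat, while the binary traced above issues system call number 2 (open on x86-64) directly, so the C path from here on is best read as an illustration of how glibc's system call macros expand:

int
__libc_open (const char *file, int oflag, ...)
{
    int mode = 0;

    if (__OPEN_NEEDS_MODE (oflag))
    {
        va_list arg;
        va_start (arg, oflag);
        mode = va_arg (arg, int);
        va_end (arg);
    }

    return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag, mode);
}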

  (6) The call ends in the SYSCALL_CANCEL macro, which is defined in glibc at sysdeps/unix/sysdep.h:

#define __SYSCALL4(name, a1, a2, a3, a4) \
    INLINE_SYSCALL (name, 4, a1, a2, a3, a4)

#define __SYSCALL_NARGS_X(a,b,c,d,e,f,g,h,n,...) n
#define __SYSCALL_NARGS(...) \
    __SYSCALL_NARGS_X (__VA_ARGS__,7,6,5,4,3,2,1,0,)
#define __SYSCALL_CONCAT_X(a,b)     a##b
#define __SYSCALL_CONCAT(a,b)       __SYSCALL_CONCAT_X (a, b)
#define __SYSCALL_DISP(b,...) \
    __SYSCALL_CONCAT (b,__SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)

#define __SYSCALL_CALL(...) __SYSCALL_DISP (__SYSCALL, __VA_ARGS__)

#define SYSCALL_CANCEL(...) \
({                                                      \
    long int sc_ret;                                    \
    if (SINGLE_THREAD_P)                                \
        sc_ret = __SYSCALL_CALL (__VA_ARGS__);          \
    else                                                \
    {                                                   \
        int sc_cancel_oldtype = LIBC_CANCEL_ASYNC ();   \
        sc_ret = __SYSCALL_CALL (__VA_ARGS__);          \
        LIBC_CANCEL_RESET (sc_cancel_oldtype);          \
    }                                                   \
    sc_ret;                                             \
})

Expanding these macros step by step:

SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag, mode);
-> __SYSCALL_CALL (openat, AT_FDCWD, file, oflag, mode);
-> __SYSCALL_DISP (__SYSCALL, openat, AT_FDCWD, file, oflag, mode);
-> __SYSCALL_CONCAT(__SYSCALL, 4)(openat, AT_FDCWD, file, oflag, mode)
-> __SYSCALL_CONCAT_X(__SYSCALL, 4)(openat, AT_FDCWD, file, oflag, mode)
-> __SYSCALL4(openat, AT_FDCWD, file, oflag, mode)
-> INLINE_SYSCALL(openat, 4, AT_FDCWD, file, oflag, mode)
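The step that turns five macro arguments into the digit 4 is easy to miss, so here is a minimal standalone sketch of the same counting trick. The DEMO_* macros copy the shape of glibc's __SYSCALL_NARGS and __SYSCALL_DISP, while the demo_sys0/2/4 functions are hypothetical stand-ins for __SYSCALL0/2/4 and simply report how many arguments followed the name.

```c
#include <assert.h>

/* The caller's arguments push the descending numbers to the right, so the
   ninth slot (n) ends up holding "argument count minus one". */
#define DEMO_NARGS_X(a, b, c, d, e, f, g, h, n, ...) n
#define DEMO_NARGS(...) DEMO_NARGS_X(__VA_ARGS__, 7, 6, 5, 4, 3, 2, 1, 0, )

/* Two-level concat so that DEMO_NARGS(...) is expanded before pasting. */
#define DEMO_CONCAT_X(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_X(a, b)
#define DEMO_DISP(b, ...) DEMO_CONCAT(b, DEMO_NARGS(__VA_ARGS__))(__VA_ARGS__)
#define DEMO_CALL(...) DEMO_DISP(demo_sys, __VA_ARGS__)

/* Hypothetical handlers: each returns the number of arguments after name. */
static int demo_sys0(int name)
{
    (void)name;
    return 0;
}

static int demo_sys2(int name, int a1, int a2)
{
    (void)name; (void)a1; (void)a2;
    return 2;
}

static int demo_sys4(int name, int a1, int a2, int a3, int a4)
{
    (void)name; (void)a1; (void)a2; (void)a3; (void)a4;
    return 4;
}

void demo_nargs_check(void)
{
    assert(DEMO_NARGS(1, 2, 3, 4, 5) == 4);       /* name + 4 arguments */
    assert(DEMO_CALL(39) == 0);                   /* picks demo_sys0 */
    assert(DEMO_CALL(1, 10, 20) == 2);            /* picks demo_sys2 */
    assert(DEMO_CALL(257, -100, 0, 2, 0) == 4);   /* picks demo_sys4 */
}
```

The count excludes the first argument because that slot carries the system call name, which is why openat with four real arguments dispatches to __SYSCALL4.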

  (7) From INLINE_SYSCALL onward the macro definitions depend on the hardware and the OS. For x86-64 Linux they are in glibc at sysdeps/unix/sysv/linux/x86_64/sysdep.h:

# define INLINE_SYSCALL(name, nr, args...) \
({                                                                      \
    unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args);  \
    if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, )))      \
    {                                                                   \
        __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, ));             \
        resultvar = (unsigned long int) -1;                             \
    }                                                                   \
    (long int) resultvar;                                               \
})

# define INTERNAL_SYSCALL(name, err, nr, args...) \
    INTERNAL_SYSCALL_NCS (__NR_##name, err, nr, ##args)

Expanding these macros step by step:


INLINE_SYSCALL(openat, 4, AT_FDCWD, file, oflag, mode)
-> INTERNAL_SYSCALL(openat,  , 4, AT_FDCWD, file, oflag, mode)
-> INTERNAL_SYSCALL_NCS(__NR_openat,  , 4, AT_FDCWD, file, oflag, mode )

  (8) After this chain of expansions we finally arrive at INTERNAL_SYSCALL_NCS:

# define INTERNAL_SYSCALL_NCS(name, err, nr, args...) \
({                                                                          \
    unsigned long int resultvar;                                            \
    LOAD_ARGS_##nr (args)                                                   \
    LOAD_REGS_##nr                                                          \
    asm volatile (                                                          \
    "syscall\n\t"                                                           \
    : "=a" (resultvar)                                                      \
    : "0" (name) ASM_ARGS_##nr : "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
    (long int) resultvar;                                                   \
})

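  (9) LOAD_ARGS_##nr expands the arguments in args, LOAD_REGS_##nr places each one in its designated register, and the inline assembly then executes the syscall instruction to make the system call.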
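To make the last step concrete, here is a minimal sketch of a zero-argument system call issued with inline assembly, in the same style as INTERNAL_SYSCALL_NCS. The function name demo_syscall0 is hypothetical. The assembly path is only compiled on x86-64 Linux; on any other target the sketch falls back to libc's syscall() wrapper, which is an assumption made here purely so the example still builds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue a system call that takes no arguments and return the raw result. */
static long demo_syscall0(long nr)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;

    __asm__ volatile ("syscall"
                      : "=a" (ret)      /* result comes back in rax       */
                      : "0" (nr)        /* system call number goes in rax */
                      : "rcx", "r11", "cc", "memory");
                      /* rcx and r11 are overwritten by the instruction */
    return ret;
#else
    return syscall(nr);                 /* fallback: let libc make the call */
#endif
}

void demo_syscall_check(void)
{
    /* getpid takes no arguments, so no argument registers are involved. */
    assert(demo_syscall0(SYS_getpid) == (long)getpid());
}
```

Calls with arguments follow the same pattern with extra input constraints; glibc's LOAD_REGS_##nr and ASM_ARGS_##nr macros generate those for each argument count.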


4.2 syscall initialization

    The analysis is based on the linux-4.20 kernel source.

syscall is initialized on the kernel boot path: start_kernel() -> trap_init() -> cpu_init() -> syscall_init()

    syscall_init() is in arch/x86/kernel/cpu/common.c:

 
   

void syscall_init(void)
{
    wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
    wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
    ...
}

As the source shows, syscall_init() initializes the relevant MSRs:

  (1) Writes the kernel-mode cs into bits 32 ~ 47 of MSR_STAR and the user-mode cs (__USER32_CS) into bits 48 ~ 63.

  (2) Writes the entry address of entry_SYSCALL_64 into MSR_LSTAR.
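The bit layout written to MSR_STAR can be checked with a short sketch. The demo_* helpers are hypothetical, and the selector values 0x10 for __KERNEL_CS and 0x23 for __USER32_CS are assumptions based on the usual x86-64 Linux GDT layout. The derivation rules (add 8 or 16, clear or set the low two bits) follow the Operation sections of SYSCALL and SYSRET in the Intel manual.

```c
#include <assert.h>
#include <stdint.h>

/* Compose the 64-bit value that wrmsr(MSR_STAR, low = 0, high) writes. */
static uint64_t demo_star_value(uint16_t user32_cs, uint16_t kernel_cs)
{
    uint32_t high = ((uint32_t)user32_cs << 16) | kernel_cs;

    return (uint64_t)high << 32;        /* low 32 bits are written as 0 */
}

/* Selectors loaded by syscall, derived from STAR[47:32]. */
static uint16_t demo_syscall_cs(uint64_t star)
{
    return (uint16_t)((star >> 32) & 0xFFFC);
}

static uint16_t demo_syscall_ss(uint64_t star)
{
    return (uint16_t)(((star >> 32) & 0xFFFF) + 8);
}

/* Selectors loaded by a 64-bit sysret, derived from STAR[63:48]. */
static uint16_t demo_sysret64_cs(uint64_t star)
{
    return (uint16_t)((((star >> 48) & 0xFFFF) + 16) | 3);
}

static uint16_t demo_sysret64_ss(uint64_t star)
{
    return (uint16_t)((((star >> 48) & 0xFFFF) + 8) | 3);
}

void demo_star_check(void)
{
    /* Assumed x86-64 Linux selectors: __USER32_CS = 0x23, __KERNEL_CS = 0x10. */
    uint64_t star = demo_star_value(0x23, 0x10);

    assert(star == 0x0023001000000000ULL);
    assert(demo_syscall_cs(star) == 0x10);    /* kernel code segment  */
    assert(demo_syscall_ss(star) == 0x18);    /* kernel stack segment */
    assert(demo_sysret64_cs(star) == 0x33);   /* 64-bit user code     */
    assert(demo_sysret64_ss(star) == 0x2b);   /* user stack segment   */
}
```

This also suggests why the user-mode field holds __USER32_CS rather than the 64-bit user code selector: a 64-bit sysret adds 16 to that base to reach the 64-bit user code segment.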


4.3 Executing syscall


  Executing syscall jumps to entry_SYSCALL_64, which is in arch/x86/entry/entry_64.S:

ENTRY(entry_SYSCALL_64)
    UNWIND_HINT_EMPTY

    swapgs
    movq   %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
    movq   PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    pushq  $__USER_DS              /* pt_regs->ss */
    pushq  PER_CPU_VAR(cpu_tss_rw + TSS_sp2)   /* pt_regs->sp */
    pushq  %r11                    /* pt_regs->flags */
    pushq  $__USER_CS              /* pt_regs->cs */
    pushq  %rcx                    /* pt_regs->ip */
GLOBAL(entry_SYSCALL_64_after_hwframe)
    pushq  %rax                    /* pt_regs->orig_ax */

    PUSH_AND_CLEAR_REGS rax=$-ENOSYS

    TRACE_IRQS_OFF

    /* IRQs are off. */
    movq   %rax, %rdi
    movq   %rsp, %rsi
    call   do_syscall_64
    ...

  (1) Saves the user context by pushing the relevant register values onto the stack. On entry the registers hold:

    - rax system call number

    - rcx return address

    - r11 saved rflags (note: r11 is callee-clobbered register in C ABI)

    - rdi arg0

    - rsi arg1

    - rdx arg2

    - r10 arg3 (needs to be moved to rcx to conform to C ABI)

    - r8 arg4

    - r9 arg5

  (2) Calls do_syscall_64 to continue; it is in arch/x86/entry/common.c:

 
   

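__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
    ...
    nr = syscall_trace_enter(regs);

    nr &= __SYSCALL_MASK;
    if (likely(nr < NR_syscalls)) {
        nr = array_index_nospec(nr, NR_syscalls);
        regs->ax = sys_call_table[nr](regs);
    }

    syscall_return_slowpath(regs);
}

The system call number nr is passed in by the entry code (copied from rax) and may be updated by syscall_trace_enter(). do_syscall_64 then looks up entry nr in sys_call_table, runs that system call service routine, and stores its return value in regs->ax.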
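The table lookup at the heart of do_syscall_64 can be modelled in user space. This is a simplified sketch, not kernel code: struct demo_regs, the two toy handlers, and demo_do_syscall are all hypothetical, and the speculation barrier that array_index_nospec() provides in the kernel is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Cut-down stand-in for struct pt_regs: result plus two argument registers. */
struct demo_regs {
    long ax;
    long di;
    long si;
};

typedef long (*demo_sys_fn)(const struct demo_regs *);

/* Toy "system calls" that read their arguments from the saved registers. */
static long demo_sys_add(const struct demo_regs *regs)
{
    return regs->di + regs->si;
}

static long demo_sys_neg(const struct demo_regs *regs)
{
    return -regs->di;
}

static const demo_sys_fn demo_table[] = { demo_sys_add, demo_sys_neg };
#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/* Bounds-check nr, call through the table, and store the result in ax. */
static void demo_do_syscall(unsigned long nr, struct demo_regs *regs)
{
    regs->ax = -ENOSYS;     /* same default the entry code pre-loads into rax */
    if (nr < DEMO_NR_SYSCALLS)
        regs->ax = demo_table[nr](regs);
}

void demo_dispatch_check(void)
{
    struct demo_regs regs = { 0, 2, 3 };

    demo_do_syscall(0, &regs);
    assert(regs.ax == 5);
    demo_do_syscall(1, &regs);
    assert(regs.ax == -2);
    demo_do_syscall(99, &regs);         /* out of range: table is not touched */
    assert(regs.ax == -ENOSYS);
}
```

The out-of-range case shows why the entry code stores -ENOSYS as the initial rax value: an invalid number skips the table call and that default becomes the return value.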

  (3) When everything is done, the entry code returns to user mode with USERGS_SYSRET64:

ENTRY(entry_SYSCALL_64)
    ...
    USERGS_SYSRET64
END(entry_SYSCALL_64)

#define USERGS_SYSRET64 \
    swapgs;             \
    sysretq;

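    This article introduced and analyzed the fast system call instructions on 32-bit and 64-bit. By tracing a user-space program as it makes a system call, and by reading the parts of the linux-2.6.39 and linux-4.20 kernel sources that support fast system calls, we followed how a system call is executed and what the kernel does to support the fast path.

    The next article will add a system call to the Linux-5.0-rc2 kernel, build a "system call log collection system" on top of it, and sum up this analysis of system calls.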



