Linux system call flow analysis

  • x86



2. How system calls are implemented on x86


movq $57, %rax




asmlinkage long sys_open(const char __user *filename, int flags, umode_t mode);

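  It takes three arguments. What rules govern how those arguments are passed? For system calls, the calling convention is not the same as the ordinary calling convention used between user-space functions. Section A.2.1 of the x86-64 ABI document describes it: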

The Linux AMD64 kernel uses internally the same calling conventions as user-level applications (see section 3.2.3 for details). User-level applications that like to call system calls should use the functions from the C library. The interface between the C library and the Linux kernel is the same as for the user-level applications with the following differences:

User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.

A system-call is done via the syscall instruction. The kernel clobbers registers %rcx and %r11 but preserves all other registers except %rax.

The number of the syscall has to be passed in register %rax.

System-calls are limited to six arguments, no argument is passed directly on the stack.

Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.

Only values of class INTEGER or class MEMORY are passed to the kernel.


  • Arguments
  • System call number
  • System call instruction
  • Return value and error codes
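
The four points above can be seen together in a small user-space sketch. It assumes an x86-64 Linux target and a GCC-compatible compiler; raw_syscall3 is a hypothetical helper name, not a kernel or libc function.

```c
#include <sys/syscall.h>  /* SYS_getpid, SYS_write */
#include <unistd.h>       /* getpid(), used only for comparison */

/*
 * Hypothetical helper (x86-64 Linux only): issue a system call that
 * takes up to three arguments, following the ABI text quoted above.
 */
static long raw_syscall3(long nr, long a0, long a1, long a2)
{
    long ret;

    __asm__ volatile (
        "syscall"                      /* system call instruction */
        : "=a"(ret)                    /* result comes back in %rax */
        : "a"(nr),                     /* system call number in %rax */
          "D"(a0), "S"(a1), "d"(a2)    /* arguments in %rdi, %rsi, %rdx */
        : "rcx", "r11", "memory");     /* kernel clobbers %rcx and %r11 */

    return ret;
}
```

Calling raw_syscall3(SYS_getpid, 0, 0, 0) should return the same value as getpid(), and a call that fails returns a negative errno value directly in the result instead of setting errno.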


long sys_open(const char *pathname, int flags, mode_t mode);

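  Declared this way, the compiled code would treat the function as an ordinary user-space function and pass its arguments in the order %rdi, %rsi, %rdx, %rcx, %r8, %r9, which does not match the kernel interface, where the fourth argument arrives in %r10 rather than %rcx. The kernel therefore marks system call entry points with asmlinkage, a kernel macro built on gcc function attributes rather than a gcc keyword. On x86-64 the mismatch itself is bridged by the entry code, which moves %r10 into %rcx before calling the C function, as the register setup comment in section 2.1.1 notes. On 32-bit x86, asmlinkage expands to __attribute__((regparm(0))), so the function reads all of its arguments from the stack: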

asmlinkage long sys_open(const char *pathname, int flags, mode_t mode);


2.1 System call arguments

2.1.1 Argument order

/*
 * System call entry. Upto 6 arguments in registers are supported.
 *
 * SYSCALL does not save anything on the stack and does not change the
 * stack pointer.
 */
/*
 * Register setup:
 * rax  system call number
 * rdi  arg0
 * rcx  return address for syscall/sysret, C arg3
 * rsi  arg1
 * rdx  arg2
 * r10  arg3  (--> moved to rcx for C)
 * r8   arg4
 * r9   arg5
 * r11  eflags for syscall/sysret, temporary for C
 * r12-r15,rbp,rbx saved by C code, not touched.
 *
 * Interrupts are off on entry.
 * Only called from user space.
 *
 * XXX  if we had a free scratch register we could save the RSP into the stack frame
 *      and report it properly in ps. Unfortunately we haven't.
 *
 * When user can change the frames always force IRET. That is because
 * it deals with uncanonical addresses better. SYSRET has trouble
 * with them due to bugs in both AMD and Intel CPUs.
 */

2.1.2 Number of arguments


2.1.3 Argument types

  Argument types are limited to the INTEGER and MEMORY classes. Section 3.2.3 (Parameter Passing) of the x86-64 ABI defines them as follows:

INTEGER This class consists of integral types that fit into one of the general purpose registers.
MEMORY This class consists of types that will be passed and returned in memory via the stack.

2.2 Return value and error codes
  On return from a system call, %rax holds the result. A value in the range -4095 to -1 means the call failed, and the value is -errno.
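
As an illustration of this rule, the sketch below splits a raw return value into a result and an error code, roughly the way a C library wrapper would before setting errno. The function name decode_syscall_result is hypothetical.

```c
/*
 * Hypothetical helper: interpret the raw value a system call leaves
 * in %rax. Values in [-4095, -1] are errors encoded as -errno;
 * everything else is a valid result.
 */
static long decode_syscall_result(long raw, int *err_out)
{
    if (raw >= -4095 && raw <= -1) {
        *err_out = (int)-raw;   /* recover the positive errno value */
        return -1;              /* conventional failure return */
    }
    *err_out = 0;               /* success: no error */
    return raw;
}
```

Only the small window just below zero is reserved for errors, so large results such as high addresses returned by mmap remain valid even though they look negative when read as a signed long.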

2.3 System call number


2.4 System call instruction



  On 32-bit x86, the function system_call is the single entry point for all system calls; it is bound to interrupt vector 0x80:

#define SYSCALL_VECTOR 0x80
set_system_trap_gate(SYSCALL_VECTOR, &system_call);



movq $57, %rax


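  Because execution moves from the user-mode stack to the kernel-mode stack, a stack switch is required: the CPU pushes the user-mode stack state (oldss, oldesp) onto the kernel stack and then transfers control to system_call.

  Inside system_call:

  All registers are pushed onto the kernel stack; ebx, ecx, and edx hold the program's arguments. Using eax as the index, the address of the requested system call is looked up in sys_call_table, which holds the addresses of all system call functions.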


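  • From user mode to kernel mode, arguments are passed in registers
  • Kernel functions read them from the stack, that is, they are passed on the stack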
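.macro SAVE_ALL
	pushl %fs			# save segment registers
	pushl %es
	pushl %ds
	pushl %eax			# save general-purpose registers
	pushl %ebp
	pushl %edi
	pushl %esi
	pushl %edx			# arg2
	pushl %ecx			# arg1
	pushl %ebx			# arg0, ends up at the top of the stack
	movl $(__USER_DS), %edx		# load the flat data segment into ds/es
	movl %edx, %ds
	movl %edx, %es
	movl $(__KERNEL_PERCPU), %edx	# fs selects the per-CPU segment
	movl %edx, %fs
.endm

	# excerpt from system_call
	pushl %eax			# save orig_eax
	GET_THREAD_INFO(%ebp)		# system call tracing in operation / emulation
	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
	jnz syscall_trace_entry
	cmpl $(nr_syscalls), %eax	# reject out-of-range system call numbers
	jae syscall_badsys
	call *sys_call_table(,%eax,4)	# index the table by eax, 4 bytes per entry
	movl %eax,PT_EAX(%esp)		# store the return value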


	.long sys_restart_syscall	/* 0 - old "setup()" system call, used for restarting */
	.long sys_exit
	.long ptregs_fork
	.long sys_read
	.long sys_write
	.long sys_open		/* 5 */
	.long sys_close
	/* ... entries 7 through 334 omitted ... */
	.long sys_rt_tgsigqueueinfo	/* 335 */
	.long sys_perf_event_open
	.long sys_recvmmsg
	.long sys_fanotify_init
	.long sys_fanotify_mark
	.long sys_prlimit64		/* 340 */
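
The dispatch step shown above (save the registers, bounds-check eax, index sys_call_table, store the result back) can be modeled in plain user-space C. This is only a sketch of the control flow, not kernel code: every name in it (saved_regs, model_table, model_system_call, and the two toy handlers) is hypothetical, and the -38 value for an out-of-range number assumes the Linux value of ENOSYS.

```c
#include <stddef.h>

/* Hypothetical model of the registers SAVE_ALL leaves on the kernel stack. */
struct saved_regs {
    long bx, cx, dx;   /* arg0, arg1, arg2 */
    long ax;           /* system call number on entry, result on exit */
};

typedef long (*syscall_fn)(long, long, long);

/* Two toy handlers standing in for sys_* functions. */
static long model_add(long a, long b, long c)   { return a + b + c; }
static long model_first(long a, long b, long c) { (void)b; (void)c; return a; }

/* Stand-in for sys_call_table: one function pointer per number. */
static const syscall_fn model_table[] = { model_add, model_first };

#define MODEL_NR_SYSCALLS (sizeof model_table / sizeof model_table[0])
#define MODEL_ENOSYS 38    /* assumed: ENOSYS on Linux */

static void model_system_call(struct saved_regs *regs)
{
    /* Unsigned compare mirrors "cmpl $(nr_syscalls), %eax; jae". */
    unsigned long nr = (unsigned long)regs->ax;

    if (nr >= MODEL_NR_SYSCALLS) {
        regs->ax = -MODEL_ENOSYS;      /* the syscall_badsys path */
        return;
    }
    /* Mirrors "call *sys_call_table(,%eax,4)" and the store to PT_EAX. */
    regs->ax = model_table[nr](regs->bx, regs->cx, regs->dx);
}
```

With regs set to bx=1, cx=2, dx=3, ax=0, the model calls model_add and leaves 6 in ax. Because the comparison is unsigned, a negative number in ax is rejected by the same check as a number that is too large.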


