Linux System Calls from Start to Finish, Continued...

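In the previous getpid experiment, I found that the getpid function only enters the system call the first time it runs. Later calls return immediately, apparently without making a system call at all.

So let's first trace the system call path when int $0x80 is invoked directly.

The function is as follows: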

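#include <stdio.h>
#include <sys/types.h>

int GetpidAsm(int argc, char **argv)
{
    pid_t pid;
    /* 32-bit x86 only: syscall number 20 is getpid, the result comes back in eax */
    asm volatile(
    "mov $20, %%eax\n\t"
    "int $0x80\n\t"
    "mov %%eax, %0\n\t"
    : "=m"(pid)
    :
    : "eax"   /* tell the compiler that eax is overwritten */
    );
    printf("current process's pid(ASM):%d\n", pid);
    return 0;
}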
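The inline assembly above only works on 32-bit x86, where system call number 20 is getpid. As a rough, more portable comparison, the sketch below issues the same call through syscall(2) with SYS_getpid, which also bypasses the getpid() wrapper. The helper name direct_getpid is made up for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: ask the kernel for the pid directly,
 * without going through the libc getpid() wrapper. */
pid_t direct_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Setting a breakpoint on sys_getpid and calling direct_getpid() repeatedly should hit the breakpoint every time, which makes it a handy baseline against the wrapper.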

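When the system call executes, it stops at the breakpoint set on sys_getpid, as shown in the figure below:

The SYSCALL_DEFINE macro in the figure stands out. One source explains it as follows: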

It is used (obviously) to define the given block of code as a system call. For example, fs/ioctl.c has the following code :

SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{
/* do freaky ioctl stuff */
}

Such a definition means that the ioctl syscall is declared and takes three arguments. The number next to the SYSCALL_DEFINE means the number of arguments. For example, in the case of getpid(void), declared in kernel/timer.c, we have the following code :

SYSCALL_DEFINE0(getpid)
{
        return task_tgid_vnr(current);
}

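However, getpid(void) has since moved to kernel/sys.c.

The logic is getting clearer, so let's dig into SYSCALL_DEFINE itself.

In the header file Linux/include/linux/syscalls.h, we can see where the macro is defined.

An earlier article, "Linux系统调用之SYSCALL_DEFINE", has already explained the purpose of this design and why it is done this way. I have to admire how clever it is.

This is why, when we call getpid(), we cannot tell what the system actually does: the kernel source contains no function written out literally under that name. Its definition is produced by a chain of macros that are expanded during preprocessing.

So let's do the expansion ourselves, for getpid.
When the system stops at the sys_getpid breakpoint, the code is at this location: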

   │816     SYSCALL_DEFINE0(getpid)                                       
b+>│817     {                                                             
   │818             return task_tgid_vnr(current);                        
   │819     }

The macro expansion rules are as follows:

#define SYSCALL_METADATA(sname, nb, ...)
...
#define SYSCALL_DEFINE0(sname)                                  \
        SYSCALL_METADATA(_##sname, 0);                          \
        asmlinkage long sys_##sname(void)

The macro in use expands as SYSCALL_DEFINE0(getpid) -> asmlinkage long sys_getpid(void);
so after expansion the function becomes:

   │816     asmlinkage long sys_getpid(void)                                       
b+>│817     {                                                             
   │818             return task_tgid_vnr(current);                        
   │819     }
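The token pasting behind this expansion can be tried out in user space. The sketch below is a simplified stand-in, not the kernel macro: DEMO_SYSCALL_DEFINE0 and the demo_sys_ prefix are names invented here, and SYSCALL_METADATA and asmlinkage are left out.

```c
#include <sys/types.h>
#include <unistd.h>

/* Simplified user-space stand-in for the kernel macro: it only pastes the
 * given name onto a prefix, the same way sys_##sname does. */
#define DEMO_SYSCALL_DEFINE0(sname) long demo_sys_##sname(void)

/* The preprocessor turns the next line into: long demo_sys_getpid(void) */
DEMO_SYSCALL_DEFINE0(getpid)
{
    return (long)getpid();
}
```

Running the file through `gcc -E` shows the expanded function header, which is the same trick that hides sys_getpid from a plain text search of the kernel tree.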

At this point the call to sys_getpid() is plain to see.
The next thing to be evaluated is the argument current, which I did not fully understand.

   │10      DECLARE_PER_CPU(struct task_struct *, current_task);          
   │11                                                                    
   │12      static __always_inline struct task_struct *get_current(void)  
   │13      {                                                             
  >│14              return this_cpu_read_stable(current_task);            
   │15      }                                                             
   │16                                                                    
   │17      #define current get_current()

#define this_cpu_read_stable(var) percpu_from_op("mov", var, "p" (&(var))) is only a macro. On a single-CPU system it should boil down to an ordinary read of current_task.

Execution then reaches task_tgid_vnr:

   │1770    static inline pid_t task_tgid_vnr(struct task_struct *tsk)    
   │1771    {                                                             
   │1772            return pid_vnr(task_tgid(tsk));                       
   │1773    }

Its argument, task_tgid(tsk), is evaluated first. It returns a pointer to a struct pid.

   │1708    static inline struct pid *task_tgid(struct task_struct *task) 
   │1709    {                                                             
  >│1710            return task->group_leader->pids[PIDTYPE_PID].pid;     
   │1711    }

upid, pid, and pid_link are defined as follows:

/*
 * struct upid is used to get the id of the struct pid, as it is
 * seen in particular namespace. Later the struct pid is found with
 * find_pid_ns() using the int nr and struct pid_namespace *ns.
 */

struct upid {
        /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
        int nr;
        struct pid_namespace *ns;
        struct hlist_node pid_chain;
};

struct pid
{
        atomic_t count;
        unsigned int level;
        /* lists of tasks that use this pid */
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        struct upid numbers[1];
};

extern struct pid init_struct_pid;

struct pid_link
{
        struct hlist_node node;
        struct pid *pid;
};

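pid_vnr actually calls pid_nr_ns, passing task_active_pid_ns(current) to obtain the namespace, which I don't fully understand...
In the end, the value returned is the nr produced by pid_nr_ns: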

   │497     pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)    
   │498     {                                                             
   │499             struct upid *upid;                                    
   │500             pid_t nr = 0;                                         
   │501                                                                   
   │502             if (pid && ns->level <= pid->level) {                 
   │503                     upid = &pid->numbers[ns->level];              
   │504                     if (upid->ns == ns)                           
   │505                             nr = upid->nr;                        
   │506             }                                                     
   │507             return nr;                                            
   │508     }                                                             
   │509     EXPORT_SYMBOL_GPL(pid_nr_ns);                                    
   │510                                                                  
   │511     pid_t pid_vnr(struct pid *pid)                                
  >│512     {                                                             
   │513             return pid_nr_ns(pid, task_active_pid_ns(current));   
   │514     }

Getting the namespace:

   │542     struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
   │543     {                                                             
  >│544             return ns_of_pid(task_pid(tsk));                      
   │545     }                                                             
   │546     EXPORT_SYMBOL_GPL(task_active_pid_ns);

   │124     /*                                                            
   │125      * ns_of_pid() returns the pid namespace in which the specifie
   │126      * allocated.                                                 
   │127      *                                                            
   │128      * NOTE:                                                      
   │129      *      ns_of_pid() is expected to be called for a process (task) that has
   │130      *      an attached 'struct pid' (see attach_pid(), detach_pid()) i.e @pid
   │131      *      is expected to be non-NULL. If @pid is NULL, caller should handle
   │132      *      the resulting NULL pid-ns.                            
   │133      */    
   │134     static inline struct pid_namespace *ns_of_pid(struct pid *pid)
   │135     {                                                             
   │136             struct pid_namespace *ns = NULL;                      
  >│137             if (pid)                                              
   │138                     ns = pid->numbers[pid->level].ns;             
   │139             return ns;                                            
   │140     }
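The namespace lookup traced above can be modeled in user space. The sketch below is only a model: the model_ types are stripped-down stand-ins for the kernel structures, MODEL_MAX_LEVEL is an assumption (the kernel sizes numbers[] at allocation time), and the two functions copy the logic of pid_nr_ns() and ns_of_pid() shown in the listings.

```c
#include <stddef.h>

#define MODEL_MAX_LEVEL 4   /* assumption: fixed depth, for this sketch only */

struct model_ns   { unsigned int level; };          /* nesting depth of the namespace */
struct model_upid { int nr; struct model_ns *ns; }; /* pid number as seen in one namespace */
struct model_pid {
    unsigned int level;                             /* deepest namespace this pid lives in */
    struct model_upid numbers[MODEL_MAX_LEVEL];     /* one (nr, ns) pair per level */
};

/* Same logic as pid_nr_ns(): the number of pid as seen from ns, or 0 */
int model_pid_nr_ns(struct model_pid *pid, struct model_ns *ns)
{
    int nr = 0;

    if (pid && ns->level <= pid->level) {
        struct model_upid *upid = &pid->numbers[ns->level];
        if (upid->ns == ns)
            nr = upid->nr;
    }
    return nr;
}

/* Same logic as ns_of_pid(): the namespace the pid was allocated in */
struct model_ns *model_ns_of_pid(struct model_pid *pid)
{
    return pid ? pid->numbers[pid->level].ns : NULL;
}
```

With a process created inside a child namespace, numbers[0] holds the number the root namespace sees and numbers[1] the number the child namespace sees, so the same struct pid yields different values depending on who asks. A namespace deeper than the pid's own level always gets 0.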

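The data structures are complicated, and I don't understand some of the key parts. Taken at face value, nothing suspicious has turned up by this point in the analysis.

A thought suddenly occurred to me: the process itself has no local or global variable that stores its pid, because there is no need for one (once a process exits and is reaped its pid is no longer valid, and a new pid is assigned every time it starts).

Could it be the compiler, then, keeping this value in a register? After all, this is process 1 and the value will never change again. Did the compiler notice that the pid of process 1 never changes and cache it, so that every later read comes straight from the cache?

When we use the standard API we generally include the header unistd.h.
The information we need is hidden in there.

The interfaces defined in unistd.h are mostly wrappers (wrapper functions) around system calls, such as fork, pipe, and the various I/O primitives (read, write, close, and so on).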
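To make the wrapper relationship concrete, here is a small sketch that writes to a pipe with the raw system call and reads the data back with the unistd.h wrapper. The helper name roundtrip_via_pipe is hypothetical, and the sketch assumes a short message that fits in the pipe buffer.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* pipe(), read(), close(), syscall() */

/* Hypothetical helper: send msg through a pipe, writing with the raw
 * system call and reading back with the libc wrapper.
 * Returns the number of bytes read back, or -1 on error.
 * Assumes len is small enough not to block on a full pipe. */
long roundtrip_via_pipe(const char *msg, size_t len, char *out)
{
    int fds[2];
    long written;
    long got;

    if (pipe(fds) != 0)
        return -1;

    written = syscall(SYS_write, fds[1], msg, len);                /* raw call */
    got = (written < 0) ? -1 : (long)read(fds[0], out, (size_t)written); /* wrapper */

    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Both paths end up in the same kernel code. The wrapper mainly adds a C prototype and errno handling, which is why a wrapper such as getpid() is also free to add extra behaviour of its own.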

No good ideas yet; I'll write more next time.

Here is an answer from an expert on Stack Overflow, which I haven't fully understood yet.
What is better “int 0x80” or “syscall”?

My answer here covers your question.
In practice, recent kernels are implementing a VDSO, notably to dynamically optimize system calls (the kernel sets the VDSO to some code best for the current processor). So you should use the VDSO, and you'll better use, for existing syscalls, the interface provided by the libc.
Notice that, AFAIK, a significant part of the cost of simple syscalls is going from user-space to kernel and back. Hence, for some syscalls (probably gettimeofday, getpid...) the VDSO might avoid even that (and technically might avoid doing a real syscall). For most syscalls (like open, read, send, mmap ....) the kernel cost of the syscall is large enough to make any improvement of the user-space to kernel space transition (e.g. using SYSENTER or SYSCALL machine instructions instead of INT) insignificant.

Note this sentence:
**Hence, for some syscalls (probably gettimeofday, getpid...) the VDSO might avoid even that (and technically might avoid doing a real syscall).**
The expert's answer points to so much reading material that I'm floored.
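Whether a vDSO is mapped into a process at all can be checked from user space. The sketch below reads the AT_SYSINFO_EHDR entry of the auxiliary vector with getauxval(3), available in glibc 2.16 and later; the helper name vdso_base is made up here. As far as I know, on x86 the vDSO exports time-related functions such as clock_gettime and gettimeofday rather than getpid, so treat the "getpid" in the quote as a guess on the answerer's part.

```c
#include <elf.h>        /* AT_SYSINFO_EHDR */
#include <sys/auxv.h>   /* getauxval(), glibc 2.16+ */

/* Hypothetical helper: address at which the kernel mapped the vDSO
 * into this process, or 0 if the auxiliary vector has no such entry. */
unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}
```

A nonzero result should match the [vdso] line in /proc/self/maps.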

System calls Implementation

It is explained in Linux Assembly Howto. And you should read wikipedia syscall page (and also about VDSO), and also intro(2) & syscalls(2) man pages. See also this answer and this one. Look also inside Gnu Libc & musl-libc source code. Learn also to use strace to find out which syscalls are made by a given command or process.
See also the calling conventions and Application Binary Interface specification relevant to your system. For x86-64 it is here.

I then found another source, which shows that getpid caches PIDs:

C library/kernel differences
Since glibc version 2.3.4, the glibc wrapper function for getpid() caches PIDs, so as to avoid additional system calls when a process calls getpid() repeatedly. Normally this caching is invisible, but its correct operation relies on support in the wrapper functions for fork(2), vfork(2), and clone(2): if an application bypasses the glibc wrappers for these system calls by using syscall(2), then a call to getpid() in the child will return the wrong value (to be precise: it will return the PID of the parent process). See also clone(2) for discussion of a case where getpid() may return the wrong value even when invoking clone(2) via the glibc wrapper function.
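The caching described in the man page would explain why only the first getpid() call hit the breakpoint. The sketch below shows the idea only and is not glibc's code: cached_getpid, cached_pid, and the counter are names invented for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall(), getpid() */

static pid_t cached_pid;     /* 0 means "not fetched yet" */
static int real_syscalls;    /* counts trips into the kernel */

/* Hypothetical wrapper, not glibc's implementation:
 * only the first call enters the kernel, later calls reuse the value. */
pid_t cached_getpid(void)
{
    if (cached_pid == 0) {
        cached_pid = (pid_t)syscall(SYS_getpid);
        real_syscalls++;
    }
    return cached_pid;
}

/* How many times cached_getpid() really asked the kernel */
int cached_getpid_syscall_count(void)
{
    return real_syscalls;
}
```

The sketch also makes the man page caveat easy to see: a child created by a raw clone or fork system call inherits cached_pid from its parent, so nothing resets it and the child keeps reporting the parent's pid.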

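I lost my train of thought here and don't know how to sort it out.

But when we call the getpid API, how does it connect to the system call? How do the two correspond? The glibc source tree contains the following code: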

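#define _GNU_SOURCE   /* needed for RTLD_NEXT */
#include <dlfcn.h>
#include <error.h>
#include <stdlib.h>
#include <sys/types.h>

pid_t pid2;   /* not declared in the excerpt; added so that it compiles */

pid_t getpid(void)
{
    pid_t (*f)(void);
    f = (pid_t (*)(void)) dlsym (RTLD_NEXT, "getpid");
    if (f == NULL)
        error (EXIT_FAILURE, 0, "dlsym (RTLD_NEXT, \"getpid\"): %s", dlerror ());
    return (pid2 = f()) + 26;
}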

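Roughly speaking, dlsym with RTLD_NEXT looks up the next definition of getpid among the loaded shared libraries, after the object that contains this code, and then calls it directly.
I came across the vDSO earlier; could this be related to it?
I don't understand what adding 26 to the return value is for. It may mean this snippet is a test of symbol interposition rather than the real getpid wrapper.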
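The dlsym lookup itself can be tried in isolation. In the sketch below, next_getpid and the getpid_fn typedef are hypothetical names. Called from the main program, RTLD_NEXT skips the program itself and normally finds the getpid in libc.

```c
#define _GNU_SOURCE          /* needed for RTLD_NEXT */
#include <dlfcn.h>           /* dlsym() */
#include <stddef.h>          /* NULL */
#include <sys/types.h>       /* pid_t */
#include <unistd.h>          /* getpid(), for comparison by callers */

typedef pid_t (*getpid_fn)(void);

/* Hypothetical helper: find the next definition of getpid after this object
 * in the dynamic linker's search order. Returns NULL if there is none. */
getpid_fn next_getpid(void)
{
    return (getpid_fn)dlsym(RTLD_NEXT, "getpid");
}
```

On glibc older than 2.34 this needs to be linked with -ldl. The returned pointer is an ordinary function, so next_getpid()() gives the same value as getpid(); no vDSO is involved in the lookup.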
