[linux-kernel] 进程死后,OS会flush吗? & flush时机? & release的时机

引用的主要目的是搞清楚进程crash后, 文件句柄是否会被释放, 数据是否flush回写, release如何被调用.

一. file_operations.flush 

作者:董昊 (要转载的同学帮忙把名字和博客链接 http://donghao.org/uii/带上,多谢了!)
chinaunix上 有人提问:linux下进程异常coredump或被信号杀死,打开的文件会自动flush吗? 好问题,我们使用2.6.9的内核源代码,分各种情况讨论。

1. 正常退出

如果你用 strace 跟一下最简单的linux命令,比如ls、lsof等,你会发现这些进程退出都会调用系统调用exit_group,那我们就从这个系统调用开始。

[ kernel/exit.c ]
   871 NORET_TYPE void
   872 do_group_exit(int exit_code)
   873 {
   874     BUG_ON(exit_code & 0x80); /* core dumps don't get here */
   875
   876     if (current->signal->group_exit)
   877         exit_code = current->signal->group_exit_code;
   878     else if (!thread_group_empty(current)) {
   879         struct signal_struct *const sig = current->signal;
   880         struct sighand_struct *const sighand = current->sighand;
   881         read_lock(&tasklist_lock);
   882         spin_lock_irq(&sighand->siglock);
   883         if (sig->group_exit)
   884             /* Another thread got here before we took the lock.  */
   885             exit_code = sig->group_exit_code;
   886         else {
   887             sig->group_exit = 1;
   888             sig->group_exit_code = exit_code;
   889             zap_other_threads(current);
   890         }
   891         spin_unlock_irq(&sighand->siglock);
   892         read_unlock(&tasklist_lock);
   893     }
   894
   895     do_exit(exit_code);
   896     /* NOTREACHED */
   897 }
   898
   899 /*
   900  * this kills every thread in the thread group. Note that any externally
   901  * wait4()-ing process will get the correct exit code - even if this
   902  * thread is not the thread group leader.
   903  */
   904 asmlinkage void sys_exit_group(int error_code)
   905 {
   906     do_group_exit((error_code & 0xff) << 8);
   907 }

上面的sys_exit_group就是系统统调用exit_group的实现,它调用了do_group_exit,而do_group_exit里当然处理退出各个group的操作,不过我们不关心,我们关心的是它调用了do_exit。一个进程正常退出就是指exit,其实在内核里的实现就是 do_exit(废话!)。还要注意,2.6.9内核有exit_group系统调用,2.6.32里已经没有了,在哪个版本消失的?这个我们以后再说。

[kernel/exit.c]
   783 asmlinkage NORET_TYPE void do_exit(long code)
   784 {
   785     struct task_struct *tsk = current;
   786
   787     profile_task_exit(tsk);
   788
   789     if (unlikely(in_interrupt()))
   790         panic("Aiee, killing interrupt handler!");
   791     if (unlikely(!tsk->pid))
   792         panic("Attempted to kill the idle task!");
   793     if (unlikely(tsk->pid == 1))
   794         panic("Attempted to kill init!");
   795     if (tsk->io_context)
   796         exit_io_context();
   797     tsk->flags |= PF_EXITING;
   798     del_timer_sync(&tsk->real_timer);
   799
   800     if (unlikely(in_atomic()))
   801         printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
   802                 current->comm, current->pid,
   803                 preempt_count());
   804
   805     if (unlikely(current->ptrace & PT_TRACE_EXIT)) {
   806         current->ptrace_message = code;
   807         ptrace_notify((PTRACE_EVENT_EXIT << 8) | SIGTRAP);
   808     }
   809
   810     acct_process(code);
   811     __exit_mm(tsk);
   812
   813     exit_sem(tsk);
   814     __exit_files(tsk);
   815     __exit_fs(tsk);
   816     exit_namespace(tsk);
   817     exit_thread();

do_exit 做的事情不多,退出mm(每个struct task_struct管辖的内存都在这个mm里),退出namespace,退出thread(主要是关闭TSS段上的IOMAP)等,我们关心的是 __exit_files(),这里就不贴代码了,都是一些短小的函数(功能单一,函数短小,好的编码风格,但这么多的函数,如何起名字?这是个挑战),__exit_files()调用put_files_struct(),而put_files_struct()调用 close_files(),关闭这个退出进程的所有打开文件,close_files接着调用filp_close():

[fs/open.c]
   989 int filp_close(struct file *filp, fl_owner_t id)
   990 {   
   991     int retval;
   992         
   993     /* Report and clear outstanding errors */
   994     retval = filp->f_error;
   995     if (retval)
   996         filp->f_error = 0;
   997             
   998     if (!file_count(filp)) {
   999         printk(KERN_ERR "VFS: Close: file count is 0\n");
  1000         return retval;
  1001     }       
  1002             
  1003     if (filp->f_op && filp->f_op->flush) {                             
  1004         int err = filp->f_op->flush(filp);                             
  1005         if (!retval)                                                   
  1006             retval = err;                                              
  1007     }
  1008
  1009     dnotify_flush(filp, id);
  1010     locks_remove_posix(filp, id);                                      
  1011     fput(filp);
  1012     return retval;
  1013 }  

上面是filp_close的实现,再清楚不过了,只要是普通的文件,谁打开谁就得负责关闭,而且关闭之前必须flush。有些程序open了某个文件,没有调用close就正常退出了,这种情况内核其实也通过do_exit帮这个要死的进程关闭(并flush)了它打开的文件,所以不用担心,没有什么资源泄漏。

2. 被信号杀死

这涉及到信号,linux的信号机制比BSD的复杂,这里不详述, 《linux内核源代码情景分析》上册 已经讲得很清楚。这里要关注的是,内核只在回到用户空间之前处理信号,处理信号的如入口是do_signal:

[arch/i386/kernel/signal.c]

  573 int fastcall do_signal(struct pt_regs *regs, sigset_t *oldset)
  574 {
  575     siginfo_t info;
  576     int signr;
  577     struct k_sigaction ka;
  578
  579     /*
  580      * We want the common case to go fast, which
  581      * is why we may in certain cases get here from
  582      * kernel mode. Just return without doing anything
  583      * if so.
  584      */
  585     if ((regs->xcs & 3) != 3)
  586         return 1;
  587
  588     if (current->flags & PF_FREEZE) {
  589         refrigerator(0);
  590         goto no_signal;
  591     }
  592
  593     if (!oldset)
  594         oldset = &current->blocked;
  595
  596     signr = get_signal_to_deliver(&info, &ka, regs, NULL);

get_signal_to_delive()的代码很多,但主要就是循环调用dequeue_signal,从信号队列里拿出所有待处理的信号,逐一处理之,注释如下。

[kernel/signal.c --> get_signal_to_delive]
  1831 int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
  1832               struct pt_regs *regs, void *cookie)
  1833 {
  1834     sigset_t *mask = &current->blocked;
  1835     int signr = 0;
  1836
  1837 relock:
  1838     spin_lock_irq(&current->sighand->siglock);
  1839     for (;;) {
  1840         struct k_sigaction *ka;
  1841
  1842         if (unlikely(current->signal->group_stop_count > 0) &&
  1843             handle_group_stop())
  1844             goto relock;
  1845
  1846         signr = dequeue_signal(current, mask, info);                          // 从当前进程的信号队列里取出信号
  1847
  1848         if (!signr)
  1849             break; /* will return 0 */
  1850
  1851         if ((current->ptrace & PT_PTRACED) && signr != SIGKILL) {      // 如果有strace跟踪当前进程,且无kill信号,则处理之。此代码块内就是strace工作原理。
  1852             ptrace_signal_deliver(regs, cookie);
  1853
  1854             /* Let the debugger run.  */
  1855             ptrace_stop(signr, info);
  1856
  1857             /* We're back.  Did the debugger cancel the sig?  */
  1858             signr = current->exit_code;
  1859             if (signr == 0)
  1860                 continue;
  1861
  1862             current->exit_code = 0;
  1863
  1864             /* Update the siginfo structure if the signal has
  1865                changed.  If the debugger wanted something
  1866                specific in the siginfo structure then it should
  1867                have updated *info via PTRACE_SETSIGINFO.  */
  1868             if (signr != info->si_signo) {
  1869                 info->si_signo = signr;
  1870                 info->si_errno = 0;
  1871                 info->si_code = SI_USER;
  1872                 info->si_pid = current->parent->pid;
  1873                 info->si_uid = current->parent->uid;
  1874             }
  1875
  1876             /* If the (new) signal is now blocked, requeue it.  */
  1877             if (sigismember(&current->blocked, signr)) {
  1878                 specific_send_sig_info(signr, info, current);
  1879                 continue;
  1880             }
  1881         }
  1882
  1883         ka = &current->sighand->action[signr-1];
  1884         if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */            // 如果用户要求忽略,那就继续循环,处理下一个信号
  1885             continue;
  1886         if (ka->sa.sa_handler != SIG_DFL) {                                   // 如果用户自己实现了信号处理函数,则执行之
  1887             /* Run the handler.  */
  1888             *return_ka = *ka;
  1889
  1890             if (ka->sa.sa_flags & SA_ONESHOT)
  1891                 ka->sa.sa_handler = SIG_DFL;
  1892
  1893             break; /* will return non-zero "signr" value */
  1894         }
  1895
  1896         /*
  1897          * Now we are doing the default action for this signal.
  1898          */
  1899         if (sig_kernel_ignore(signr)) /* Default is nothing. */
  1900             continue;
  1901
  1902         /* Init gets no signals it doesn't want.  */
  1903         if (current->pid == 1)
  1904             continue;
  1905
  1906         if (sig_kernel_stop(signr)) {                      // 如果是SIGSTOP,SIGSTP,SIGTTIN,SIGTTOU之一,则执行下面的代码块
  1907             /*
  1908              * The default action is to stop all threads in
  1909              * the thread group.  The job control signals
  1910              * do nothing in an orphaned pgrp, but SIGSTOP
  1911              * always works.  Note that siglock needs to be
  1912              * dropped during the call to is_orphaned_pgrp()
  1913              * because of lock ordering with tasklist_lock.
  1914              * This allows an intervening SIGCONT to be posted.
  1915              * We need to check for that and bail out if necessary.
  1916              */
  1917             if (signr == SIGSTOP) {
  1918                 do_signal_stop(signr); /* releases siglock */
  1919                 goto relock;
  1920             }
  1921             spin_unlock_irq(&current->sighand->siglock);
  1922
  1923             /* signals can be posted during this window */
  1924
  1925             if (is_orphaned_pgrp(process_group(current)))
  1926                 goto relock;
  1927
  1928             spin_lock_irq(&current->sighand->siglock);
  1929             if (unlikely(sig_avoid_stop_race())) {
  1930                 /*
  1931                  * Either a SIGCONT or a SIGKILL signal was
  1932                  * posted in the siglock-not-held window.
  1933                  */
  1934                 continue;
  1935             }
  1936
  1937             do_signal_stop(signr); /* releases siglock */
  1938             goto relock;
  1939         }
  1940
  1941         spin_unlock_irq(&current->sighand->siglock);
  1942
  1943         /*
  1944          * Anything else is fatal, maybe with a core dump.
  1945          */
  1946         current->flags |= PF_SIGNALED;
  1947         if (sig_kernel_coredump(signr) &&
  1948             do_coredump((long)signr, signr, regs)) {      // 注意,是SIGSEGV或SIGQUIT或SIGILL等信号,要coredump了!
  1949             /*
  1950              * That killed all other threads in the group and
  1951              * synchronized with their demise, so there can't
  1952              * be any more left to kill now.  The group_exit
  1953              * flags are set by do_coredump.  Note that
  1954              * thread_group_empty won't always be true yet,
  1955              * because those threads were blocked in __exit_mm
  1956              * and we just let them go to finish dying.
  1957              */
  1958             const int code = signr | 0x80;
  1959             BUG_ON(!current->signal->group_exit);
  1960             BUG_ON(current->signal->group_exit_code != code);
  1961             do_exit(code);                                         // 即使coredump,也要调用do_exit的
  1962             /* NOTREACHED */
  1963         }
  1964
  1965         /*
  1966          * Death signals, no core dump.
  1967          */
  1968         do_group_exit(signr);                                    // 上面的都不成立,进程退出
  1969         /* NOTREACHED */
  1970     }
  1971     spin_unlock_irq(&current->sighand->siglock);
  1972     return signr;
  1973 }

看上面代码,coredump如果成功,调用do_exit,要flush的;如果coredump不成功,下面第1968行do_group_exit里也要调用do_exit,还是要flush的。

总结
linux身为操作系统,对进程死掉这种情况必须处理的干干净净,因为这是经常经常发生的事,所以进程只要退出,哪怕是被kill信号杀死,哪怕是coredump,都是要调用flush的正常或异常退出时,都会走do_exit完成资源清理, 至此描述了flush总会被调用!!
所以yahoo利用linux下的这一特性,做了一个虚拟设备,进程启动时打开此设备,这之后,只要进程一死,不管是怎么死的,这个虚拟设备都会知道(因为设备可以截获flush),然后干些重要的事情......不能再说了,梅坚说了,今后不准顺便透露友公司的技术......所以就此打住,我可不想被 fire。

====== 2010.08.10 ======
进程不管是怎么死的,最终会调用flush,这是肯定的,但调用flush不等于说数据就一定会写往硬盘: http://linux.chinaunix.net/bbs/thread-1168706-1-1.html
该链接处的内容:
---------------------------------------------------------------------
我们经常说close(fd)系统调用执行完毕后,文件的数据都会写到硬盘上。
但是通过分析2.4.0的sys_close()代码(fs/open.c),发现两个问题:
1、inode对应的ext2_inode没有同步写回磁盘;
2、文件的脏的缓冲区即inode->i_dirty_buffers中的缓冲区没有同步写回磁盘。

sys_close()的调用流程:
-->filp_close()
       -->fput()
              (如果--file->f_count == 0)
              -->dput()
                     (如果--dentry->d_count==0)并且(list_empty(&dentry->d_hash为真)
                     -->dentry_iput()  
                            -->iput()
                                   (可能是,我不太确定)
                                   -->clear_inode()
                                          -->invalidate_inode_buffers()
下面的就不写了。其中invalidate_inode_buffers()只是将inode->i_dirty_buffers指向的链表中的每一项从该链表中删除,并没有将脏的缓冲区立即写到硬盘上。
是不是我遗漏了什么?要不然两个问题如何解释?
 
应该write back模式,是不保证写入磁盘的; 否则,fsync等是不是就没意义了
man close
A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes.It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)
-----------------------------------------------------------------------
二. file_operations.release
OS调用release的时机的描述是, 当该文件结构的引用计数为0时, 真正关闭文件, 并调用release. 代码如下:
进程正常退出时sys_close会调用到filp_close(); 异常退出时, 也会调用到filp_close. 进而都会调用到fput, 然后,
In fs/file_table.c

void fastcall fput(struct file *file)
{
 if (atomic_dec_and_test(&file->f_count))
  __fput(file);
}

EXPORT_SYMBOL(fput);

/* __fput is called from task context when aio completion releases the last
 * last use of a struct file *.  Do not use otherwise.
 */
void fastcall __fput(struct file *file)
{
 struct dentry *dentry = file->f_path.dentry;
 struct vfsmount *mnt = file->f_path.mnt;
 struct inode *inode = dentry->d_inode;

 might_sleep();

 fsnotify_close(file);
 /*
  * The function eventpoll_release() should be the first called
  * in the file cleanup chain.
  */
 eventpoll_release(file);
 locks_remove_flock(file);

 if (file->f_op && file->f_op->release)
  file->f_op->release(inode, file);
 security_file_free(file);
 if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
  cdev_put(inode->i_cdev);
 fops_put(file->f_op);
 if (file->f_mode & FMODE_WRITE)
  put_write_access(inode);
 put_pid(file->f_owner.pid);
 file_kill(file);
 file->f_path.dentry = NULL;
 file->f_path.mnt = NULL;
 file_free(file);
 dput(dentry);
 mntput(mnt);
}

其余还会减少一系列核心资源的引用计数.
 

你可能感兴趣的:([linux-kernel] 进程死后,OS会flush吗? & flush时机? & release的时机)