Pitfalls when reading a process's exit status with waitpid(pid, &status, options)

The crux of the problem

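    The Linux system call: pid_t waitpid(pid_t pid, int *stat_val, int options)
    To tell whether a process exited normally and whether an error occurred while it ran, the second argument must be non-NULL; waitpid stores the status information of the terminated child there. So how do we get the exit code out of that status?
    WIFEXITED(stat_val)
    WEXITSTATUS(stat_val)
    These two macros are used together to obtain the exit code of a process that exited normally. This is the heart of the problem: WEXITSTATUS yields only the least significant 8 bits of the value the child passed to exit(), read as an unsigned number from 0 to 255, not the original int. Keep this in mind.
    In the example below, the child hits an error and exits normally with code -1, but the code obtained this way is 255.
    Reference: https://linux.die.net/man/3/waitpid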
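    The behaviour is easy to reproduce in isolation. The following is a minimal sketch, not part of the original program: observed_exit_code() is a hypothetical helper that forks a child, lets it exit with the given code, and returns what the parent reads back through WIFEXITED and WEXITSTATUS.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, then return
 * what the parent sees through WEXITSTATUS. Returns -1 if fork/waitpid
 * failed or the child did not exit normally. */
int observed_exit_code(int code)
{
    int status = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;              /* fork failed */
    if (child == 0)
        _exit(code);            /* child: terminate immediately */

    if (waitpid(child, &status, 0) != child)
        return -1;              /* waitpid failed */
    if (!WIFEXITED(status))
        return -1;              /* terminated by a signal, not a normal exit */

    return WEXITSTATUS(status); /* always in the range 0..255 */
}
```

    Calling observed_exit_code(-1) returns 255, while small non-negative codes such as 0 or 3 come back unchanged.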


Problem scenario

    With the code below, when one of the subprocesses fails, the program ends up printing:
    Task executes failure!
    All the tasks were successfully completed!
    Surprising, isn't it?

/* DESC: entry point of the subprocess */
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* exit */

int main()
{
    // business logic details
    ...
    ...
    // on any error, report it and exit
    if (...)
    {
        printf("Task executes failure!\n");
        exit(-1);
    }

    return 0;
}
/*
 * DESC:
 * In the main process, two subprocesses are spawned and each of them
 * handles its own task.
 *
 * If a task fails, its subprocess exits with -1.
 *
 * While the subprocesses run, the main process waits until all of
 * them have exited, so the exit status of the main process depends
 * on the two subprocesses.
 */
errinfo *spawnTasksProcesses()
{
    errinfo *ei = NULL;
    long process_num = 2;
    long *pid = NULL;
    long return_val = -1;
    int exit_status = -1;
    bool has_error = false;
    int i = 0;
    
    // Spawn two subprocesses
    pid = (long*)calloc(process_num, sizeof(long));
    for (i = 0; i < process_num; i++)
    {
        pid[i] = mockFun_spawnProcess(...);
    }
    // Wait for the two subprocesses to exit
    while (true)
    {
        return_val = waitpid(-1, &exit_status, WNOHANG);
        // No child has changed state yet, keep polling
        if (0 == return_val)
            continue;
        // A child has exited: check whether it reported an error
        if (0 < return_val)
        {
            if (-1 == exit_status)
                has_error = true;
        }
        // Set process ID as -1 once it exits
        for (i = 0; i < process_num; i++)
        {
            if (pid[i] == return_val)
            {
                pid[i] = -1;
                break;   
            }
        }
        // Check if all processes exit
        for (i = 0; i< process_num; i++)
        {
            if (-1 != pid[i])
                break;
        }
        // Make sure all subprocesses exit
        if (i == process_num)
            break;
    }
    free(pid);
    if (has_error)
        ei = error_constructor(ERR_NUM, ERR_LEVEL, "Error occurs when doing task.\n");
    return ei;
}

/*
 * DESC:
 * Invokes the function spawnTasksProcesses() above
*/
void spawnTasksProcesses_invoker()
{
    if (NULL == spawnTasksProcesses())
        printf("All the tasks were successfully completed!\n");
    else
        printf("Not all tasks were successfully completed: Error occurs, please check!\n");
}

Troubleshooting and solution

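    The investigation went as follows:
    The program prints "All the tasks were successfully completed!" -->
    so the variable 'has_error' in the callee spawnTasksProcesses() must be false -->
    which narrows things down to lines 54 to 58 of the listing, the "if (0 < return_val)" block -->
    debugging shows that 'exit_status' is 65280 when the failed process exits -->
    a very strange number indeed!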
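    The number looks less strange in hexadecimal: 65280 is 0xFF00, that is, 255 shifted left by 8 bits. The sketch below decodes it by hand. It assumes the Linux-style layout in which the exit code of a normally exited child sits in bits 8 to 15 of the raw status; both function names are hypothetical, and portable code should rely on WEXITSTATUS rather than on this shift.

```c
#include <stdio.h>

/* Assumption: Linux-style layout, where a normally exited child's code
 * occupies bits 8..15 of the raw wait status. */
int exit_byte_from_raw(int raw_status)
{
    return (raw_status >> 8) & 0xFF;
}

/* Prints the raw value in decimal and hex next to the decoded exit byte. */
void print_raw_status(int raw_status)
{
    printf("raw = %d (0x%04X), exit byte = %d\n",
           raw_status, (unsigned)raw_status, exit_byte_from_raw(raw_status));
}
```

    With this layout, print_raw_status(65280) prints "raw = 65280 (0xFF00), exit byte = 255", which also shows why comparing the raw status against -1 can never succeed.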

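    After digging through various sources and carefully reading https://linux.die.net/man/3/waitpid, it became clear that 65280 is the raw wait status, not the value the child passed to exit() (which is always -1 on error). One step away from the truth.
    To get the real exit code, use the macros
    WIFEXITED(stat_val)
    WEXITSTATUS(stat_val)
    and change the condition on line 56 of the original program, "if (-1 == exit_status)", to "(WIFEXITED(exit_status) && WEXITSTATUS(exit_status) == -1)".
    Lines 54 to 58 of the original program become: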

   if (0 < return_val)
   {
       if (WIFEXITED(exit_status) && WEXITSTATUS(exit_status) == -1)
           has_error = true;
   }

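    Recompiled, replaced the binary, and the result was still not what I expected.
    Debugging again showed that the value extracted from 'exit_status' by WEXITSTATUS was now 255.
    Honestly, I was baffled for a moment…
    After some silent cursing, something came back to me… so I went back and reread the official documentation for the macro WEXITSTATUS(stat_val):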

WEXITSTATUS(status)
returns the exit status of the child. This consists of the least significant 8 bits of the status argument that the child specified in a call to exit(3) or _exit(2) or as the argument for a return statement in main(). This macro should only be employed if WIFEXITED returned true.

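    Two's complement: isn't the 8-bit two's complement representation of -1 exactly 0xFF, which is 255 when read as an unsigned value?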
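    The arithmetic can be checked directly. This is a small sketch with hypothetical helper names: low_byte() keeps only the 8 bits that survive exit(), and as_signed_byte() maps the 0..255 value back to the signed code the child intended, assuming the child only ever uses codes between -128 and 127.

```c
/* Keep only the low 8 bits of an exit code, as the parent will see them.
 * Conversion to unsigned char is defined as reduction modulo 256. */
int low_byte(int code)
{
    return (unsigned char)code;
}

/* Map an exit byte (0..255) back to a signed value (-128..127).
 * Assumption: the child only used codes in that signed range. */
int as_signed_byte(int exit_byte)
{
    return exit_byte >= 128 ? exit_byte - 256 : exit_byte;
}
```

    Here low_byte(-1) is 255 and as_signed_byte(255) is -1. Comparing against 0 is still the simpler test, since any non-zero byte means the child reported a failure.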
    Update the "if (0 < return_val)" block once more:

   if (0 < return_val)
   {
       if (WIFEXITED(exit_status) && WEXITSTATUS(exit_status) != 0)
           has_error = true;
   }

    Recompiled, replaced the binary, and ran it again. Fixed!
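    One case the fixed condition still misses is a child terminated by a signal, for which WIFEXITED is false. The sketch below is one possible way to cover it, not the original program: it swaps the WNOHANG polling loop for a blocking waitpid, and spawn_child_exiting_with() is a hypothetical stand-in for mockFun_spawnProcess().

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for mockFun_spawnProcess(): forks a child that
 * exits immediately with `code`. Returns the child's pid, or -1 on error. */
pid_t spawn_child_exiting_with(int code)
{
    pid_t child = fork();
    if (child == 0)
        _exit(code);
    return child;
}

/* Blocks until every child of the caller has terminated and returns how
 * many of them failed (non-zero exit code or death by signal). */
int count_failed_children(void)
{
    int failed = 0;
    int status = 0;

    for (;;)
    {
        pid_t done = waitpid(-1, &status, 0);
        if (done < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            break;              /* ECHILD (no children left) or another error */
        }
        if (WIFEXITED(status))
        {
            if (WEXITSTATUS(status) != 0)
                failed++;       /* normal exit with a non-zero code */
        }
        else if (WIFSIGNALED(status))
        {
            failed++;           /* terminated by a signal */
        }
    }
    return failed;
}
```

    After spawning one child that exits with 0 and one that exits with -1, count_failed_children() returns 1. Blocking in waitpid also avoids spinning the CPU while the children are still running.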


Conclusion

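    Read the official documentation carefully, carefully, and carefully again.
    Especially for system functions you are not familiar with.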
