Unable to create threads inside a Docker container (how ulimit takes effect, with kernel source analysis)

I. Symptoms

  1. During load testing, several modules reported "unable to create thread: Resource temporarily unavailable".
  2. Both the service processes and supervisor hit this error; after repeated restart attempts failed, the services gave up and the container exited.

II. Root cause summary

In short, the scope at which resource limits are configured does not match the scope at which the kernel enforces them. The ulimit configuration is read inside the container and applied per process, but for some resources (here, the number of threads) the kernel's check does not distinguish processes: it counts the total across all processes of a single user on the whole machine.

The user's thread count is judged by the kernel. Containers are isolated runtime environments, but to the kernel they are just groups of processes: processes running under the same user ID are counted together, even across containers.
On the other hand, the limit configuration actually takes effect in user space, i.e. each container reads its own limit files.
CentOS's default limit configuration sets the soft nproc limit for non-root users to 4096:

root@cvm-172_16_30_8:~ # cat /etc/security/limits.d/20-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*          soft    nproc     4096
root       soft    nproc     unlimited

When a system call creates a new thread, the kernel checks against these values. Hence the observed symptom: a user ID has few threads inside one container yet cannot create more, because the process's own nproc limit is 4096 while, as the kernel sees it, that user's total thread count on the machine already exceeds 4096.
The detailed mechanism follows.
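To confirm you are in this situation (a quick sketch; the username yibot comes from the entrypoint below, substitute your own), compare the per-process limit inside the container with the user's machine-wide thread count on the host:

# inside the container: the soft nproc limit that applies to this process
ulimit -S -u                        # -> 4096 for the non-root user

# on the host: the user's thread count across ALL containers
# (ps -eL prints one line per thread; use the numeric UID instead if the
# host's /etc/passwd has no entry for it)
ps -eL -o user= | grep -c '^yibot'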

III. Details

This part covers three things:

  • when the ulimit configuration takes effect;
  • how the kernel judges whether a limit is exceeded;
  • how ulimit settings are inherited when Docker starts a container, i.e. how to fix the problem.

How the ulimit configuration takes effect

1. How processes were originally started inside the container:

When the container starts, it runs entrypoint.sh, which creates a user with the given ID, fixes directory ownership, then switches to that user via su and runs supervisord, which in turn starts the service processes:

➜  data-proxy git:(master) ✗ cat entrypoint.sh
#!/bin/sh
username="yibot"

#create user if not exists
egrep "^${YIBOT_UID}" /etc/passwd >& /dev/null
if [ $? -ne 0 ]
then
    useradd -u "${YIBOT_UID}" "${username}"
fi

mkdir -p /data/yibot/"${MODULE}"/log/ && \
    mkdir -p /data/supervisor/ && \
    chown -R "${YIBOT_UID}":"${YIBOT_UID}" /entrypoint && \
    chown -R "${YIBOT_UID}":"${YIBOT_UID}" /data && \
    su yibot -c "supervisord -n"

2. A brief introduction to PAM

PAM stands for "Pluggable Authentication Modules". The modules are not part of the kernel, and the kernel itself performs no authentication; PAM is a library that decouples applications needing authentication from the authentication mechanisms themselves. Applications such as su and login use it.
An introduction to PAM: https://www.linuxjournal.com/article/5940
PAM man page: http://man7.org/linux/man-pages/man8/pam.8.html
PAM source: https://github.com/linux-pam/linux-pam/tree/master/libpam

3. PAM and reading the ulimit configuration

In the PAM source, the limits handling lives in https://github.com/linux-pam/linux-pam/blob/master/modules/pam_limits/pam_limits.c; every PAM session call runs parse_config_file -> setup_limits:

    retval = parse_config_file(pamh, pwd->pw_name, pwd->pw_uid, pwd->pw_gid, ctrl, pl);
    retval = setup_limits(pamh, pwd->pw_name, pwd->pw_uid, ctrl, pl);

parse_config_file reads the limit configuration from the given config files into the pam_limit_s struct pointed to by pl, defined as follows:

/* internal data */
struct pam_limit_s {
    int login_limit;     /* the max logins limit */
    int login_limit_def; /* which entry set the login limit */
    int flag_numsyslogins; /* whether to limit logins only for a
                  specific user or to count all logins */
    int priority;    /* the priority to run user process with */
    struct user_limits_struct limits[RLIM_NLIMITS];
    const char *conf_file;
    int utmp_after_pam_call;
    char login_group[LINE_LENGTH];
};

Each limit's value is stored in the limits array; user_limits_struct holds both the soft and the hard limit:

struct user_limits_struct {
    int supported;
    int src_soft;
    int src_hard;
    struct rlimit limit;
};

The limit member is initialized in init_limits with the current process's values, obtained via the getrlimit system call.

After the config files are parsed, setup_limits updates the rlim values in the current process's PCB via the setrlimit system call:

for (i=0, status=LIMITED_OK; i<RLIM_NLIMITS; i++) {
    int res;

    if (!pl->limits[i].supported) {
        /* skip it if its not known to the system */
        continue;
    }
    if (pl->limits[i].src_soft == LIMITS_DEF_NONE &&
        pl->limits[i].src_hard == LIMITS_DEF_NONE) {
        /* skip it if its not initialized */
        continue;
    }
    if (pl->limits[i].limit.rlim_cur > pl->limits[i].limit.rlim_max)
        pl->limits[i].limit.rlim_cur = pl->limits[i].limit.rlim_max;
    res = setrlimit(i, &pl->limits[i].limit);
    if (res != 0)
        pam_syslog(pamh, LOG_ERR, "Could not set limit for '%s': %m",
               rlimit2str(i));
    status |= res;
}

That is the whole path by which the PAM library reads and applies the limit configuration. The concrete behavior of the getrlimit/setrlimit system calls is covered below.
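As a rough shell equivalent of this parse-then-set flow (a sketch only: pam_limits handles many more directives; the shell builtin ulimit wraps getrlimit/setrlimit):

# read the soft nproc value for '*' from the CentOS default config ...
limit=$(awk '$1=="*" && $2=="soft" && $3=="nproc" {print $4}' \
        /etc/security/limits.d/20-nproc.conf)

# ... and apply it to the current process, i.e. setrlimit(RLIMIT_NPROC, ...)
ulimit -S -u "$limit"

# getrlimit: prints 4096; children started from this shell inherit the cap
ulimit -S -u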

4. su and PAM:

su source: https://github.com/shadow-maint/shadow/blob/master/src/su.c
In the current su implementation there is conditional compilation for PAM:

#ifdef USE_PAM
    ret = pam_start ("su", name, &conv, &pamh);
    if (PAM_SUCCESS != ret) {
        SYSLOG ((LOG_ERR, "pam_start: error %d", ret));
        fprintf (stderr,
                 _("%s: pam_start: error %d\n"),
                 Prog, ret);
        exit (1);
    }

On a recent CentOS, ldd on su confirms that this option is enabled:

root@cvm-172_16_30_8:~ # ldd /usr/bin/su | grep pam
    libpam.so.0 => /lib64/libpam.so.0 (0x00007f4d429a6000)
    libpam_misc.so.0 => /lib64/libpam_misc.so.0 (0x00007f4d427a2000)

The su man page says the same:

This  version of su uses PAM for authentication, account and session management.  Some configuration options found in other su implementations such as e.g. support of a wheel group have to be configured via PAM.

pam_start ("su", name, &conv, &pamh)pam会在/etc/pam.d/下查找名为su的文件进行配置加载,该文件中指定了pam认证中需要用到的库。实现可插拔的特性
最终在pam打开会话pam_open_session会调用pam_limits中的pam_sm_open_session实现limits相关配置文件的解析和设置。
su切换用户后,默认打开shell,会继承更新后的limits配置,具体继承机制见下文。

/*
 * Use the shell and create an argv
 * with the rest of the command line included.
 */
argv[-1] = cp;
execve_shell (shellstr, &argv[-1], environ);

Every process started afterwards inherits these limits.
A PAM programming example: https://www.freebsd.org/doc/en_US.ISO8859-1/articles/pam/pam-sample-appl.html

This explains the original entrypoint.sh behavior: su calls into PAM, which reads the limit configuration inside the container (/etc/security/limits.d/), so for a non-root user the process's nproc limit is set to 4096.
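This is easy to observe (a sketch, assuming a user yibot and the default CentOS limits.d config shown earlier): a switch through su runs pam_limits, while a switch that bypasses PAM, e.g. util-linux setpriv, keeps the caller's inherited limits.

# through su: pam_limits applies /etc/security/limits.d/
su yibot -c 'ulimit -S -u'        # -> 4096

# bypassing PAM: the child simply inherits the caller's (here root's) limits
setpriv --reuid=yibot --regid=yibot --init-groups sh -c 'ulimit -S -u'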

5. Behavior of the setrlimit system call

Kernel source: https://github.com/torvalds/linux
The setrlimit system call:

SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
{
    struct rlimit new_rlim;

    if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
        return -EFAULT;
    return do_prlimit(current, resource, &new_rlim, NULL);
}

current is a pointer to the current process's PCB, the task_struct. do_prlimit calls security_task_setrlimit as a permission check before writing the new values into the limit array in the PCB:

int security_task_setrlimit(struct task_struct *p, unsigned int resource,
        struct rlimit *new_rlim)
{
    return call_int_hook(task_setrlimit, 0, p, resource, new_rlim);
}

The macro below walks the hlist of LSM hooks registered for FUNC (security_hook_heads.FUNC), invoking each module's implementation in turn and stopping at the first non-zero return value:

#define call_int_hook(FUNC, IRC, ...) ({            \
    int RC = IRC;                       \
    do {                            \
        struct security_hook_list *P;           \
                                \
        hlist_for_each_entry(P, &security_hook_heads.FUNC, list) { \
            RC = P->hook.FUNC(__VA_ARGS__);     \
            if (RC != 0)                \
                break;              \
        }                       \
    } while (0);                        \
    RC;                         \
})

For reference, part of the PCB task_struct definition; the full definition is at https://github.com/torvalds/linux/blob/master/include/linux/sched.h

struct task_struct {
    ...
    /* Real parent process: */
    struct task_struct __rcu    *real_parent;

    /* Recipient of SIGCHLD, wait4() reports: */
    struct task_struct __rcu    *parent;

    /*
     * Children/sibling form the list of natural children:
     */
    struct list_head        children;
    struct list_head        sibling;
    struct task_struct      *group_leader;
    ...

    /* Effective (overridable) subjective task credentials (COW): */
    const struct cred __rcu     *cred;
    ...
    /* Signal handlers: */
    struct signal_struct        *signal;
    ...
}

Note: through list_head and the list_entry macro, the kernel implements a generic doubly linked list.
rlim is defined in struct signal_struct:

struct signal_struct {
    ...
     /*
     * We don't bother to synchronize most readers of this at all,
     * because there is no reader checking a limit that actually needs
     * to get both rlim_cur and rlim_max atomically, and either one
     * alone is a single word that can safely be read normally.
     * getrlimit/setrlimit use task_lock(current->group_leader) to
     * protect this instead of the siglock, because they really
     * have no need to disable irqs.
     */
    struct rlimit rlim[RLIM_NLIMITS];
    ...
}

The rlim array holds this process's resource limit values, and it is this array that setrlimit ultimately modifies. Each process holds its own copy (via its signal_struct, so the threads of one process share it).
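Because the values live in each process's own PCB, changing them in one process does not affect others. A small sketch with util-linux prlimit (which uses the prlimit64 syscall, ending in the same do_prlimit shown above):

sleep 300 &                            # a separate process to modify
prlimit --pid $! --nproc=2048:4096     # rewrite only ITS rlim[RLIMIT_NPROC]

grep 'Max processes' /proc/$!/limits   # -> 2048 (soft) / 4096 (hard)
grep 'Max processes' /proc/$$/limits   # the current shell is unchanged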

How the kernel enforces the nproc limit (process-count limit)

1. The user's total process count

The task_struct definition above contains a struct cred, defined as follows:

struct cred {
    ...
    kuid_t      uid;        /* real UID of the task */
    kgid_t      gid;        /* real GID of the task */
    kuid_t      suid;       /* saved UID of the task */
    kgid_t      sgid;       /* saved GID of the task */
    kuid_t      euid;       /* effective UID of the task */
    kgid_t      egid;       /* effective GID of the task */
    kuid_t      fsuid;      /* UID for VFS ops */
    kgid_t      fsgid;      /* GID for VFS ops */
    ...
    struct user_struct *user;   /* real user ID subscription */
    ...
}

where struct user_struct is defined as:

struct user_struct {
    refcount_t __count; /* reference count */
    atomic_t processes; /* How many processes does this user have? */
    atomic_t sigpending;    /* How many pending signals does this user have? */
    ...
}

As the next subsection shows, the struct user_struct *user in the PCB is globally unique per UID, so processes is the total number of processes the user is currently running system-wide (in Linux, processes and threads are nearly the same; the kernel has no separate thread concept). From http://www.mulix.org/lectures/kernel_workshop_mar_2004/things.pdf:

In Linux, processes and threads are almost the same. The major difference is that threads share the same virtual memory address space.
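The counting granularity is visible from the shell: RLIMIT_NPROC counts kernel tasks, i.e. threads, not just thread-group leaders.

ps -e  --no-headers | wc -l    # thread-group leaders ("processes")
ps -eL --no-headers | wc -l    # kernel tasks, i.e. threads; this larger
                               # number is what user_struct.processes tracks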

2. struct user_struct *user is globally unique per UID

In su's implementation, change_uid is called, which ultimately switches the UID via the setuid system call:

SYSCALL_DEFINE1(setuid, uid_t, uid)
{
    return __sys_setuid(uid);
}

__sys_setuid calls set_user to actually switch the user; the parameter new is a copy of the cred struct from the current PCB:

long __sys_setuid(uid_t uid)
{
    ...
    if (ns_capable_setid(old->user_ns, CAP_SETUID)) {
        new->suid = new->uid = kuid;
        if (!uid_eq(kuid, old->uid)) {
            retval = set_user(new);
            if (retval < 0)
                goto error;
        }
    } else if (!uid_eq(kuid, old->uid) && !uid_eq(kuid, new->suid)) {
        goto error;
    }
}

The complete implementation of set_user:

/*
 * change the user struct in a credentials set to match the new UID
 */
static int set_user(struct cred *new)
{
    struct user_struct *new_user;

    new_user = alloc_uid(new->uid);
    if (!new_user)
        return -EAGAIN;

    /*
     * We don't fail in case of NPROC limit excess here because too many
     * poorly written programs don't check set*uid() return code, assuming
     * it never fails if called by root.  We may still enforce NPROC limit
     * for programs doing set*uid()+execve() by harmlessly deferring the
     * failure to the execve() stage.
     */
    if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
            new_user != INIT_USER)
        current->flags |= PF_NPROC_EXCEEDED;
    else
        current->flags &= ~PF_NPROC_EXCEEDED;

    free_uid(new->user);
    new->user = new_user;
    return 0;
}

Now look at alloc_uid:

struct user_struct *alloc_uid(kuid_t uid)
{
    struct hlist_head *hashent = uidhashentry(uid);
    struct user_struct *up, *new;

    spin_lock_irq(&uidhash_lock);
    up = uid_hash_find(uid, hashent);
    spin_unlock_irq(&uidhash_lock);
    ...
}

In kernel/user.c, uidhashentry is defined as:

#define uidhashentry(uid)   (uidhash_table + __uidhashfn((__kuid_val(uid))))

static struct kmem_cache *uid_cachep;
struct hlist_head uidhash_table[UIDHASH_SZ];

Together with the implementation of uid_hash_find:

static struct user_struct *uid_hash_find(kuid_t uid, struct hlist_head *hashent)
{
    struct user_struct *user;

    hlist_for_each_entry(user, hashent, uidhash_node) {
        if (uid_eq(user->uid, uid)) {
            refcount_inc(&user->__count);
            return user;
        }
    }

    return NULL;
}

So for a given UID the user_struct really is globally unique: the UID is hashed into uidhash_table, the bucket's list is searched for the matching struct, and a pointer to it is handed back to the PCB.

3. Validating new process creation

In fact, set_user above already contains this check:

if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
            new_user != INIT_USER)
        current->flags |= PF_NPROC_EXCEEDED;
    else
        current->flags &= ~PF_NPROC_EXCEEDED;

rlimit(RLIMIT_NPROC) reads the nproc limit from the current process's own PCB and compares it with the new user's total process count.
There is a similar check in the exec implementation __do_execve_file (https://github.com/torvalds/linux/blob/master/fs/exec.c):

if ((current->flags & PF_NPROC_EXCEEDED) &&
    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
    retval = -EAGAIN;
    goto out_ret;
}

Other process-creation paths perform similar checks.
The counter itself is maintained on fork via copy_creds, which does atomic_inc(&p->cred->user->processes);, and when credentials change (exec/set*uid) via commit_creds, which moves the count to the new user's user_struct.

IV. How Docker containers inherit ulimit settings

1. How a child process inherits the parent's ulimit

When a process is forked, kernel/fork.c copies the contents of the PCB; in copy_signal:

static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
    ...
    task_lock(current->group_leader);
    memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
    task_unlock(current->group_leader);
    ...
}

The rlim array in the PCB is copied wholesale: unless changed later via setrlimit, the child's limits are identical to the parent's.
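A two-line demonstration of this inheritance from the shell:

ulimit -S -u 1024        # lower the soft nproc limit of this shell
sh -c 'ulimit -S -u'     # the forked child inherits it: prints 1024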

2. How Docker starts a container

According to the official documentation, the rlim of PID 1 in a container is inherited from the Docker daemon:
https://docs.docker.com/engine/reference/commandline/run/

Note: If you do not provide a hard limit, the soft limit will be used for both values. If no ulimits are set, they will be inherited from the default ulimits set on the daemon. The as option is disabled now. In other words, the following script is not supported:...

Since the Docker daemon normally runs as root, PID 1 carries root's rlim even when the container is started as a non-root user.
As long as nothing reads the in-container ulimit configuration via PAM (for example by running su to switch user inside the container, or by logging in remotely), every child process keeps inheriting root's rlim.

In summary, the fix is simply not to run su inside the container before the service processes start. The user can be specified when the container is launched, which does not disturb the limits inherited uniformly from the Docker daemon.
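A minimal sketch of the fix (the image name and limit values are placeholders): choose the user at docker run time, or pin the limit explicitly so it no longer depends on any in-container configuration.

# run as the target user from the start: no su, so pam_limits never fires
docker run --user "${YIBOT_UID}" some-image supervisord -n

# and/or set the limit explicitly at container start
docker run --ulimit nproc=65535:65535 some-image supervisord -n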
