C/C++: pthread_cond_timedwait fails to block (times out and returns immediately)


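A few days ago, while deploying software in the production environment, I noticed one process using a very large amount of CPU. After digging into it, I found that pthread_cond_timedwait was failing to block while handling messages; in other words, it timed out and returned before the requested time had elapsed.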

Sample code:

#include <iostream>
#include <pthread.h>
#include <sys/time.h>

using namespace std;

class Ebupt
{
public:
    Ebupt();
    virtual ~Ebupt();
    void dealMsg(long wait_ns);
private:
    pthread_mutex_t mutex;
    pthread_cond_t cond;
};

Ebupt::Ebupt()
{
    pthread_mutex_init(&mutex, NULL);
    pthread_cond_init(&cond, NULL);
}

Ebupt::~Ebupt()
{
    pthread_mutex_destroy(&mutex);
    pthread_cond_destroy(&cond);
}

void Ebupt::dealMsg(long wait_ns)
{
    pthread_mutex_lock(&mutex);

    struct timeval now;
    gettimeofday(&now, NULL);
    struct timespec abstime;

    if (now.tv_usec*1000 + (wait_ns%1000000000) >= 1000000000)
    {
        abstime.tv_sec = now.tv_sec + wait_ns/1000000000 + 1;
        abstime.tv_nsec = (now.tv_usec*1000 + wait_ns%1000000000)%1000000000;
    }
    else
    {
        abstime.tv_sec = now.tv_sec + wait_ns/1000000000;
        abstime.tv_nsec = now.tv_usec*1000 + wait_ns%1000000000;
    }

    pthread_cond_timedwait(&cond, &mutex, &abstime);
    pthread_mutex_unlock(&mutex);
}

int main()
{
    Ebupt e;
    struct timeval now;
    while (true)
    {
        gettimeofday(&now, NULL);
        cout<<"++"<<now.tv_sec<<":"<<now.tv_usec<<endl;
        e.dealMsg(200000000);   // wait 200 ms (argument is in nanoseconds)
        gettimeofday(&now, NULL);
        cout<<"--"<<now.tv_sec<<":"<<now.tv_usec<<endl;
    }
    return 0;
}

Compilation and output:

[ismp@cn3 20171026]$ g++ -o main main.C -lpthread
[ismp@cn3 20171026]$ ./main
++1509023506:721641
--1509023506:721706
++1509023506:721710
--1509023506:721716
++1509023506:721718
--1509023506:721724
++1509023506:721726
--1509023506:721731
++1509023506:721733
--1509023506:721739
++1509023506:721741
--1509023506:721750
++1509023506:721753
--1509023506:721761
++1509023506:721763
--1509023506:721769
……
(CTRL+C)

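In theory, since I never signal the condition variable, the call should block for 200 ms and then return from the wait with a timeout. In practice it does not block at all: it times out and returns immediately, like a runaway horse. Because dealMsg is called inside a while loop, the program behaves like a busy loop, so high CPU usage is only to be expected.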
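
One way to confirm this from inside the program is to check the return value of pthread_cond_timedwait and to measure how long the call really took. The following is only a minimal sketch, not the code from my production process: the helper names add_ns and timed_wait_ms are hypothetical, the condition variable is created with default attributes as in the example above, and the elapsed time is measured with CLOCK_MONOTONIC so that the measurement itself does not depend on wall time.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <time.h>

// Hypothetical helper: returns t + ns with tv_nsec normalized into [0, 1e9).
static timespec add_ns(timespec t, long ns)
{
    assert(ns >= 0);
    const long NS_PER_SEC = 1000000000L;
    t.tv_sec  += ns / NS_PER_SEC;
    t.tv_nsec += ns % NS_PER_SEC;
    if (t.tv_nsec >= NS_PER_SEC)
    {
        t.tv_sec  += 1;
        t.tv_nsec -= NS_PER_SEC;
    }
    return t;
}

// Hypothetical helper: waits on a private, never-signalled condition variable
// for wait_ns nanoseconds. Stores the return code of pthread_cond_timedwait
// in *rc and returns the elapsed time in milliseconds.
static long timed_wait_ms(long wait_ns, int *rc)
{
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    pthread_mutex_init(&mutex, NULL);
    pthread_cond_init(&cond, NULL);          // default attributes

    timespec start, end, now;
    clock_gettime(CLOCK_MONOTONIC, &start);  // used only for measuring
    clock_gettime(CLOCK_REALTIME, &now);     // base of the absolute timeout
    timespec abstime = add_ns(now, wait_ns);

    pthread_mutex_lock(&mutex);
    *rc = pthread_cond_timedwait(&cond, &mutex, &abstime);
    pthread_mutex_unlock(&mutex);
    clock_gettime(CLOCK_MONOTONIC, &end);

    pthread_cond_destroy(&cond);
    pthread_mutex_destroy(&mutex);

    return (end.tv_sec - start.tv_sec) * 1000L
         + (end.tv_nsec - start.tv_nsec) / 1000000L;
}
```

On a healthy machine I would expect timed_wait_ms(200000000, &rc) to return roughly 200 with rc equal to ETIMEDOUT. A result close to 0 together with ETIMEDOUT would match the output shown above.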

Let's look at top:

top - 21:15:52 up 419 days,  7:30,  2 users,  load average: 9.57, 8.94, 8.32
Tasks: 241 total,   3 running, 238 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.6%us, 63.1%sy,  0.0%ni, 24.6%id,  0.0%wa,  0.0%hi,  1.6%si,  0.0%st
Mem:  32879016k total, 32578784k used,   300232k free,   217448k buffers
Swap:  2097144k total,   749020k used,  1348124k free, 28921976k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       
22096 ismp      20   0 13904 1116  956 S  3.7  0.0   0:01.84 main                                                                                                           
20409 ismp      20   0  109m 1956 1556 S  0.0  0.0   0:00.02 bash

It is simply a busy loop that never blocks…

This simple example is not too bad, with CPU usage just under 4%, but my real process went straight to over 70%…

I went through many possible causes. At one point I wondered: could the clock used by gettimeofday be different from the one pthread_cond_timedwait actually uses?

So I changed the code to read the current time with clock_gettime(CLOCK_REALTIME, ...) instead:

#include <iostream>
#include <pthread.h>
#include <sys/time.h>
#include <time.h>

using namespace std;

class Ebupt
{
public:
    Ebupt();
    virtual ~Ebupt();
    void dealMsg(long wait_ns);
private:
    pthread_mutex_t mutex;
    pthread_cond_t cond;
};

Ebupt::Ebupt()
{
    pthread_mutex_init(&mutex, NULL);
    pthread_cond_init(&cond, NULL);
}

Ebupt::~Ebupt()
{
    pthread_mutex_destroy(&mutex);
    pthread_cond_destroy(&cond);
}

void Ebupt::dealMsg(long wait_ns)
{
    pthread_mutex_lock(&mutex);

    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    struct timespec abstime;

    if (now.tv_nsec + (wait_ns%1000000000) >= 1000000000)
    {
        abstime.tv_sec = now.tv_sec + wait_ns/1000000000 + 1;
        abstime.tv_nsec = (now.tv_nsec + wait_ns%1000000000)%1000000000;
    }
    else
    {
        abstime.tv_sec = now.tv_sec + wait_ns/1000000000;
        abstime.tv_nsec = now.tv_nsec + wait_ns%1000000000;
    }

    pthread_cond_timedwait(&cond, &mutex, &abstime);
    pthread_mutex_unlock(&mutex);
}

int main()
{
    Ebupt e;
    struct timeval now;
    while (true)
    {
        gettimeofday(&now, NULL);
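        cout<<"++"<<now.tv_sec<<":"<<now.tv_usec<<endl;
        e.dealMsg(200000000);   // wait 200 ms (argument is in nanoseconds)
        gettimeofday(&now, NULL);
        cout<<"--"<<now.tv_sec<<":"<<now.tv_usec<<endl;
    }
    return 0;
}
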
[ismp@cn3 20171026]$ g++ -o main main.C -lpthread -lrt
[ismp@cn3 20171026]$ ./main
++1509024234:822675
--1509024234:822733
++1509024234:822737
--1509024234:822748
++1509024234:822751
--1509024234:822761
……
(CTRL+C)

Still no blocking, so that (gettimeofday and pthread_cond_timedwait using different clocks) does not seem to be the cause.

What if I set a clock attribute on the condition variable? Like this:

#include <iostream>
#include <pthread.h>
#include <sys/time.h>
#include <time.h>

using namespace std;

class Ebupt
{
……
Ebupt::Ebupt()
{
    pthread_mutex_init(&mutex, NULL);

    pthread_condattr_t condattr;
    pthread_condattr_init(&condattr);
    pthread_condattr_setclock(&condattr, CLOCK_REALTIME);
    pthread_cond_init(&cond, &condattr);
    pthread_condattr_destroy(&condattr);
}
……(same as above)

[ismp@cn3 20171026]$ g++ -o main main.C -lpthread -lrt
[ismp@cn3 20171026]$ ./main
++1509024510:358162
--1509024510:358221
++1509024510:358225
--1509024510:358236
++1509024510:358239
--1509024510:358249
……
(CTRL+C)
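
This result is not surprising if the default clock of a condition variable is already CLOCK_REALTIME, which is what POSIX specifies as the default (the system clock). A small sketch to check this on a given machine follows; the function name default_cond_clock is hypothetical, and it only reads the clock id from a freshly initialized attribute object.

```cpp
#include <cassert>
#include <pthread.h>
#include <time.h>

// Hypothetical helper: returns the clock id stored in a freshly initialized
// (default) condition variable attribute object.
static clockid_t default_cond_clock()
{
    pthread_condattr_t attr;
    clockid_t clk;
    int rc = pthread_condattr_init(&attr);
    assert(rc == 0);
    rc = pthread_condattr_getclock(&attr, &clk);
    assert(rc == 0);
    pthread_condattr_destroy(&attr);
    return clk;
}
```

If default_cond_clock() == CLOCK_REALTIME holds, then calling pthread_condattr_setclock(&condattr, CLOCK_REALTIME) explicitly changes nothing, which would explain why this attempt behaves exactly like the previous one.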

Later I stumbled on the fact that switching clocks solves the problem: use the MONOTONIC clock.

#include <iostream>
#include <pthread.h>
#include <sys/time.h>
#include <time.h>

using namespace std;

class Ebupt
{

……

Ebupt::Ebupt()
{
    pthread_mutex_init(&mutex, NULL);

    pthread_condattr_t condattr;
    pthread_condattr_init(&condattr);
    pthread_condattr_setclock(&condattr, CLOCK_MONOTONIC);
    pthread_cond_init(&cond, &condattr);
    pthread_condattr_destroy(&condattr);
}

……

void Ebupt::dealMsg(long wait_ns)
{
    pthread_mutex_lock(&mutex);

    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    struct timespec abstime;

    if (now.tv_nsec + (wait_ns%1000000000) >= 1000000000)
    {
        abstime.tv_sec = now.tv_sec + wait_ns/1000000000 + 1;
        abstime.tv_nsec = (now.tv_nsec + wait_ns%1000000000)%1000000000;
    }
    else
    {
        abstime.tv_sec = now.tv_sec + wait_ns/1000000000;
        abstime.tv_nsec = now.tv_nsec + wait_ns%1000000000;
    }

    pthread_cond_timedwait(&cond, &mutex, &abstime);
    pthread_mutex_unlock(&mutex);
}

……
[ismp@cn3 20171026]$ g++ -o main main.C -lpthread -lrt
[ismp@cn3 20171026]$ ./main
++1509024798:440277
--1509024798:640389
++1509024798:640400
--1509024798:840413
++1509024798:840424
--1509024799:40507
++1509024799:40517
--1509024799:240565
++1509024799:240581
--1509024799:440595
(CTRL+C)

In other words, the final workaround is:

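Set the clock of the condition variable to MONOTONIC instead of REALTIME, and compute the absolute timeout from clock_gettime(CLOCK_MONOTONIC, ...) as well.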

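MONOTONIC is a monotonically increasing clock. As I understand it, on Linux it effectively represents the time elapsed since the machine booted (I have seen it described as being derived from the kernel's jiffies counter, which starts counting at boot), and it is not affected by changes to the system time;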

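REALTIME uses xtime, which is read from the hardware clock on the motherboard (RTC) at boot and can also be changed at run time by a privileged user (such as root) with commands like date. For example, suppose you set a timeout of 1 h; if, while the call is blocked, you use the date command to move the system time (also called wall time) forward by 1 h, the blocked call will time out and return immediately, just like our pthread_cond_timedwait.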
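
To make the workaround easier to reuse, the pieces above can be collected into one small class. This is a sketch under my own assumptions rather than a finished implementation: the class name MonotonicCond and the method waitFor are hypothetical, the style is C++98 to match the examples, and the key point is simply that the condition variable and the absolute timeout both use CLOCK_MONOTONIC.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <time.h>

// Hypothetical wrapper: a condition variable bound to CLOCK_MONOTONIC.
class MonotonicCond
{
public:
    MonotonicCond()
    {
        pthread_mutex_init(&mutex_, NULL);
        pthread_condattr_t attr;
        pthread_condattr_init(&attr);
        int rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
        assert(rc == 0);
        pthread_cond_init(&cond_, &attr);
        pthread_condattr_destroy(&attr);
    }

    ~MonotonicCond()
    {
        pthread_cond_destroy(&cond_);
        pthread_mutex_destroy(&mutex_);
    }

    // Waits up to wait_ns nanoseconds. Returns the pthread_cond_timedwait
    // result: ETIMEDOUT on timeout, 0 if signalled or woken spuriously.
    int waitFor(long wait_ns)
    {
        assert(wait_ns >= 0);
        const long NS_PER_SEC = 1000000000L;
        timespec abstime;
        clock_gettime(CLOCK_MONOTONIC, &abstime);  // same clock as the cond
        abstime.tv_sec  += wait_ns / NS_PER_SEC;
        abstime.tv_nsec += wait_ns % NS_PER_SEC;
        if (abstime.tv_nsec >= NS_PER_SEC)         // carry into seconds
        {
            abstime.tv_sec  += 1;
            abstime.tv_nsec -= NS_PER_SEC;
        }
        pthread_mutex_lock(&mutex_);
        int rc = pthread_cond_timedwait(&cond_, &mutex_, &abstime);
        pthread_mutex_unlock(&mutex_);
        return rc;
    }

private:
    MonotonicCond(const MonotonicCond&);             // non-copyable
    MonotonicCond& operator=(const MonotonicCond&);
    pthread_mutex_t mutex_;
    pthread_cond_t cond_;
};
```

With this, the body of dealMsg would reduce to a single call such as cond.waitFor(200000000). Because the timeout is measured on the monotonic clock, adjusting the system time with date while the call is blocked should not cut the wait short.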

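In the end I never found out what actually caused pthread_cond_timedwait to fail to block; this is only a temporary workaround that I arrived at by chance. I will look into the root cause later when I have time…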

Postscript:

I noticed that the CPU usage of the processes in production all looked somewhat abnormal:

top - 21:45:30 up 419 days,  8:00,  1 user,  load average: 8.85, 8.37, 8.38
Tasks: 238 total,   4 running, 234 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.4%us, 65.4%sy,  0.0%ni, 23.5%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Mem:  32879016k total, 32650184k used,   228832k free,   218716k buffers
Swap:  2097144k total,   749020k used,  1348124k free, 28992212k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                       
12303 sdc       20   0 3517m 864m 7176 S 344.1  2.7  5125261h java                                                                                                          
    9 root      20   0     0    0    0 S 50.5  0.0 144940:15 ksoftirqd/1                                                                                                    
   13 root      20   0     0    0    0 R 48.9  0.0 157182:21 ksoftirqd/2                                                                                                    
    4 root      20   0     0    0    0 R 46.9  0.0 153791:36 ksoftirqd/0                                                                                                    
   33 root      20   0     0    0    0 S 46.5  0.0 148379:24 ksoftirqd/7                                                                                                    
   21 root      20   0     0    0    0 R 44.2  0.0 156277:16 ksoftirqd/4                                                                                                    
   29 root      20   0     0    0    0 S 43.2  0.0 154775:19 ksoftirqd/6                                                                                                    
   17 root      20   0     0    0    0 S 30.9  0.0 174973:53 ksoftirqd/3                                                                                                    
   25 root      20   0     0    0    0 S 10.0  0.0 156328:27 ksoftirqd/5                                                                                                    
27888 www       20   0  177m 121m 1900 S  1.3  0.4   1167:11 nginx                                                                                                          
   41 root      20   0     0    0    0 S  0.3  0.0  17:36.77 events/6                                                                                                       
21937 sdc       20   0  134m 7564 1136 S  0.3  0.0  57:06.09 redis-server                                                                                                   
24218 ismp      20   0 15164 1344  944 R  0.3  0.0   0:00.01 top                                                                                                            
27890 www       20   0  180m 124m 1900 S  0.3  0.4   1163:55 nginx                                                                                                          
27891 www       20   0  170m 114m 1912 S  0.3  0.4   1069:01 nginx                                                                                                          
    1 root      20   0 19348  852  544 S  0.0  0.0   0:01.41 init                                                                                                           
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                                                       
    3 root      RT   0     0    0    0 S  0.0  0.0   1:02.86 migration/0                                                                                                    
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0     

Especially the java backend process and ksoftirqd.

My guess: does java also use condition variables under the hood, and did they fail to block there too?

Much later… I rebooted the production machine. The CPU usage of every process dropped, and the blocking failure described above never occurred again……

If anyone has seen this problem before, I would be glad to hear from you, hehe~
