Debugging an intermittent reboot caused by an out-of-bounds write

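Problem: the radar-vision (雷视) device at the Deqing site was rebooting frequently. The generated coredump files blame several different threads for the reboot, and four of them show the arming thread raising signal 11.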

  • Code relevant to the arming thread:

    int RADAR_COORDINATE_SERVER::PicServ_SDK_Recv(int sockfd,char *pbuf, UINT32 buflen, UINT32 *dwOutlen)
    {
        ...
        NETRET_HEADER *pHead = NULL;
        pHead = (NETRET_HEADER *)pbuf;
        iRet = readn(sockfd, pbuf, sizeof(NETRET_HEADER));
        // the coredump shows the crash on the following line
        dwPackLen = ntohl(pHead->length);
        ...
    }
    int RADAR_COORDINATE_SERVER::Send_UserExchange(void)
    {	
        int 	fd = -1;	
        UINT32	dwRetLen = 0;
        UINT32 	buff[PICSERV_BUFFERLEN] = {0};
        char	*pbuf = (char *)buff;
        IPCM_SDK_NETCMD_HEADER head;
        Make_SDK_Head(NETCMD_USEREXCHANGE, (char*)&head, sizeof(head), &dwRetLen, &Addr, mac);
        if(PicServ_SDK_Recv(fd, pbuf, sizeof(buff), &dwRetLen) < 0)
        {
            ...
        }
        pHead = (NETRET_HEADER *)pbuf;
        ...
    }
    
  • The coredump output is as follows:

    
    warning: Could not load shared library symbols for 14 libraries, e.g. /lib/libhisdk.so.
    Use the "info sharedlibrary" command to see the complete listing.
    Do you need "set solib-search-path" or "set sysroot"?
    Core was generated by `./hisi_app'.
    Program terminated with signal 11, Segmentation fault.
    #0  RADAR_COORDINATE_SERVER::PicServ_SDK_Recv (this=this@entry=0x3569478, sockfd=1154, pbuf=0x4 <Address 0x4 out of bounds>, pbuf@entry=0x54a05a60 "", buflen=buflen@entry=1024, dwOutlen=0x54a05940, dwOutlen@entry=0x54a05938) at dataManagement/picserv/RadarCoordinate_serv.cpp:237
    warning: Source file is more recent than executable.
    237         dwPackLen = ntohl(pHead->length);   //coredump ??????????
    (gdb) bt
    #0  RADAR_COORDINATE_SERVER::PicServ_SDK_Recv (this=this@entry=0x3569478, sockfd=1154, pbuf=0x4 <Address 0x4 out of bounds>, pbuf@entry=0x54a05a60 "", buflen=buflen@entry=1024, dwOutlen=0x54a05940, dwOutlen@entry=0x54a05938) at dataManagement/picserv/RadarCoordinate_serv.cpp:237
    #1  0x002950f4 in RADAR_COORDINATE_SERVER::AlarmUp (this=this@entry=0x3569478) at dataManagement/picserv/RadarCoordinate_serv.cpp:938
    #2  0x0029587c in RADAR_COORDINATE_SERVER::taskRadarCoordinateServer (_p=0x3569478) at dataManagement/picserv/RadarCoordinate_serv.cpp:1216
    #3  0x76e19b3c in ?? ()
    Backtrace stopped: frame did not save the PC
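  • One thing worth noting: the address held in pbuf cannot be printed correctly (although this may simply be a side effect of compiler optimization):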

    pbuf=0x4 <Address 0x4 out of bounds>
    (gdb) p dwOutlen
    $1 = (UINT32 *) 0x54a05940
    (gdb) p pbuf
    $2 = 0x4 <Address 0x4 out of bounds>
    
    • Printing dwOutlen, on the other hand, works normally.
  • Go up one stack frame and print the same two addresses:

    (gdb) up 1
    #1  0x002950f4 in RADAR_COORDINATE_SERVER::AlarmUp (this=this@entry=0x3569478) at dataManagement/picserv/RadarCoordinate_serv.cpp:938
    938             if(PicServ_SDK_Recv(m_iSockFd, pbuf, sizeof(buff), &dwRetLen) < 0)
    (gdb) p dwRetLen
    $3 = 256
    (gdb) p &dwRetLen
    $4 = (UINT32 *) 0x54a05940
    (gdb) p pbuf
    $5 = 0x54a05a60 ""
    (gdb) p &buff[0]
    $6 = (unsigned int *) 0x54a05a68
    
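  • The address of dwRetLen in this frame matches the address passed as the actual argument to the callee. pbuf is probably kept in a register, which is why it prints as 0x4; from this alone we cannot tell whether the value is actually wrong.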

    (gdb) down 1
    #0  RADAR_COORDINATE_SERVER::PicServ_SDK_Recv (this=this@entry=0x3569478, sockfd=1154, pbuf=0x4 <Address 0x4 out of bounds>, pbuf@entry=0x54a05a60 "", buflen=buflen@entry=1024, dwOutlen=0x54a05940, dwOutlen@entry=0x54a05938) at dataManagement/picserv/RadarCoordinate_serv.cpp:237
    237         dwPackLen = ntohl(pHead->length);
    (gdb) p pHead
    $8 = (NETRET_HEADER *) 0x4
    (gdb) p &pbuf
    Address requested for identifier "pbuf" which is in register $r5
    (gdb) p &pHead
    Address requested for identifier "pHead" which is in register $r5
    (gdb) p $r5
    $10 = 4

The following output deserves a closer look:

  • gdb output:
    (gdb) p pbuf
    $5 = 0x54a05a60 ""
    (gdb) p &buff[0]
    $6 = (unsigned int *) 0x54a05a68
    
  • The code reads:
    UINT32 	buff[PICSERV_BUFFERLEN] = {0};
    char	*pbuf = (char *)buff;
    
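    • These two addresses should be identical, yet the printed values differ by 8 bytes.
    • I do not know what the original author intended, and there is no comment. The file has a .cpp extension and is built with the g++ toolchain, yet it uses a C-style explicit cast. It is tempting to call that a bug, but the code has run stably for many years and this is the first failure seen at this spot; if the cast were the problem, it should have shown up long ago.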

Aside: explicit casts

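  • Section 4.11.3 of C++ Primer (5th edition) introduces the four named casts in C++: static_cast, dynamic_cast, const_cast and reinterpret_cast.
  • On reinterpret_cast, the book says roughly the following:
    Given the conversion
        int *ip;
        char *pc = reinterpret_cast<char *>(ip);
    we must never forget that the object pc actually points to is an int, not a character. Using pc as if it were an ordinary character pointer is likely to fail at run time, for example:
    string str(pc);
    ...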
    
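  • Right after that comes a warning:

    Warning: reinterpret_cast is inherently machine dependent. Using it safely requires a thorough understanding of the types involved and of how the compiler implements the cast.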

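  • How the C-style cast in the code relates to reinterpret_cast:
    • When an old-style cast is used where a const_cast or static_cast would also be legal, it behaves like the corresponding named cast. Where neither is legal, it does what a reinterpret_cast would do. The UINT32 * to char * conversion in the code therefore has the same effect as a reinterpret_cast.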
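The equivalence described above can be checked with a minimal sketch. It assumes std::uint32_t as a stand-in for the project's UINT32 typedef, and casts_agree is a hypothetical helper name that does not appear in the original code.

```cpp
#include <cstdint>

// Returns true when the old-style cast and reinterpret_cast produce the same
// pointer, and that pointer is the address of the array's first element.
// std::uint32_t stands in for UINT32 (assumption); casts_agree is hypothetical.
bool casts_agree()
{
    std::uint32_t buff[4] = {0};

    char *c_style = (char *)buff;                    // old-style cast, as in the article's code
    char *named   = reinterpret_cast<char *>(buff);  // the equivalent named cast

    // static_cast<char *>(buff) would not compile here, because
    // std::uint32_t * and char * are unrelated pointer types.
    return c_style == named &&
           static_cast<void *>(c_style) == static_cast<void *>(&buff[0]);
}
```

In a correct program the function returns true, so the cast by itself does not move the pointer away from the start of the array.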

The real cause of the reboot

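  • 1. The cast itself cannot be what triggers the reboot. The real danger is that pbuf is a pointer: if its value is changed so that it points to an invalid address, dereferencing it can cause a segmentation fault.
  • 2. Because the stack grows from high addresses toward low addresses, it is very likely that an out-of-bounds access on a variable defined after pbuf overwrote the value of pbuf.
  • 3. Before the call to PicServ_SDK_Recv() that crashes, Make_SDK_Head() is given the address of the struct head: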
    int RADAR_COORDINATE_SERVER::Make_SDK_Head(UINT32 cmdtype, char *pbuff, UINT32 buflen, UINT32 *poutlen, HPR_ADDR_T *pAddr, UINT8 *mac)
    {
        // (1) Note this check on the structure length
        usr_assert(pbuff != NULL && buflen >= sizeof(NETCMD_HEADER) && poutlen != NULL, ERROR);
        
        NETCMD_HEADER *pHead = NULL;    
        INTER_ALARM_SEARCH_COND *pRadarUpPara = NULL;
    
        pHead = (NETCMD_HEADER *)pbuff;
        // ... unrelated code omitted
        *poutlen = sizeof(NETCMD_HEADER);
        
        // (2) Given the check at (1), the writes below go out of bounds
        pRadarUpPara = (INTER_ALARM_SEARCH_COND *)(pbuff + sizeof(NETCMD_HEADER));
        pRadarUpPara->dwAlarmComm = DVR_VCA_ALARM;
        pRadarUpPara->bySupport = GET_RADARDATA_SWITCH;
        pHead->length 		= htonl(sizeof(NETCMD_HEADER) + sizeof(INTER_ALARM_SEARCH_COND));
        *poutlen 			= sizeof(NETCMD_HEADER) + sizeof(INTER_ALARM_SEARCH_COND);
        pHead->checkSum = htonl(checkByteSum((char *)&(pHead->netCmd), sizeof(NETCMD_HEADER)-12));
    
        return OK;
    }
    
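  • 4. Conclusion: Make_SDK_Head() reads and writes past the end of the struct head. That changes the value of the variable pbuf, which sits at a higher address in the same stack frame, so the later dereference of pbuf intermittently causes a segmentation fault.
  • 5. How the bug was introduced: it dates from a code port. The message header had to be extended, but neither the parameter check nor the header definition was updated in time, which produced this maddening intermittent segmentation fault.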
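One way to guard against this class of bug is to make the length check cover everything the function is about to write, not just the header. The sketch below is only an illustration under stated assumptions: DemoHeader, DemoCond, DEMO_OK, DEMO_ERROR and make_head_checked are hypothetical stand-ins, and the field layouts are invented rather than taken from the real NETCMD_HEADER or INTER_ALARM_SEARCH_COND.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the real structures; the layouts are assumed.
struct DemoHeader { std::uint32_t length; std::uint32_t checkSum; std::uint32_t netCmd; };
struct DemoCond   { std::uint32_t dwAlarmComm; std::uint8_t bySupport; std::uint8_t reserved[3]; };

constexpr int DEMO_OK = 0;
constexpr int DEMO_ERROR = -1;

// Builds header + payload into pbuff, and refuses any buffer that cannot
// hold both parts. The check uses the same size that is later written.
int make_head_checked(char *pbuff, std::size_t buflen, std::size_t *poutlen)
{
    const std::size_t needed = sizeof(DemoHeader) + sizeof(DemoCond);
    if (pbuff == nullptr || poutlen == nullptr || buflen < needed) {
        return DEMO_ERROR;  // nothing has been written to pbuff
    }

    DemoHeader head{};
    DemoCond cond{};
    head.length = static_cast<std::uint32_t>(needed);
    cond.dwAlarmComm = 1;  // placeholder values for the sketch
    cond.bySupport = 1;

    // Copy both parts; the writes stay inside [pbuff, pbuff + needed).
    std::memcpy(pbuff, &head, sizeof(head));
    std::memcpy(pbuff + sizeof(head), &cond, sizeof(cond));
    *poutlen = needed;
    return DEMO_OK;
}
```

With a check like this, a caller that passes only a header-sized buffer gets an error back instead of silently corrupting neighbouring stack variables, and the failure shows up at the call site rather than at a later dereference.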
