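Saturday the weather was bad and I got called in to work overtime writing documentation, which put me in a foul mood; Sunday was bright and sunny, so I cheerfully strolled in to work overtime chasing a bug.
Problem: main calls the DPDK static-library function rte_eal_remote_launch, passing the callback function pointer capture_core together with config, a pointer to the struct that capture_core needs. Once capture_core is invoked, however, the pointer member mutex of the config that main set up holds the wrong value, so __sync_bool_compare_and_swap (config->mutex, lock, 1) dies with a segmentation fault.
The struct: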
struct core_capture_config {
    struct rte_ring * ring[RING_MAX];
    bool volatile * stop_condition;
    struct core_capture_stats * stats;
    uint8_t port;
    uint8_t queue;
    unsigned int ring_num;
    hashtable_t *ht;
    int *mutex;
    int i;
    unsigned long bond_ip;
};
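For context, the failing call is a compare-and-swap on the int that config->mutex points to, which is the usual way to build a small spinlock out of the GCC __sync builtins. Below is a minimal sketch of that pattern, not the original program's code: the names sketch_lock, sketch_unlock and sketch_lock_demo are hypothetical, and it assumes the lock variable in the original call holds 0 (meaning "free"). The point it illustrates is that the builtin dereferences the pointer, so a corrupted config->mutex faults right at this call.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helpers for illustration; not taken from the original program. */

/* Spin until *mutex changes from 0 (free, an assumption) to 1 (held). */
void sketch_lock(int *mutex)
{
    /* The builtin reads and writes *mutex, so a bad pointer crashes here. */
    while (!__sync_bool_compare_and_swap(mutex, 0, 1))
        ;
}

/* Release the lock: stores 0 with release semantics. */
void sketch_unlock(int *mutex)
{
    __sync_lock_release(mutex);
}

/* Returns 0 if the flag is 1 while held and 0 after release. */
int sketch_lock_demo(void)
{
    int *mutex = calloc(1, sizeof *mutex);   /* starts out as 0 = free */
    assert(mutex != NULL);

    sketch_lock(mutex);
    int held = *mutex;

    sketch_unlock(mutex);
    int released = *mutex;

    free(mutex);
    return (held == 1 && released == 0) ? 0 : 1;
}
```

With a valid pointer sketch_lock_demo() should return 0; the same sketch_lock call on a garbage pointer is where a segmentation fault would be expected.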
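Line of thinking:
1. Is some other thread corrupting the memory?
I stopped every other thread, but the problem was exactly the same.
2. Use valgrind to see where memory is being misused?
With DPDK in the program, though, valgrind would not even start: ERROR: This system does not support "RDRAND". Searching for "valgrind dpdk" turned up claims that this is a valgrind limitation, but the patch offered as a fix did not work for me either.
On top of that, I found that when inspected from inside main, the members of config had not changed at all.
3. I could not make sense of it. Could rte_eal_remote_launch be using a copy of config rather than the config I passed in?
I compared the two config addresses in gdb and they were identical...
That more or less ruled out the DPDK library, and I told myself that any problem had to be in my own code (as if a bug in a library like DPDK, which comes from Intel, were that easy to stumble on). Still, it is open source, so I read it anyway.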
lib/librte_eal/linuxapp/eal/eal_thread.c:
Key functions:
/*
 * Send a message to a slave lcore identified by slave_id to call a
 * function f with argument arg. Once the execution is done, the
 * remote lcore switch in FINISHED state.
 */
int
rte_eal_remote_launch(int (*f)(void *), void *arg, unsigned slave_id)
{
    int n;
    char c = 0;
    int m2s = lcore_config[slave_id].pipe_master2slave[1];
    int s2m = lcore_config[slave_id].pipe_slave2master[0];

    if (lcore_config[slave_id].state != WAIT)
        return -EBUSY;

    lcore_config[slave_id].f = f;
    lcore_config[slave_id].arg = arg;

    /* send message */
    n = 0;
    while (n == 0 || (n < 0 && errno == EINTR))
        n = write(m2s, &c, 1);
    if (n < 0)
        rte_panic("cannot write on configuration pipe\n");

    /* wait ack */
    do {
        n = read(s2m, &c, 1);
    } while (n < 0 && errno == EINTR);
    if (n <= 0)
        rte_panic("cannot read on configuration pipe\n");

    return 0;
}

/* main loop of threads */
__attribute__((noreturn)) void *
eal_thread_loop(__attribute__((unused)) void *arg)
{
    char c;
    int n, ret;
    unsigned lcore_id;
    pthread_t thread_id;
    int m2s, s2m;
    char cpuset[RTE_CPU_AFFINITY_STR_LEN];

    thread_id = pthread_self();

    /* retrieve our lcore_id from the configuration structure */
    RTE_LCORE_FOREACH_SLAVE(lcore_id) {
        if (thread_id == lcore_config[lcore_id].thread_id)
            break;
    }
    if (lcore_id == RTE_MAX_LCORE)
        rte_panic("cannot retrieve lcore id\n");

    m2s = lcore_config[lcore_id].pipe_master2slave[0];
    s2m = lcore_config[lcore_id].pipe_slave2master[1];

    /* set the lcore ID in per-lcore memory area */
    RTE_PER_LCORE(_lcore_id) = lcore_id;

    /* set CPU affinity */
    if (eal_thread_set_affinity() < 0)
        rte_panic("cannot set affinity\n");

    ret = eal_thread_dump_affinity(cpuset, RTE_CPU_AFFINITY_STR_LEN);
    RTE_LOG(DEBUG, EAL, "lcore %u is ready (tid=%x;cpuset=[%s%s])\n",
        lcore_id, (int)thread_id, cpuset, ret == 0 ? "" : "...");

    /* read on our pipe to get commands */
    while (1) {
        void *fct_arg;

        /* wait command */
        do {
            n = read(m2s, &c, 1);
        } while (n < 0 && errno == EINTR);
        if (n <= 0)
            rte_panic("cannot read on configuration pipe\n");

        lcore_config[lcore_id].state = RUNNING;

        /* send ack */
        n = 0;
        while (n == 0 || (n < 0 && errno == EINTR))
            n = write(s2m, &c, 1);
        if (n < 0)
            rte_panic("cannot write on configuration pipe\n");

        if (lcore_config[lcore_id].f == NULL)
            rte_panic("NULL function pointer\n");

        /* call the function and store the return value */
        fct_arg = lcore_config[lcore_id].arg;
        ret = lcore_config[lcore_id].f(fct_arg);
        lcore_config[lcore_id].ret = ret;
        rte_wmb();
        lcore_config[lcore_id].state = FINISHED;
    }

    /* never reached */
    /* pthread_exit(NULL); */
    /* return NULL; */
}
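As the code shows, it does nothing extra to the arg parameter: it stores the pointer and later hands it to the callback unchanged. Yet when I added print statements here, I found that the value of the config->mutex pointer was already different the moment execution entered rte_eal_remote_launch. Truly bizarre...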
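To convince myself of how little happens to arg, the handshake can be boiled down to a standalone sketch with one pthread and two pipes. This is only an illustration of the pattern in the listing, not DPDK code: struct worker_slot, worker_main, launch_once, demo_callback and demo_roundtrip are hypothetical names, the worker serves a single command instead of looping, and the state field and rte_wmb() are left out.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for one lcore_config[] entry. */
struct worker_slot {
    int m2s[2];            /* master -> slave pipe */
    int s2m[2];            /* slave -> master pipe */
    int (*f)(void *);
    void *arg;
    int ret;
};

/* Worker: wait for one command byte, ack it, then call f(arg). */
static void *worker_main(void *p)
{
    struct worker_slot *w = p;
    char c;
    ssize_t n;

    do {
        n = read(w->m2s[0], &c, 1);
    } while (n < 0 && errno == EINTR);
    close(w->m2s[0]);

    if (n == 1) {
        do {
            n = write(w->s2m[1], &c, 1);
        } while (n < 0 && errno == EINTR);
    }
    close(w->s2m[1]);      /* master sees EOF if no ack was sent */

    if (n == 1)
        w->ret = w->f(w->arg);   /* arg is passed through untouched */
    return NULL;
}

/* Master: store f and arg, signal the worker, wait for the ack.
 * Returns the callback's result, or -1 on failure. */
int launch_once(int (*f)(void *), void *arg)
{
    struct worker_slot w = { .f = NULL, .arg = NULL, .ret = -1 };
    pthread_t tid;
    char c = 0;
    ssize_t n;

    assert(f != NULL);
    if (pipe(w.m2s) < 0)
        return -1;
    if (pipe(w.s2m) < 0) {
        close(w.m2s[0]); close(w.m2s[1]);
        return -1;
    }
    if (pthread_create(&tid, NULL, worker_main, &w) != 0) {
        close(w.m2s[0]); close(w.m2s[1]);
        close(w.s2m[0]); close(w.s2m[1]);
        return -1;
    }

    w.f = f;               /* stored before the command byte is sent */
    w.arg = arg;

    do {
        n = write(w.m2s[1], &c, 1);
    } while (n < 0 && errno == EINTR);
    close(w.m2s[1]);       /* worker sees EOF if the write failed */

    if (n == 1) {
        do {
            n = read(w.s2m[0], &c, 1);
        } while (n < 0 && errno == EINTR);
    }
    close(w.s2m[0]);

    pthread_join(tid, NULL);
    return (n == 1) ? w.ret : -1;
}

/* Demo: the callback checks that the pointer member arrived intact. */
static int demo_flag;
struct demo_config { int *mutex; };

int demo_callback(void *arg)
{
    struct demo_config *cfg = arg;
    return cfg->mutex == &demo_flag;
}

int demo_roundtrip(void)
{
    struct demo_config cfg = { .mutex = &demo_flag };
    return launch_once(demo_callback, &cfg);
}
```

When caller and callback are compiled with the same definition of the struct, demo_roundtrip() should return 1, which matches what the DPDK source suggests: the launch mechanism itself only moves a void pointer around.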
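4. Do the static library and main see different address spaces for malloc'ed memory?
I ran into this once on Windows when using DLLs, but I could not find anything comparable for Linux. Besides, this code used to work, and the way it is called is the standard one.
5. Further debugging in gdb showed that the struct member values looked scrambled: the values of adjacent members seemed to be spliced together.
Adjacent members bleeding into each other made me suspect an alignment problem, so I printed the address of each variable in gdb: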
Outside the function call