The symptom the customer asked about is this:
While Slony-I was running, a FATAL message appeared in the log:
FATAL storeListen: unknown node ID 3
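After this error, the subsequent log entries show that slon went back to running normally.
The customer's question is how to interpret this error message: is this behavior by design?
In other words, is it a bug or not?
Whether it is by design cannot be known from the outside; only the vendor can answer that. My approach was to analyze the source code first: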
/* ----------
 * SlonWatchdog
 * ----------
 */
static void
SlonWatchdog(void)
{
    …
    slon_log(SLON_INFO, "slon: watchdog process started\n");
    slon_log(SLON_CONFIG, "slon: watchdog ready - pid = %d\n", slon_watchdog_pid);

    slon_worker_pid = fork();
    if (slon_worker_pid == 0)
    {
        SlonMain();
        exit(-1);
    }
    …
    if (install_signal_handler(SIGUSR1, sighandler) == SIG_ERR)
    {
        slon_log(SLON_FATAL, "slon: SIGUSR1 signal handler setup failed -(%d) %s\n",
                 errno, strerror(errno));
        slon_exit(-1);
    }
    …
    slon_log(SLON_CONFIG, "slon: worker process created - pid = %d\n", slon_worker_pid);

    while (!shutdown)
    {
        while ((pid = wait(&child_status)) != slon_worker_pid)
        {
            …
        }
        …
        slon_log(SLON_CONFIG,
                 "slon: child terminated %s: %d; pid: %d, current worker pid: %d\n",
                 termination_reason, return_code, pid, slon_worker_pid);

        switch (watchdog_status)
        {
            …
            case SLON_WATCHDOG_NORMAL:
            case SLON_WATCHDOG_RETRY:
                watchdog_status = SLON_WATCHDOG_RETRY;
                if (child_status != 0)
                {
                    slon_log(SLON_CONFIG, "slon: restart of worker in 10 seconds\n");
                    (void) sleep(10);
                }
                else
                {
                    slon_log(SLON_CONFIG, "slon: restart of worker\n");
                }
                if (watchdog_status == SLON_WATCHDOG_RETRY)
                {
                    slon_worker_pid = fork();
                    if (slon_worker_pid == 0)
                    {
                        worker_restarted = 1;
                        SlonMain();
                        exit(-1);
                    }
                    …
                    watchdog_status = SLON_WATCHDOG_NORMAL;
                    continue;
                }
                break;
            default:
                shutdown = 1;
                break;
        } /* switch */
    } /* while */
    …
}
/* ----------
 * SlonMain
 * ----------
 */
static void
SlonMain(void)
{
    …
    for (i = 0, n = PQntuples(res); i < n; i++)
    {
        …
        rtcfg_storePath(pa_server, pa_conninfo, pa_connretry);
    }
    PQclear(res);
    …
}
/* ----------
 * rtcfg_storePath
 * ----------
 */
void
rtcfg_storePath(int pa_server, char *pa_conninfo, int pa_connretry)
{
    …
    /*
     * Store the (new) conninfo to the node
     */
    slon_log(SLON_CONFIG,
             "storePath: pa_server=%d pa_client=%d pa_conninfo=\"%s\" pa_connretry=%d\n",
             pa_server, rtcfg_nodeid, pa_conninfo, pa_connretry);
    …
    /*
     * Eventually start communicating with that node
     */
    rtcfg_startStopNodeThread(node);
}
/* ----------
 * rtcfg_startStopNodeThread
 * ----------
 */
static void
rtcfg_startStopNodeThread(SlonNode * node)
{
    …
    if (sched_get_status() == SCHED_STATUS_OK && node->no_active)
    {
        /*
         * Make sure the node worker exists
         */
        switch (node->worker_status)
        {
            case SLON_TSTAT_NONE:
                if (pthread_create(&(node->worker_thread), NULL,
                                   remoteWorkerThread_main, (void *)node) < 0)
                {
                    …
                }
                node->worker_status = SLON_TSTAT_RUNNING;
                break;
            …
        }
    }
    …
}
/* ----------
 * slon_remoteWorkerThread
 *
 * Listen for events on the local database connection. This means, events
 * generated by the local node only.
 * ----------
 */
void *
remoteWorkerThread_main(void *cdata)
{
    …
    while (true)
    {
        …
        else /* not SYNC */
        {
            …
            else if (strcmp(event->ev_type, "STORE_LISTEN") == 0)
            {
                …
                if (li_receiver == rtcfg_nodeid)
                    rtcfg_storeListen(li_origin, li_provider);
                …
            }
            …
        }
        …
    }
    …
}
/* ----------
 * rtcfg_storeListen
 * ----------
 */
void
rtcfg_storeListen(int li_origin, int li_provider)
{
    …
    node = rtcfg_findNode(li_provider);
    if (!node)
    {
        slon_log(SLON_FATAL, "storeListen: unknown node ID %d\n", li_provider);
        slon_retry();
        return;
    }
    …
}
#define slon_retry() \
do { \
    pthread_mutex_lock(&slon_watchdog_lock); \
    if (slon_watchdog_pid >= 0) { \
        slon_log(SLON_DEBUG2, "slon_retry() from pid=%d\n", slon_pid); \
        (void) kill(slon_watchdog_pid, SIGUSR1); \
        slon_watchdog_pid = -1; \
    } \
    pthread_mutex_unlock(&slon_watchdog_lock); \
    pthread_exit(NULL); \
} while (0)
/* ----------
 * sighandler
 * ----------
 */
static void
sighandler(int signo)
{
    switch (signo)
    {
        …
        case SIGUSR1:
            watchdog_status = SLON_WATCHDOG_RETRY;
            slon_terminate_worker();
            break;
        …
    }
}
/* ----------
 * slon_terminate_worker
 * ----------
 */
void
slon_terminate_worker()
{
    (void) kill(slon_worker_pid, SIGKILL);
}
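The above is an abridged outline of the code.
In it:
The SlonWatchdog function creates a child process with fork.
In that child process, the call chain SlonMain --> rtcfg_storePath --> rtcfg_startStopNodeThread
creates a thread, and when that thread starts it runs the remoteWorkerThread_main function.
When remoteWorkerThread_main calls rtcfg_storeListen
and the node lookup fails (rtcfg_findNode returns NULL for li_provider), the FATAL message is logged and slon_retry sends SIGUSR1 to the main process, the one running SlonWatchdog.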
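To make the failing path concrete, here is a minimal single-threaded sketch of the lookup and the one-shot retry request. It is not the Slony-I code: struct node, find_node, request_retry, store_listen, watchdog_notified and retry_requests are hypothetical names made up for illustration. The real slon_retry macro takes a mutex, calls kill() and then pthread_exit(); here a counter stands in for the signal so the sketch stays harmless to run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for slon's in-memory node list. */
struct node {
    int          no_id;
    struct node *next;
};

/* Stands in for "slon_watchdog_pid = -1": set once the watchdog was told. */
static int watchdog_notified = 0;
/* Counts how many times the (simulated) SIGUSR1 would have been sent. */
static int retry_requests = 0;

/* Walk the list; NULL means the node ID is not known locally. */
static struct node *find_node(struct node *head, int no_id)
{
    for (; head != NULL; head = head->next)
        if (head->no_id == no_id)
            return head;
    return NULL;
}

/* One-shot request: only the first caller notifies the watchdog. */
static void request_retry(void)
{
    if (!watchdog_notified) {
        watchdog_notified = 1;
        retry_requests++;   /* assumption: the real macro sends SIGUSR1 here */
    }
}

/* Returns 0 if the provider is known, -1 after formatting the FATAL text. */
int store_listen(struct node *head, int li_provider, char *msg, size_t len)
{
    assert(msg != NULL && len > 0);
    if (find_node(head, li_provider) == NULL) {
        snprintf(msg, len, "FATAL storeListen: unknown node ID %d", li_provider);
        request_retry();
        return -1;
    }
    msg[0] = '\0';
    return 0;
}
```

With a list that holds only nodes 1 and 2, store_listen(list, 3, msg, sizeof msg) returns -1 and leaves the same text as the customer's log line in msg. A second failing call does not raise retry_requests above 1, which mirrors the guard on slon_watchdog_pid in the macro.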
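On the other side:
The main process, in SlonWatchdog, has already installed sighandler as the handler for SIGUSR1.
When SIGUSR1 arrives, sighandler kills the child process described above (slon_terminate_worker sends it SIGKILL).
In addition, the main process calls wait, so the logic is already in place for the moment the child is killed or dies on its own:
inside the while loop it calls fork again, the new child runs SlonMain, and the whole cycle starts over:
if the call to rtcfg_storeListen in the new child fails again, the child dies again and the main process forks once more;
if it succeeds, the restart cycle ends and the worker carries on with normal processing, which is why the later log entries look normal again.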
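The restart cycle described above can be reproduced in a small standalone sketch. This is my own simplification under stated assumptions, not the Slony-I implementation: run_watchdog, worker_main, on_sigusr1 and the failures parameter are hypothetical, the 10 second sleep is left out, and unlike the excerpt the handler is installed before the first fork and SIGUSR1 is blocked around fork() so the handler never sees a stale pid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Pid of the current worker; -1 while no worker is running. */
static volatile pid_t worker_pid = -1;

/* Watchdog side: on SIGUSR1, kill the worker (like slon_terminate_worker). */
static void on_sigusr1(int signo)
{
    (void) signo;
    if (worker_pid > 0)
        (void) kill(worker_pid, SIGKILL);
}

/* Worker side: either ask the watchdog for a retry, or finish cleanly. */
static void worker_main(int must_fail)
{
    if (must_fail) {
        (void) kill(getppid(), SIGUSR1);   /* like slon_retry() */
        for (;;)
            pause();                       /* wait to be killed */
    }
    _exit(0);
}

/* Fork workers until one ends without being killed; return restart count. */
int run_watchdog(int failures)
{
    struct sigaction sa;
    sigset_t usr1, old;
    int restarts = 0;

    assert(failures >= 0);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);

    for (;;) {
        int status = 0;
        pid_t pid;

        /* Block SIGUSR1 until worker_pid holds the new child's pid. */
        sigprocmask(SIG_BLOCK, &usr1, &old);
        pid = fork();
        if (pid == 0)
            worker_main(restarts < failures);
        worker_pid = pid;
        sigprocmask(SIG_SETMASK, &old, NULL);
        if (pid < 0)
            return -1;

        while (waitpid(pid, &status, 0) < 0 && errno == EINTR)
            ;                              /* retry if interrupted */
        worker_pid = -1;

        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) {
            restarts++;                    /* worker was killed: fork again */
            continue;
        }
        return restarts;                   /* worker ended on its own */
    }
}
```

Calling run_watchdog(2) forks three workers in total: the first two request a retry and are killed, the third exits cleanly, and the function returns 2. From the outside this looks like what the customer saw: a failure, a restart, and then normal operation.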