cloudera-scm-server dead but pid file exists: analysis and fix

After the cloudera-scm-server service is started, it may die on its own after a while, and a status check then reports "cloudera-scm-server dead but pid file exists".

This post covers the case where the root cause is that cloudera-scm-server-db did not start properly.


【Process】

cloudera-scm-server starts, then dies on its own after a while:

[root@gyvm-4 data]# service cloudera-scm-server start
Starting cloudera-scm-server:                              [  OK  ]
[root@gyvm-4 data]# 
[root@gyvm-4 data]# service cloudera-scm-server status
cloudera-scm-server (pid  60761) is running...
[root@gyvm-4 data]# service cloudera-scm-server status
cloudera-scm-server (pid  60761) is running...
[root@gyvm-4 data]# service cloudera-scm-server status
cloudera-scm-server (pid  60761) is running...
[root@gyvm-4 data]# service cloudera-scm-server status
cloudera-scm-server dead but pid file exists

At this point I wanted to do a full restart of both cloudera-scm-server-db and cloudera-scm-server,

but cloudera-scm-server-db could not be restarted:

[root@gyvm-4 data]# service cloudera-scm-server-db stop
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down

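server-db cannot be stopped because a leftover pid file (postmaster.pid) makes status report the wrong state. After that file is deleted, status shows that server-db is in fact already stopped.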

[root@gyvm-4 data]# cd /var/lib/cloudera-scm-server-db/data
[root@gyvm-4 data]# service cloudera-scm-server-db status
pg_ctl: server is running (PID: 17378)
/usr/bin/postgres "-D" "/var/lib/cloudera-scm-server-db/data"
[root@gyvm-4 data]# rm postmaster.pid
rm: remove regular file `postmaster.pid'? y
[root@gyvm-4 data]# service cloudera-scm-server-db status
pg_ctl: no server running
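
Before deleting postmaster.pid by hand, it can help to check whether the PID recorded in it still belongs to a live process. The following is only a minimal sketch: pid_file_state is a hypothetical helper name, not something shipped with Cloudera or PostgreSQL, and it assumes the script runs as root (as in the transcript above) so that kill -0 is not blocked by permissions.

```shell
#!/bin/sh
# Sketch: classify a pid file as missing, invalid, running, or stale.
# pid_file_state is a hypothetical helper, not a Cloudera or PostgreSQL tool.
pid_file_state() {
    pid_file=$1
    if [ ! -f "$pid_file" ]; then
        echo "missing"
        return 0
    fi
    # The first line of postmaster.pid holds the postmaster's PID.
    pid=$(head -n 1 "$pid_file" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*)
            echo "invalid"
            return 0
            ;;
    esac
    # kill -0 delivers no signal; it only tests whether the process exists.
    # Assumption: run as root, so a permission error is not mistaken for "gone".
    if kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "stale"
    fi
}

# Example (path taken from the transcript above):
#   pid_file_state /var/lib/cloudera-scm-server-db/data/postmaster.pid
```

Only a result of "stale" suggests that the file is a leftover. If the result is "running", a postgres process with that PID is still alive, and removing the file would only hide it from pg_ctl.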

Starting server-db now fails:

[root@gyvm-4 data]# service cloudera-scm-server-db start
DB initialization done.
waiting for server to start...............................................................could not start server

The log shows that TCP/IP port 7432 is already in use:

[root@gyvm-4 cloudera-scm-server]# tail db.log 
LOG:  could not bind IPv4 socket: Address already in use
HINT:  Is another postmaster already running on port 7432? If not, wait a few seconds and retry.
LOG:  could not bind IPv6 socket: Address already in use
HINT:  Is another postmaster already running on port 7432? If not, wait a few seconds and retry.
WARNING:  could not create listen socket for "*"
FATAL:  could not create any TCP/IP sockets
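
When the log is long, the conflicting port can be pulled out of the HINT lines instead of being read by eye. This is a small sketch under the assumption that the log uses the same English message format shown above; ports_in_use_from_log is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print each distinct port named in PostgreSQL
# "Is another postmaster already running on port N?" hints.
# Reads log text on stdin. ports_in_use_from_log is a hypothetical helper.
ports_in_use_from_log() {
    sed -n 's/.*already running on port \([0-9][0-9]*\).*/\1/p' | sort -u
}

# Example, run from the directory that holds db.log:
#   ports_in_use_from_log < db.log
```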

Kill the process occupying that port:

[root@gyvm-4 cloudera-scm-server]# netstat -ntp | grep 7432
tcp        0      0 192.168.1.17:7432           192.168.1.17:49784          ESTABLISHED 37118/postgres      
tcp        0      0 192.168.1.17:7432           192.168.1.8:35818           ESTABLISHED 36807/postgres      
tcp        0      0 192.168.1.17:7432           192.168.1.17:49779          ESTABLISHED 37060/postgres      
tcp        0      0 192.168.1.17:49783          192.168.1.17:7432           ESTABLISHED 36306/java          
tcp        0      0 192.168.1.17:7432           192.168.1.8:35813           ESTABLISHED 36778/postgres      
tcp        0      0 192.168.1.17:49779          192.168.1.17:7432           ESTABLISHED 36306/java          
tcp        0      0 192.168.1.17:49784          192.168.1.17:7432           ESTABLISHED 36306/java          
tcp        0      0 192.168.1.17:49778          192.168.1.17:7432           ESTABLISHED 36306/java          
tcp        0      0 192.168.1.17:7432           192.168.1.17:49778          ESTABLISHED 37059/postgres      
tcp        0      0 192.168.1.17:7432           192.168.1.8:35814           ESTABLISHED 36779/postgres      
tcp        0      0 192.168.1.17:7432           192.168.1.8:35817           ESTABLISHED 36804/postgres      
tcp        0      0 192.168.1.17:7432           192.168.1.17:49783          ESTABLISHED 37117/postgres      
[root@gyvm-4 cloudera-scm-server]# kill -9 37118
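
The netstat -ntp output above lists established connections only. Adding the -l flag (netstat -nltp) lists listening sockets instead, which is usually the more direct way to see which process is bound to the port. The sketch below extracts the PIDs whose local port matches from either kind of output; pids_on_local_port is a hypothetical helper name, and the column layout is assumed to match the net-tools output shown above.

```shell
#!/bin/sh
# Sketch: read `netstat -ntp` or `netstat -nltp` output on stdin and print
# the PIDs of processes whose LOCAL address uses the given port.
# pids_on_local_port is a hypothetical helper, not a standard tool.
pids_on_local_port() {
    port=$1
    # Assumed columns: proto recv-q send-q local foreign state pid/program
    awk -v port="$port" '
        $1 ~ /^tcp/ {
            n = split($4, addr, ":")      # last piece of the local address is the port
            if (addr[n] == port) {
                split($7, proc, "/")      # "37118/postgres" -> "37118"
                if (proc[1] ~ /^[0-9]+$/) print proc[1]
            }
        }' | sort -un
}

# Example:
#   netstat -nltp | pids_on_local_port 7432
```

A plain kill (SIGTERM) gives the process a chance to shut down cleanly; kill -9 as used above is better kept as a last resort, since a database process gets no chance to clean up.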

Start server-db again, which now succeeds, then start the server, which also succeeds:

[root@gyvm-4 data]# service cloudera-scm-server-db start
DB initialization done.
waiting for server to start.... done
server started

[root@gyvm-4 data]# service cloudera-scm-server start
Starting cloudera-scm-server:                              [  OK  ]

The Cloudera management UI is now accessible again.
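
Since the original failure only showed up some time after a successful start, a single status check right after starting proves little. The sketch below repeats a check several times and fails on the first non-zero exit. stays_up is a hypothetical helper name, and it assumes the init script's status action returns a non-zero exit code when the service is dead, which is the usual LSB convention but is not shown in the transcript.

```shell
#!/bin/sh
# Sketch: run a check command repeatedly and fail as soon as it fails once.
# stays_up is a hypothetical helper. Usage: stays_up CHECKS INTERVAL CMD [ARGS...]
stays_up() {
    checks=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$checks" ]; do
        if ! "$@" >/dev/null 2>&1; then
            echo "check $((i + 1)) of $checks failed"
            return 1
        fi
        i=$((i + 1))
        # Pause between checks, but not after the last one.
        if [ "$i" -lt "$checks" ]; then
            sleep "$interval"
        fi
    done
    echo "passed $checks checks"
}

# Example: check once every 30 seconds, 10 times in total.
#   stays_up 10 30 service cloudera-scm-server status
```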

 

【Conclusion】

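The root cause is that cloudera-scm-server-db had not started properly but had left behind its pid file, postmaster.pid.

As a result, checking the status of cloudera-scm-server-db gave a misleading answer: it reported the service as running.

With the database in that state, every attempt to start cloudera-scm-server failed.

cloudera-scm-server-db itself failed to start because the port it needs was already in use.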

