This host has been out of contact with the Cloudera Manager Server for too long. This host is not in contact with the Host Monitor

Project scenario:

A big data cluster is monitored and managed through CDH (Cloudera Manager).


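Problem description:

One host lost contact with the Cloudera Manager Server for too long. The role instances (component instances) on that host then stopped, which seriously affected cluster stability and job execution. Restarting the agent service on that host afterwards also failed.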

[root@dn hadoop-yarn]# systemctl status cloudera-scm-agent
● cloudera-scm-agent.service - LSB: Cloudera SCM Agent
   Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-agent; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2021-06-22 11:18:10 CST; 1min 16s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 29729 ExecStart=/etc/rc.d/init.d/cloudera-scm-agent start (code=exited, status=203/EXEC)

Jun 22 11:18:10 dn systemd[1]: Starting LSB: Cloudera SCM Agent...
Jun 22 11:18:10 dn systemd[1]: cloudera-scm-agent.service: control process exited, code=exited status=203
Jun 22 11:18:10 dn systemd[1]: Failed to start LSB: Cloudera SCM Agent.
Jun 22 11:18:10 dn systemd[1]: Unit cloudera-scm-agent.service entered failed state.
Jun 22 11:18:10 dn systemd[1]: cloudera-scm-agent.service failed.
Warning: cloudera-scm-agent.service changed on disk. Run 'systemctl daemon-reload' to reload units.

The error message was:
/usr/sbin/cmf-agent: line 48: /usr/lib64/cmf/agent/build/env/bin/cmf-agent: No such file or directory

[23/Jun/2021 10:22:30 +0000] 7814 MainThread agent        INFO     Missing database jar: /usr/share/java/mysql-connector-java.jar (normal, if you're not using this database type)
[23/Jun/2021 10:22:30 +0000] 7814 MainThread agent        INFO     Missing database jar: /usr/share/java/oracle-connector-java.jar (normal, if you're not using this database type)
[23/Jun/2021 10:22:30 +0000] 7814 MainThread agent        INFO     Found database jar: /usr/share/cmf/lib/postgresql-9.0-801.jdbc4.jar
[23/Jun/2021 10:22:30 +0000] 7814 Dummy-1 daemonize    WARNING  Stopping daemon.

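Root cause analysis:

The cloudera-scm-agent script under /etc/rc.d/init.d/ on this host had inexplicably disappeared, which was very strange. I then found that files under /usr/lib64/cmf/agent/ were missing as well. Running "netstat -apn | grep 7180" showed that nothing was listening on port 7180 either, which I took to mean that this host was no longer under CDH's control. (Note that 7180 is the default port of the Cloudera Manager Server web UI, so it normally listens only on the Server host; by default the agent itself listens on port 9000 and sends heartbeats to the Server on port 7182.)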
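The check for missing files described above can be sketched as a small shell helper. This is only an illustration: `report_missing` is a hypothetical name, and the paths in the usage comment are the ones that appear in the error output earlier in this post, not a complete list of what the agent needs.

```shell
#!/bin/sh
# Hypothetical helper: print every path from the argument list
# that does not exist on this host.
report_missing() {
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            printf 'MISSING %s\n' "$path"
        fi
    done
}

# Usage on the affected host (paths taken from the output above):
# report_missing /etc/rc.d/init.d/cloudera-scm-agent \
#     /usr/sbin/cmf-agent \
#     /usr/lib64/cmf/agent/build/env/bin/cmf-agent
```

An empty result means all listed paths are present; it says nothing about whether their contents are intact.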


Solution:

After a flurry of activity, I copied the files from another server into /usr/lib64/cmf/agent/, copied mysql-connector-java.jar and oracle-connector-java.jar into /usr/share/java/, and started the agent service again.
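Before copying, it can help to see exactly which files are missing compared with a healthy node. The sketch below assumes the healthy node's /usr/lib64/cmf/agent/ has already been copied to a local staging directory (the path /tmp/agent-ref is hypothetical) and that both nodes run the same agent version. `list_missing_files` is an illustrative name, not part of Cloudera's tooling.

```shell
#!/bin/sh
# Hypothetical helper: list files (as relative paths) that exist under
# the reference tree but not under the target tree.
list_missing_files() {
    ref_dir=$1
    target_dir=$2
    ref_list=$(mktemp)
    target_list=$(mktemp)
    (cd "$ref_dir" && find . -type f | sort) > "$ref_list"
    (cd "$target_dir" && find . -type f | sort) > "$target_list"
    # comm -23 keeps lines that appear only in the first (reference) list.
    comm -23 "$ref_list" "$target_list"
    rm -f "$ref_list" "$target_list"
}

# Usage (the staging path /tmp/agent-ref is an assumption):
# list_missing_files /tmp/agent-ref /usr/lib64/cmf/agent
```

This only compares file names, not contents, so a file that exists but is corrupted would not be reported.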

[root@dn cloudera-scm-agent]# /etc/init.d/cloudera-scm-agent start
Starting cloudera-scm-agent (via systemctl):               [  OK  ]

The agent service now started normally, but the Cloudera Manager Server still had no contact with this host, so I went on to check the logs.


[root@dn cloudera-scm-agent]# tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log
[23/Jun/2021 14:52:02 +0000] 11784 MainThread agent        INFO     Using Host ID: 1cea0f69-35c4-405d-9a79-9786a0aae310
[23/Jun/2021 14:52:02 +0000] 11784 MainThread agent        INFO     Using directory: /run/cloudera-scm-agent
[23/Jun/2021 14:52:02 +0000] 11784 MainThread agent        INFO     Using supervisor binary path: /usr/lib64/cmf/agent/d
[23/Jun/2021 14:52:02 +0000] 11784 MainThread agent        INFO     Neither verify_cert_file nor verify_cert_dir are co validation of server certificates in HTTPS communication. These options can be configured in this agent's config.ini fe validation.
[23/Jun/2021 14:52:02 +0000] 11784 MainThread agent        INFO     Agent Logging Level: INFO
[23/Jun/2021 14:52:02 +0000] 11784 MainThread agent        INFO     No command line vars
[23/Jun/2021 14:52:02 +0000] 11784 MainThread agent        INFO     Found database jar: /usr/share/java/mysql-connector
[23/Jun/2021 14:52:02 +0000] 11784 MainThread agent        INFO     Found database jar: /usr/share/java/oracle-connecto
[23/Jun/2021 14:52:02 +0000] 11784 MainThread agent        INFO     Found database jar: /usr/share/cmf/lib/postgresql-9
[23/Jun/2021 14:52:02 +0000] 11784 Dummy-1 daemonize    WARNING  Stopping daemon.

This result left me stumped again. After another round of searching all over the web, the following message gave me an idea:

Neither verify_cert_file nor verify_cert_dir are co validation of server certificates in HTTPS communication. These options can be configured in this agent's config.ini fe validation.
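When the log is long, it may be easier to look only at the lines that are not INFO. The sketch below assumes every line follows the layout seen in the excerpts above (timestamp in brackets, PID, thread, module, level, message), so that the level is the seventh whitespace-separated field. Only INFO and WARNING appear in this post; matching ERROR and CRITICAL is an assumption based on the usual Python logging level names. `non_info_lines` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print agent log lines whose level field
# (assumed to be field 7) is WARNING, ERROR, or CRITICAL.
non_info_lines() {
    awk '$7 == "WARNING" || $7 == "ERROR" || $7 == "CRITICAL"' "$1"
}

# Usage:
# non_info_lines /var/log/cloudera-scm-agent/cloudera-scm-agent.log
```

In this incident the filter would have shown only the "Stopping daemon." warning, which is why the hint had to come from an INFO line instead.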

Looking for config.ini, I found only similarly named files under /etc/cloudera-scm-agent:

[root@dn cloudera-scm-agent]# ll
total 36
-rw-r--r-- 1 root root 8894 Jun 23 14:51 config.ini.orig
-rw-r--r-- 1 root root 8879 May 18  2018 config.ini.rpmsave

Comparing with the other worker nodes, I found that /etc/cloudera-scm-agent was missing the configuration file config.ini, so I copied it over from another server. I also found that the config.ini.orig file listed above had server_host=localhost, and that was the real problem (the reason the agent could not connect to the Cloudera Manager Server). I changed it to the Server's real host.
Then confirm once more that server_host in config.ini is set to the Server host's IP.
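The server_host check can also be scripted. The sketch below assumes config.ini contains a plain `server_host=value` line, as in the files described above; `get_server_host` and `check_server_host` are hypothetical helper names. Flagging localhost as wrong is only appropriate on hosts that do not also run the Cloudera Manager Server.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first uncommented
# server_host line in the given config file.
get_server_host() {
    sed -n 's/^[[:space:]]*server_host[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Hypothetical helper: fail if server_host is unset or points at the
# local machine (an assumption that fits agent-only hosts).
check_server_host() {
    host=$(get_server_host "$1")
    case "$host" in
        '')
            echo "server_host is not set in $1"
            return 1 ;;
        localhost|127.0.0.1)
            echo "server_host is $host, not the Cloudera Manager Server"
            return 1 ;;
        *)
            echo "server_host=$host"
            return 0 ;;
    esac
}

# Usage:
# check_server_host /etc/cloudera-scm-agent/config.ini
```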

Start the agent again:

[root@dn cloudera-scm-agent]# /etc/rc.d/init.d/cloudera-scm-agent start

Watch the log:

[root@dn cloudera-scm-agent]# tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log

The agent service started normally, loaded its configuration and data, and printed normal log output. Done at last, after two days of work. When you feel truly helpless, the logs can sometimes give you hope: there is light at the end of the tunnel!

Other:
Start the agent
/etc/rc.d/init.d/cloudera-scm-agent start
service cloudera-scm-agent start
Check the agent status
service cloudera-scm-agent status
Stop the agent
service cloudera-scm-agent stop
Restart the agent
service cloudera-scm-agent restart
Clean start the agent
service cloudera-scm-agent clean_start
Hard stop the agent
service cloudera-scm-agent hard_stop
Hard restart the agent
service cloudera-scm-agent hard_restart
