While setting up unified Puppet management for Ganglia, I found that the gmond service did not start properly after a sync:

# puppetd --test
info: Retrieving plugin
info: Loading facts in /var/lib/puppet/lib/facter/haproxyrunning.rb
info: Loading facts in /var/lib/puppet/lib/facter/kernel_mod_ip_conntrack.rb
info: Loading facts in /var/lib/puppet/lib/facter/kernel_mod_nf_conntrack.rb
info: Caching catalog for taurus-bj20.internal.gexing.com
info: Applying configuration version '1377502166'
notice: /Stage[main]/Common/Exec[find /var/lib/puppet/clientbucket -name paths -execdir cat {} \; -execdir pwd \; -execdir date -r {} +"%F %T" \; -exec echo \; > /var/log/puppet/clientbucket.log]/returns: executed successfully
notice: /Stage[main]/Ganglia::Service/Service[gmond]/ensure: ensure changed 'stopped' to 'running'
notice: Finished catalog run in 9.40 seconds

The sync itself succeeded, and the output shows the service being started, but no gmond process was actually running. Checking the status:

# service gmond status
gmond dead but subsys locked

Checking the log:

# tail -f /var/log/messages 
Aug 26 15:29:27 taurus-bj20 puppet-agent[22030]: Applying configuration version '1377502166'
Aug 26 15:29:33 taurus-bj20 puppet-agent[22030]: (/Stage[main]/Common/Exec[find /var/lib/puppet/clientbucket -name paths -execdir cat {} \; -execdir pwd \; -execdir date -r {} +"F T" \; -exec echo \; > /var/log/puppet/clientbucket.log]/returns) executed successfully
Aug 26 15:29:36 taurus-bj20 /usr/sbin/gmond[22620]: Error creating UDP server on port 8649 bind=taurus-bj20. Exiting.#012
Aug 26 15:29:36 taurus-bj20 puppet-agent[22030]: (/Stage[main]/Ganglia::Service/Service[gmond]/ensure) ensure changed 'stopped' to 'running'
Aug 26 15:29:37 taurus-bj20 puppet-agent[22030]: Finished catalog run in 9.40 seconds
Aug 26 15:30:03 taurus-bj20 /usr/sbin/gmond[22684]: Error creating UDP server on port 8649 bind=taurus-bj20. Exiting.#012
Aug 26 15:30:30 taurus-bj20 /usr/sbin/gmond[22755]: Error creating UDP server on port 8649 bind=taurus-bj20. Exiting.#012
Aug 26 15:30:31 taurus-bj20 /usr/sbin/gmond[22774]: Error creating UDP server on port 8649 bind=taurus-bj20. Exiting.#012
Aug 26 15:30:32 taurus-bj20 /usr/sbin/gmond[22793]: Error creating UDP server on port 8649 bind=taurus-bj20. Exiting.#012
Aug 26 15:30:33 taurus-bj20 /usr/sbin/gmond[22812]: Error creating UDP server on port 8649 bind=taurus-bj20. Exiting.#012
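
Given the bind error above, a natural first check is whether something else already holds UDP port 8649. The following is only a minimal sketch of such a check, assuming a Linux host where /proc/net/udp is readable. The function name udp_port_in_use and its optional second argument (an alternative file in the same format) are hypothetical and not part of Ganglia or Puppet.

```shell
#!/bin/sh
# udp_port_in_use PORT [FILE]
# Succeeds (exit 0) if some socket listed in FILE is bound to local UDP port PORT.
# FILE defaults to /proc/net/udp, which lists IPv4 sockets only
# (IPv6 sockets are in /proc/net/udp6).
udp_port_in_use() {
    # The kernel prints local addresses as HEXIP:HEXPORT, with the port
    # as four uppercase hex digits, e.g. 8649 -> 21C9.
    hex=$(printf '%04X' "$1")
    awk -v p="$hex" '
        NR > 1 {                       # skip the header line
            n = split($2, a, ":")      # $2 is the local_address column
            if (a[n] == p) found = 1
        }
        END { exit !found }
    ' "${2:-/proc/net/udp}"
}
```

It could be used as: udp_port_in_use 8649 && echo "8649 is taken" || echo "8649 is free"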

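I checked the port, and 8649 was not in use, so the cause was most likely a local network problem (name resolution), because my node definition fills in the bind address with the hostname reported by Puppet's facter: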


udp_recv_channel {
 port = 8649
 bind = taurus-bj20
}

Meanwhile, the hosts file contained:

172.16.2.19   taurus-bj19
172.16.2.20   taurus-bj19     # careless mistake: the hostname is wrong here, it should be taurus-bj20

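That was the problem: the hostname taurus-bj20 could not be resolved. After correcting the hosts entry and syncing again, the gmond process started.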
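
A mistake like this is easy to catch by asking what a hosts file actually maps a given name to. The sketch below is one way to do that with awk. The function name lookup_host is hypothetical, and it only reads a hosts-format file; it does not consult DNS or any other name service.

```shell
#!/bin/sh
# lookup_host FILE NAME
# Prints every IP address that FILE maps NAME to, one per line.
# Exits non-zero when NAME has no entry at all.
lookup_host() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                 # drop trailing comments
        {
            # field 1 is the IP address, the remaining fields are names
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; found = 1 }
        }
        END { exit !found }
    ' "$1"
}
```

It could be used as: lookup_host /etc/hosts "$(hostname)" || echo "no hosts entry for this hostname". With the broken file shown above, taurus-bj20 has no entry, while taurus-bj19 maps to two different addresses, which is a second hint that one of the lines is wrong.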


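Summary: the process failed to start because gmond could not create its local UDP listener on port 8649. The bind hostname in gmond's configuration could not be resolved, so the bind failed even though the port itself was free.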