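Recently my company needed to consolidate the management of two salt-masters, so I turned one of the masters into a salt-syndic. Installing "salt-syndic" automatically upgraded salt-master and salt-minion, and I then restarted the master and the syndic. Something unknown went wrong along the way and left the salt-master's "/var/run/salt/master/publish_pull.ipc" file corrupted. As a result Salt could not start or communicate properly, and the file could not be deleted. In the end I renamed the directory and restarted salt-master, and the service returned to normal.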
Step 1
This happened after an online upgrade of salt-master. After the restart, the master process was running, but it could not open its ports, and Salt behaved as if the master had not started.
Running commands produced errors:
[root@ntest1 ~]# tail -f /var/log/salt/master
2015-11-16 18:53:10,166 [salt.client ][ERROR ][3103] Unable to connect to the salt master publisher at /var/run/salt/master
2015-11-16 18:54:02,895 [salt.client ][ERROR ][3763] Unable to connect to the salt master publisher at /var/run/salt/master
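Step 2
After stopping the service and starting it again, startup failed with an "Address already in use" error, so I checked whether the ports were really occupied.
[root@ntest1 salt]# /etc/init.d/salt-master start
Starting salt-master daemon: WARNING: Unable to bind socket, error: [Errno 98] Address already in use
The ports are not available to bind
[FAILED]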
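One way to check whether a port is really still bound is to filter the listener table. The following is a minimal sketch, not something from the original troubleshooting session: `port_in_use` is a hypothetical helper that reads `ss -ltn` (or `netstat -ltn`) output on stdin and succeeds if any field ends in `:<port>`. It assumes the default Salt ports, 4505 for publishing and 4506 for returns.

```shell
#!/bin/sh
# Hypothetical helper (not part of Salt): succeed if the listener table
# read from stdin contains an address ending in ":<port>".
port_in_use() {
    port="$1"
    awk -v p="$port" '
        # Skip the header line, then scan every field so the check works
        # with both `ss -ltn` and `netstat -ltn` column layouts.
        NR > 1 {
            for (i = 1; i <= NF; i++)
                if ($i ~ (":" p "$")) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on the master:
#   ss -ltn | port_in_use 4505 && echo "4505 is still bound"
#   ss -ltn | port_in_use 4506 && echo "4506 is still bound"
```

If a port shows up as bound right after stopping the service, an old master process may still be shutting down, and waiting a moment before starting again may be enough.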
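Step 3
Starting it once more succeeded:
[root@ntest1 salt]# /etc/init.d/salt-master start
Starting salt-master daemon: [ OK ]
However, running a command still failed. At this point it was clear that salt-master was not healthy, and that the cause was the publisher under the "/var/run/salt/master" directory:
[root@ntest1 salt]# salt '*' test.ping
[ERROR ] Unable to connect to the salt master publisher at /var/run/salt/master
The salt master could not be contacted. Is master running?
After enabling debug logging, the log showed the following, which confirmed it: the Publisher process dies right after starting the puller on publish_pull.ipc and is restarted about once per second.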
2015-11-16 19:02:59,379 [salt.utils.process ][DEBUG ][6976] Started 'salt.master.<type 'type'>.Publisher' with pid 7627
2015-11-16 19:02:59,381 [salt.master ][INFO ][7627] Starting the Salt Publisher on tcp://0.0.0.0:4505
2015-11-16 19:02:59,382 [salt.master ][INFO ][7627] Starting the Salt Puller on ipc:///var/run/salt/master/publish_pull.ipc
2015-11-16 19:02:59,391 [salt.utils.process ][INFO ][6976] Process <class 'salt.master.Publisher'> (7627) died with exit status None, restarting...
2015-11-16 19:03:00,395 [salt.utils.process ][DEBUG ][6976] Started 'salt.master.<type 'type'>.Publisher' with pid 7630
2015-11-16 19:03:00,396 [salt.master ][INFO ][7630] Starting the Salt Publisher on tcp://0.0.0.0:4505
2015-11-16 19:03:00,397 [salt.master ][INFO ][7630] Starting the Salt Puller on ipc:///var/run/salt/master/publish_pull.ipc
2015-11-16 19:03:00,406 [salt.utils.process ][INFO ][6976] Process <class 'salt.master.Publisher'> (7630) died with exit status None, restarting...
2015-11-16 19:03:01,409 [salt.utils.process ][DEBUG ][6976] Started 'salt.master.<type 'type'>.Publisher' with pid 7633
2015-11-16 19:03:01,411 [salt.master ][INFO ][7633] Starting the Salt Publisher on tcp://0.0.0.0:4505
2015-11-16 19:03:01,412 [salt.master ][INFO ][7633] Starting the Salt Puller on ipc:///var/run/salt/master/publish_pull.ipc
2015-11-16 19:03:01,421 [salt.utils.process ][INFO ][6976] Process <class 'salt.master.Publisher'> (7633) died with exit status None, restarting...
2015-11-16 19:03:02,424 [salt.utils.process ][DEBUG ][6976] Started 'salt.master.<type 'type'>.Publisher' with pid 7636
2015-11-16 19:03:02,426 [salt.master ][INFO ][7636] Starting the Salt Publisher on tcp://0.0.0.0:4505
2015-11-16 19:03:02,427 [salt.master ][INFO ][7636] Starting the Salt Puller on ipc:///var/run/salt/master/publish_pull.ipc
2015-11-16 19:03:02,435 [salt.utils.process ][INFO ][6976] Process <class 'salt.master.Publisher'> (7636) died with exit status None, restarting...
2015-11-16 19:03:03,439 [salt.utils.process ][DEBUG ][6976] Started 'salt.master.<type 'type'>.Publisher' with pid 7639
2015-11-16 19:03:03,441 [salt.master ][INFO ][7639] Starting the Salt Publisher on tcp://0.0.0.0:4505
2015-11-16 19:03:03,442 [salt.master ][INFO ][7639] Starting the Salt Puller on ipc:///var/run/salt/master/publish_pull.ipc
2015-11-16 19:03:03,512 [salt.utils.process ][INFO ][6976] Process <class 'salt.master.Publisher'> (7639) died with exit status None, restarting...
2015-11-16 19:03:04,516 [salt.utils.process ][DEBUG ][6976] Started 'salt.master.<type 'type'>.Publisher' with pid 7642
2015-11-16 19:03:04,517 [salt.master ][INFO ][7642] Starting the Salt Publisher on tcp://0.0.0.0:4505
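A restart loop like the one above can be spotted quickly by counting the "died" lines instead of reading the log by eye. This is a minimal sketch; `count_publisher_deaths` is a hypothetical helper name, and the pattern is based only on the log lines shown in this post, so other Salt versions may word the message differently.

```shell
#!/bin/sh
# Hypothetical helper: count "Publisher ... died" lines in a master log
# read from stdin. The pattern matches the log format shown in this post.
count_publisher_deaths() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so `|| true` keeps the function's exit status clean.
    grep -c "Publisher'> (.*) died with exit status" || true
}

# Usage on the master:
#   count_publisher_deaths < /var/log/salt/master
```

A count that keeps growing while the master is otherwise idle suggests the Publisher is crashing on startup rather than failing once.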
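Note here:
Listing the files in the directory showed that "publish_pull.ipc" was indeed corrupted. These files are created by Salt at startup, so deleting them and restarting should bring things back to normal.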
[root@ntest1 master]# ll
ls: cannot access publish_pull.ipc: Input/output error
total 0
srwxrwxrwx 1 root root 0 Nov 16 19:06 master_event_pub.ipc
srwxrwxrwx 1 root root 0 Nov 16 19:06 master_event_pull.ipc
s????????? ? ? ? ? ? publish_pull.ipc
srwxrwxrwx 1 root root 0 Nov 16 19:06 workers.ipc
[root@ntest1 master]# ps -ef|grep salt
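The same check can be scripted. The sketch below is an assumption-laden illustration, not a tested recovery tool: `list_unstatable` is a hypothetical helper that prints directory entries whose metadata cannot be read at all, which is how the `s?????????` entry above presents itself. It relies on `[ -e ]` and `[ -L ]` both failing for such an entry, and I could not reproduce a real I/O error to confirm that on every filesystem.

```shell
#!/bin/sh
# Hypothetical helper: print entries in a directory that can be listed
# but not stat'ed (neither stat nor lstat succeeds).
list_unstatable() {
    dir="$1"
    for entry in "$dir"/*; do
        # An unmatched glob stays literal; skip it so an empty
        # directory reports nothing.
        [ "$entry" = "$dir/*" ] && continue
        # -e fails when stat fails; -L still succeeds for dangling
        # symlinks, so only entries with unreadable metadata are printed.
        if [ ! -e "$entry" ] && [ ! -L "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
}

# Usage on the master:
#   list_unstatable /var/run/salt/master
```

On a healthy master the function prints nothing; any path it does print is a candidate for the kind of corruption described here.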
Step 4
I tried to delete the file, but no method could remove "publish_pull.ipc". As a fallback, I renamed the "master" directory instead,
and the salt-master service returned to normal (sometimes the file cannot be dealt with this way, and then Step 5 is needed).
[root@ntest1 salt]# rm master -rf
rm: cannot remove `master/publish_pull.ipc': Input/output error
[root@ntest1 salt]# ls
master minion
[root@ntest1 salt]# mv master 123
[root@ntest1 salt]# ll
total 8
drwxrwxrwx 2 root root 4096 Nov 16 19:57 123
drwxrwxrwx 2 root root 4096 Nov 16 19:07 minion
[root@ntest1 salt]# /etc/init.d/salt-master start
Starting salt-master daemon: [ OK ]
[root@ntest1 salt]# salt '*' test.ping
ntest1.dianjoy.com:
    True
ntest2.dianjoy.com:
    True
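The rename workaround can be wrapped in a small function so the set-aside directory gets a recognizable, timestamped name instead of an arbitrary one like "123". This is a minimal sketch under stated assumptions: `set_aside` is a hypothetical helper, the `.broken.<timestamp>` suffix is my own naming choice, and it relies on salt-master recreating its socket directory on startup, as it did in the session above.

```shell
#!/bin/sh
# Hypothetical helper: move a directory out of the way under a
# timestamped name and print the new path.
set_aside() {
    src="$1"
    dest="${src}.broken.$(date +%Y%m%d%H%M%S)"
    mv -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage on the master (as root, ideally with salt-master stopped):
#   set_aside /var/run/salt/master && /etc/init.d/salt-master start
```

Keeping the old directory around under a clear name also makes it easier to find and clean up later, for example after a reboot or a filesystem check.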