(I) Introduction to Keepalived:
Keepalived is a lightweight high-availability solution for Linux. It provides high availability mainly through virtual router redundancy (VRRP), and it is very simple to deploy and use: a single configuration file is all it takes. Working at layers 3, 4, and 5 of the TCP/IP reference model, Keepalived checks the state of every service node. When a node misbehaves or fails, Keepalived detects it and removes the faulty node from the cluster; once the node recovers, Keepalived automatically rejoins it to the cluster. All of this is fully automatic; the only manual work left is repairing the failed node. (Typically two servers run Keepalived, one as the master (MASTER) and one as the backup (BACKUP), while presenting a single virtual IP to the outside world. The master keeps sending a specific advertisement to the backup; when the backup stops receiving it, that is, when the master is down, the backup takes over the virtual IP and keeps serving, which is what guarantees high availability.) Keepalived implements this with VRRP.
Keepalived is mainly used to eliminate single points of failure on servers; combined with Nginx it provides a highly available web service. (In fact, Keepalived can be paired not only with Nginx but with many other services as well.)
How Keepalived + Nginx achieves high availability:
First: requests should not hit Nginx directly; they should pass through Keepalived first (this is the so-called virtual IP, or VIP).
Second: Keepalived must be able to monitor the state of Nginx (via a user-defined script that periodically checks the Nginx process and adjusts the VRRP priority weight, so that a failed Nginx triggers a failover).
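The script-plus-weight mechanism can be sketched as a minimal keepalived configuration fragment (the names and values here are illustrative only; the full working configurations appear later in this article):

```
vrrp_script chk_nginx {
    script "/opt/nginx_pid.sh"   # user-defined health-check script
    interval 2                   # run it every 2 seconds
    weight -20                   # demote this node's priority by 20 when the check fails
}
vrrp_instance VI_1 {
    ...
    track_script {
        chk_nginx                # reference the check defined above
    }
}
```

With a negative weight, a failing check lowers the master's effective priority below the backup's, and VRRP re-elects the backup as the new VIP owner.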
(II) Installation and Configuration:
1. Build and install from source:
[root@localhost install]# cd /tmp/install
[root@localhost install]# yum install -y gcc gcc-c++ openssl* kernel-devel net-snmp* libnl*
[root@localhost install]# yum install -y libnfnetlink-devel
[root@localhost install]# wget http://www.keepalived.org/software/keepalived-1.2.19.tar.gz   ### or download it directly from the official site
[root@localhost install]# tar zxvf keepalived-1.2.19.tar.gz
[root@localhost install]# cd keepalived-1.2.19
[root@localhost keepalived-1.2.19]# ./configure --prefix=/usr/local/keepalived
[root@localhost keepalived-1.2.19]# make && make install
2. Configuration:
(1) Environment
| hostname | IP | Software installed | OS |
| --- | --- | --- | --- |
| master | 192.168.122.120 | nginx (already installed), keepalived | CentOS 6.5 |
| backup | 192.168.122.121 | nginx (already installed), keepalived | CentOS 6.5 |
| vip (virtual IP) | 192.168.122.130 | | |
Note: run the build-and-install steps above on both 192.168.122.120 and 192.168.122.121. After installation, the directories bin, etc, and sbin are created under /usr/local/keepalived; the main configuration files live under etc and the executables under sbin.
(2) Perform the following on both 120 and 121:
(a) #### At startup Keepalived reads /etc/keepalived/keepalived.conf by default, but a source build does not create that file; it is installed under the prefix instead, at /usr/local/keepalived/etc/keepalived/keepalived.conf.
To make later configuration easier, copy it into /etc/keepalived/ manually:
[root@localhost install]# mkdir /etc/keepalived/
[root@localhost install]# cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
[root@localhost install]# cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/keepalived
(b) ##### Create the service startup script
########## CentOS 6.x
[root@localhost install]# cp /tmp/keepalived-1.4.2/keepalived/etc/init.d/keepalived /etc/init.d/
[root@localhost install]# chmod +x /etc/init.d/keepalived
#### On CentOS 7.x, register it as a systemd service instead
[root@localhost install]# cp /tmp/install/keepalived-1.4.0/keepalived/keepalived.service /lib/systemd/system/keepalived.service
[root@localhost install]# systemctl enable keepalived.service
Created symlink from /etc/systemd/system/multi-user.target.wants/keepalived.service to /usr/lib/systemd/system/keepalived.service.
(c) #### Fix the path of the Keepalived options file referenced in the startup script. The script sources /etc/sysconfig/keepalived (correct for a yum install, and also correct here because we copied the file there in step (a)); the file the build actually produced lives at /usr/local/keepalived/etc/sysconfig/keepalived. If you would rather have the script source it from the install prefix, rewrite the path:
# sed -i 's#/etc/sysconfig/keepalived#/usr/local/keepalived/etc/sysconfig/keepalived#g' /etc/init.d/keepalived
[root@localhost install]#vi /etc/init.d/keepalived
#!/bin/sh
#
# Startup script for the Keepalived daemon
#
# processname: keepalived
# pidfile: /var/run/keepalived.pid
# config: /etc/keepalived/keepalived.conf
# chkconfig: - 21 79
# description: Start and stop Keepalived

# Source function library
. /etc/rc.d/init.d/functions

# Source configuration file (we set KEEPALIVED_OPTIONS there)
. /etc/sysconfig/keepalived

RETVAL=0
prog="keepalived"

start() {
    echo -n $"Starting $prog: "
    daemon keepalived ${KEEPALIVED_OPTIONS}
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog
}

stop() {
    echo -n $"Stopping $prog: "
    killproc keepalived
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog
}

reload() {
    echo -n $"Reloading $prog: "
    killproc keepalived -1
    RETVAL=$?
    echo
}

# See how we were called.
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    reload)
        reload
        ;;
    restart)
        stop
        start
        ;;
    condrestart)
        if [ -f /var/lock/subsys/$prog ]; then
            stop
            start
        fi
        ;;
    status)
        status keepalived
        RETVAL=$?
        ;;
    *)
        echo "Usage: $0 {start|stop|reload|restart|condrestart|status}"
        RETVAL=1
esac

exit $RETVAL
(d) Create a symlink for /usr/local/keepalived/sbin/keepalived; without it the service fails to start:
[root@localhost sbin]# /etc/init.d/keepalived start
Starting keepalived: /bin/bash: keepalived: command not found  [FAILED]
[root@localhost sbin]# ln -s /usr/local/keepalived/sbin/keepalived /usr/bin/
[root@localhost sbin]# /etc/init.d/keepalived start
Starting keepalived: [  OK  ]
[root@localhost sbin]#
(3) Configure keepalived.conf. Three sections matter most: global_defs, vrrp_instance, and virtual_ipaddress.
[root@localhost sbin]# vi /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {                   # global settings
    router_id LVS_DEVEL         # router identifier; must be unique within the LAN, can be customized
}

vrrp_script chk_http_port {     # define the check script, its interval, and the priority adjustment here first,
    script "/opt/nginx_pid.sh"  # then reference it inside the vrrp_instance below, much like defining a function before calling it
    interval 2                  # how often the script runs, in seconds
    weight 10                   # combined with priority so that a failed check demotes the master to backup
}

vrrp_instance VI_1 {            # a vrrp instance; the name is arbitrary
    state MASTER                # initial state, MASTER or BACKUP; once the other node's keepalived starts, the node with the higher priority is elected MASTER anyway, so this has little practical effect
    interface eth0              # NIC with a fixed IP, used to send and receive VRRP packets; unless mcast_src_ip is set, this NIC's IP is used as the multicast source
    virtual_router_id 51        # virtual router ID, 0-255; distinguishes the VRRP multicast groups of different instances. Nodes sharing a VRID form one group, and the VRID determines the multicast MAC address
    priority 99                 # priority of this keepalived node; the higher number wins
    advert_int 1                # advertisement interval; must match on both nodes, default 1 second
    authentication {            # inter-node authentication; must be identical on all nodes
        auth_type PASS          # authentication type
        auth_pass 1111          # password
    }
    track_script {              # reference the check script defined above
        chk_http_port
    }
    virtual_ipaddress {         # the VIP; bound to the interface above by default, or to another NIC via "dev ethX"
        192.168.122.130
    }
}
Do the same on 121:
vi /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
    router_id LVS_DEVEL
}

vrrp_script chk_http_port {
    script "/opt/nginx_pid.sh"
    interval 2
    weight 10
}

vrrp_instance VI_1 {
    state BACKUP                # BACKUP here
    interface eth0
    virtual_router_id 51
    priority 96                 # lower than the MASTER's
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    track_script {
        chk_http_port           # the script name defined above, not a file path
    }
    virtual_ipaddress {
        192.168.122.130
    }
}
(4) On both servers, create the health-check script referenced by vrrp_script above (nginx_pid.sh restarts nginx automatically when it stops; if nginx cannot be restarted, it kills keepalived so the VIP can fail over):
[root@zabbix ~]# cat /opt/nginx_pid.sh
#!/bin/bash
# Count running nginx processes
A=`ps -C nginx --no-header | wc -l`
if [ $A -eq 0 ]; then
    # nginx is down: try to start it
    /usr/local/nginx/sbin/nginx
    sleep 3
    # still down after 3 seconds: kill keepalived so the backup takes over the VIP
    if [ `ps -C nginx --no-header | wc -l` -eq 0 ]; then
        killall keepalived
    fi
fi
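Before wiring the script into keepalived, it helps to sanity-check the process-count pipeline it relies on. The sketch below is a stand-in that counts a `sleep` process instead of nginx, so it can run on any Linux box without nginx installed:

```shell
# Stand-in for the check in nginx_pid.sh: count processes by name with
# `ps -C`, using a short-lived sleep process instead of nginx.
sleep 5 &
SLEEP_PID=$!
A=$(ps -C sleep --no-header | wc -l)
if [ "$A" -eq 0 ]; then
    echo "no sleep process found"
else
    echo "found $A sleep process(es)"
fi
kill "$SLEEP_PID"
```

If this prints a nonzero count, the same pipeline will correctly report nginx processes on your system.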
(5) Start keepalived on both 120 and 121:
[root@hadoop1 ~]# /etc/init.d/keepalived start
Starting keepalived: [  OK  ]
[root@hadoop1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:05:7b:31 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.120/24 brd 192.168.122.255 scope global eth0
    inet 192.168.122.130/32 scope global eth0
    inet6 fe80::5054:ff:fe05:7b31/64 scope link
       valid_lft forever preferred_lft forever
[root@hadoop2 ~]# /etc/init.d/keepalived start
Starting keepalived: [  OK  ]
[root@hadoop2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:c6:65:80 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.121/24 brd 192.168.122.255 scope global eth0
    inet6 fe80::5054:ff:fec6:6580/64 scope link
       valid_lft forever preferred_lft forever
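Rather than eyeballing the full `ip a` listing, you can grep for the VIP to decide which node currently owns it. The sketch below parses a captured sample of the output above (SAMPLE is a stand-in for a live `ip a show eth0`):

```shell
# SAMPLE stands in for live output of: ip a show eth0
SAMPLE="inet 192.168.122.120/24 brd 192.168.122.255 scope global eth0
inet 192.168.122.130/32 scope global eth0"

# Does this node carry the VIP 192.168.122.130?
if echo "$SAMPLE" | grep -q "192.168.122.130"; then
    HOLDS_VIP=yes
    echo "this node holds the VIP"
else
    HOLDS_VIP=no
    echo "the VIP is on the other node"
fi
```

Only the MASTER should report the VIP while both keepalived daemons are healthy.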
Open a browser and check:
(6) Verify the results:
Test 1: stop the keepalived service on 120 and see whether failover happens:
[root@hadoop1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:05:7b:31 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.120/24 brd 192.168.122.255 scope global eth0
    inet6 fe80::5054:ff:fe05:7b31/64 scope link
       valid_lft forever preferred_lft forever
Now check the IP addresses on 121:
[root@hadoop2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:c6:65:80 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.121/24 brd 192.168.122.255 scope global eth0
    inet 192.168.122.130/32 scope global eth0
    inet6 fe80::5054:ff:fec6:6580/64 scope link
       valid_lft forever preferred_lft forever
[root@hadoop2 ~]#
Failover succeeded. Open the browser again:
Test 2: start keepalived on 120 again and check whether the virtual IP 130 moves back:
[root@hadoop1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:05:7b:31 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.120/24 brd 192.168.122.255 scope global eth0
    inet 192.168.122.130/32 scope global eth0
    inet6 fe80::5054:ff:fe05:7b31/64 scope link
       valid_lft forever preferred_lft forever
[root@hadoop1 ~]#
It switched back. Now look at 121's addresses again:
[root@hadoop2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:c6:65:80 brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.121/24 brd 192.168.122.255 scope global eth0
    inet6 fe80::5054:ff:fec6:6580/64 scope link
       valid_lft forever preferred_lft forever
[root@hadoop2 ~]#
Check the browser:
The switch-back worked correctly.
Note: splitting out the Keepalived log file
If anything goes wrong, check the Keepalived log. By default Keepalived writes to /var/log/messages; since that file grows large, it is worth routing the Keepalived log to its own file. Steps:
1. Edit /etc/sysconfig/keepalived and /usr/local/keepalived/etc/sysconfig/keepalived:
Change KEEPALIVED_OPTIONS="-D" to KEEPALIVED_OPTIONS="-D -d -S 0"
[root@hadoop1 sysconfig]# vi /etc/sysconfig/keepalived
# Options for keepalived. See `keepalived --help' output and keepalived(8) and
# keepalived.conf(5) man pages for a list of all options. Here are the most
# common ones :
#
# --vrrp               -P    Only run with VRRP subsystem.
# --check              -C    Only run with Health-checker subsystem.
# --dont-release-vrrp  -V    Dont remove VRRP VIPs & VROUTEs on daemon stop.
# --dont-release-ipvs  -I    Dont remove IPVS topology on daemon stop.
# --dump-conf          -d    Dump the configuration data.
# --log-detail         -D    Detailed log messages.
# --log-facility       -S    0-7 Set local syslog facility (default=LOG_DAEMON)
#
KEEPALIVED_OPTIONS="-D -d -S 0"

[root@zabbix ~]# vim /usr/local/keepalived/etc/sysconfig/keepalived
(make the identical change in this file: KEEPALIVED_OPTIONS="-D -d -S 0")
2. Configure syslog by appending to the end of /etc/rsyslog.conf:
# keepalived -S 0
local0.*    /etc/keepalived/keepalived.log
[root@hadoop1 sysconfig]# vi /etc/rsyslog.conf
# rsyslog v5 configuration file
# For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
# If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html

#### MODULES ####

$ModLoad imklog   # provides kernel logging support (previously done by rklogd)
#$ModLoad immark  # provides --MARK-- message capability

# Provides UDP syslog reception
#$ModLoad imudp
#$UDPServerRun 514

# Provides TCP syslog reception
#$ModLoad imtcp
#$InputTCPServerRun 514

#### GLOBAL DIRECTIVES ####

# Use default timestamp format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

# not useful and an extreme performance hit
#$ActionFileEnableSync on

# ### begin forwarding rule ###
# The statement between the begin ... end define a SINGLE forwarding
# rule. They belong together, do NOT split them. If you create multiple
# forwarding rules, duplicate the whole block!
# Remote Logging (we use TCP for reliable delivery)
#
# An on-disk queue is created for this action. If the remote host is
# down, messages are spooled to disk and sent when it is up again.
#$WorkDirectory /var/lib/rsyslog # where to place spool files
#$ActionQueueFileName fwdRule1   # unique name prefix for spool files
#$ActionQueueMaxDiskSpace 1g     # 1gb space limit (use as much as possible)
#$ActionQueueSaveOnShutdown on   # save messages to disk on shutdown
#$ActionQueueType LinkedList     # run asynchronously
#$ActionResumeRetryCount -1      # infinite retries if host is down
# remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
#*.* @@remote-host:514
# ### end of the forwarding rule ###

#keepalived -S 0
local0.*    /etc/keepalived/keepalived.log
3. Restart the logger and keepalived; messages sent to facility local0 now land in /etc/keepalived/keepalived.log:
[root@hadoop1 sysconfig]# /etc/init.d/rsyslog restart
Shutting down system logger: [  OK  ]
Starting system logger: [  OK  ]
[root@hadoop1 sysconfig]# /etc/init.d/keepalived restart
Stopping keepalived: [  OK  ]
Starting keepalived: [  OK  ]
[root@hadoop1 keepalived]# tail -f /etc/keepalived/keepalived.log
Jan 15 17:37:27 hadoop1 Keepalived_vrrp[2008]: VRRP_Instance(VI_1) Received lower prio advert, forcing new election
Jan 15 17:37:27 hadoop1 Keepalived_vrrp[2008]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 192.168.122.130
###### On CentOS 7.4, use:
[root@server98 ~]# systemctl restart rsyslog
[root@server98 ~]# ll /etc/keepalived/
total 8
-rw-r--r-- 1 root root  677 Feb  1 11:31 keepalived.conf
-rw-r--r-- 1 root root 3550 Feb  1 10:52 keepalived.conf.bak
[root@server98 ~]# systemctl restart keepalived
The log is now written as expected.
Common problems: split-brain
In a high-availability (HA) system, when the heartbeat link between the two nodes breaks, a system that used to act as one coordinated whole splits into two independent nodes. Having lost contact, each assumes the other has failed, and the HA software on both sides behaves like a "split brain": both fight over the shared resources and try to bring up the application services. The consequences are severe: either the shared resources get carved up and neither side can start the service, or both sides start the service and write to the shared storage at the same time, corrupting data (a classic case is a database's online redo logs getting corrupted).
The commonly agreed countermeasures against HA split-brain are roughly the following:
1) Add redundant heartbeat links, e.g. two separate lines (make the heartbeat itself highly available), to reduce the chance of split-brain;
2) Use disk locks. The side that is serving locks the shared disk, so when split-brain happens the other side simply cannot grab the shared disk resource. Disk locks have a real drawback, though: if the side holding the shared disk never releases the lock voluntarily, the other side can never obtain it. In practice, if the serving node suddenly crashes, it cannot run the unlock command, and the standby node can never take over the shared resources and services. For this reason some HA designs use a "smart" lock: the serving side only engages the disk lock when it sees all heartbeat links go down (it can no longer perceive the peer); during normal operation no lock is held.
3) Set up an arbitration mechanism. For example, configure a reference IP (such as the gateway IP). When the heartbeat links are completely down, each node pings the reference IP; a failed ping means the break is on the local side. In that case not only the heartbeat but also the local link used for serving is broken, so starting (or continuing) the application service would be pointless, and the node voluntarily gives up the contest, letting the side that can ping the reference IP run the service. To be safer still, the side that cannot ping the reference IP can simply reboot itself, fully releasing any shared resources it might still be holding.
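The reference-IP arbitration in 3) can be sketched as a small script. The gateway address below is an assumption drawn from this article's test network; a real deployment would substitute its own gateway:

```shell
#!/bin/bash
# Ping-based arbitration sketch: if the reference IP (usually the gateway)
# is unreachable, assume the local link is broken and give up the VIP.
REF_IP=${REF_IP:-192.168.122.1}   # assumed gateway for this article's network

if ping -c 2 -W 1 "$REF_IP" > /dev/null 2>&1; then
    DECISION=compete   # local link is fine: stay in the VRRP election
else
    DECISION=yield     # local link is down: give up the VIP (or reboot)
fi
echo "arbitration decision: $DECISION"
```

Such a check could be run from a vrrp_script, so that a node with a dead uplink demotes itself instead of fighting for the VIP.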
1. Causes of split-brain
In general, split-brain happens for the following reasons:
1. The heartbeat link between the HA server pair fails, so the nodes can no longer communicate:
the heartbeat cable is bad (broken or aged);
the NIC or its driver is broken, or the IP configuration conflicts (direct NIC-to-NIC connection);
a device on the heartbeat path fails (NIC or switch);
the arbitration machine fails (in arbitration-based designs).
2. An iptables firewall on the HA servers blocks the heartbeat traffic.
3. The heartbeat NIC address or related settings are misconfigured, so heartbeats fail to send.
4. Other misconfiguration, such as mismatched heartbeat methods, heartbeat broadcast conflicts, or software bugs.
Tip: in Keepalived, if the two ends of the same VRRP instance configure different virtual_router_id values, that also leads to split-brain.
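Following the tip above, a virtual_router_id mismatch is easy to catch with a quick check before starting keepalived. The sketch below compares two illustrative config snippets (the /tmp paths and file contents are assumptions for the demo; in practice you would fetch the peer's real config, e.g. over ssh):

```shell
# Write two demo config snippets (stand-ins for the real files on each node).
mkdir -p /tmp/ka_check
printf 'vrrp_instance VI_1 {\n    virtual_router_id 51\n}\n' > /tmp/ka_check/master.conf
printf 'vrrp_instance VI_1 {\n    virtual_router_id 52\n}\n' > /tmp/ka_check/backup.conf

# Extract the VRID from each file and compare.
m=$(awk '/virtual_router_id/ {print $2}' /tmp/ka_check/master.conf)
b=$(awk '/virtual_router_id/ {print $2}' /tmp/ka_check/backup.conf)
if [ "$m" = "$b" ]; then
    echo "virtual_router_id consistent: $m"
else
    echo "MISMATCH: master=$m backup=$b (this can cause split-brain)"
fi
```

The demo files deliberately disagree, so the script reports the mismatch; run the same comparison against your real configs before bringing the pair up.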
2. Common solutions
In production, split-brain can be guarded against from the following angles:
1. Use a serial cable and an Ethernet cable at the same time, i.e., two heartbeat links, so that if one breaks the other still carries the heartbeat.
2. When split-brain is detected, forcibly shut down one node (this requires special hardware support, such as STONITH/fencing devices): in effect, when the backup stops receiving heartbeats, it sends a shutdown command over a separate channel to cut the master's power.
3. Monitor and alert on split-brain (email, SMS, on-call staff) so a human can arbitrate as soon as the problem appears, limiting the damage. For example, Baidu's alert SMS distinguishes uplink from downlink: the alert reaches the administrator's phone, and the administrator can reply with a number or a short string that is relayed back to the server, which then handles the fault automatically according to the instruction; this shortens the time to recovery.
Of course, when rolling out an HA solution, decide from real business needs whether such losses are tolerable. For typical website workloads they usually are.
3. Monitoring split-brain with Zabbix:
Monitor on the backup server. The VIP shows up on the backup in two cases:
1) split-brain has occurred;
2) a normal master-to-backup failover took place.
[root@server98 ~]# more /opt/keepalived.sh
#!/bin/bash
while true
do
    if [ `ip a show eth0 | grep 192.168.99.100 | wc -l` -ne 0 ]
    then
        echo "keepalived is error!"
    else
        echo "keepalived is OK !"
    fi
    sleep 5    # pause between checks to avoid a busy loop
done
Key scripts and configuration files
#### Configuration on the master:
[root@zabbix ~]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
    notification_email {
        #[email protected]
    }
    #notification_email_from [email protected]
    #smtp_server 127.0.0.1
    #smtp_connect_timeout 30
    router_id LVS_DEVEL
}

vrrp_script chk_http_port {
    script "/opt/nginx_pid.sh"
    interval 1
    weight -20
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 123456
    }
    virtual_ipaddress {
        192.168.99.100
    }
    track_script {
        chk_http_port
    }
}
##### Configuration on the backup:
[root@server98 ~]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
    notification_email {
        #[email protected]
    }
    #notification_email_from [email protected]
    #smtp_server 127.0.0.1
    #smtp_connect_timeout 30
    router_id LVS_DEVEL
}

vrrp_script chk_http_port {
    script "/opt/nginx_pid.sh"
    interval 1
    weight -20
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 50
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 123456
    }
    virtual_ipaddress {
        192.168.99.100
    }
    track_script {
        chk_http_port
    }
}
##### Nginx auto-restart script:
[root@server98 ~]# more /opt/nginx_pid.sh
#!/bin/bash
A=`ps -C nginx --no-header | wc -l`
if [ $A -eq 0 ]; then
    /usr/local/nginx/sbin/nginx
    sleep 3
    if [ `ps -C nginx --no-header | wc -l` -eq 0 ]; then
        killall keepalived
    fi
fi
##### Split-brain detection script run on the backup for zabbix:
[root@server98 ~]# more /opt/keepalived.sh
#!/bin/bash
while true
do
    if [ `ip a show eth0 | grep 192.168.99.100 | wc -l` -ne 0 ]
    then
        echo "keepalived is error!"
    else
        echo "keepalived is OK !"
    fi
    sleep 5    # pause between checks to avoid a busy loop
done
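The loop above runs continuously; for a zabbix item it is often more convenient to have a one-shot variant that prints a single value per invocation (this is a hypothetical addition, not from the original setup). The `ip a` output is stubbed so the sketch is self-contained:

```shell
# One-shot VIP check: print 1 if this node holds the VIP, else 0.
# IPA_OUT is a stand-in for live output of: ip a show eth0
IPA_OUT="inet 192.168.99.98/24 brd 192.168.99.255 scope global eth0
inet 192.168.99.100/32 scope global eth0"

count=$(echo "$IPA_OUT" | grep -c '192\.168\.99\.100')
echo "$count"
```

A matching agent line might look like `UserParameter=keepalived.vip,/opt/vip_check.sh` (the key and path are illustrative); a value of 1 on the backup then triggers the "VIP on backup" alert, to be interpreted as either failover or split-brain.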
Detecting split-brain from the command line:
[root@server98 ~]# arping 192.168.99.100
ARPING 192.168.99.100 from 192.168.99.98 eth0
Unicast reply from 192.168.99.100 [00:15:5D:63:EC:09]  0.738ms
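A single replying MAC, as above, is the healthy case. During a split-brain both nodes answer ARP for the VIP, so the replies contain more than one MAC address. The sketch below counts distinct MACs in captured arping output (ARPING_OUT is a stand-in; the second MAC is an invented example to simulate the failure):

```shell
# Count how many different MACs answer ARP for the VIP.
# ARPING_OUT stands in for live output of: arping -c 5 192.168.99.100
ARPING_OUT="Unicast reply from 192.168.99.100 [00:15:5D:63:EC:09]  0.738ms
Unicast reply from 192.168.99.100 [52:54:00:C6:65:80]  0.812ms"

macs=$(echo "$ARPING_OUT" | grep -o '\[[0-9A-F:]*\]' | sort -u | wc -l)
if [ "$macs" -gt 1 ]; then
    echo "possible split-brain: $macs different MACs answer for the VIP"
else
    echo "single owner: no split-brain detected"
fi
```

Since the sample contains two distinct MACs, the sketch reports a possible split-brain; against a healthy cluster the live arping output would yield a count of 1.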