Redis Sentinel

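【0】For how Redis master-slave replication is set up, see: Redis master-slave replication

1) Role of master-slave replication: in a one-master, multi-slave Redis deployment, the slaves provide redundant copies of the data and allow read/write splitting.

2) Sentinel, available since Redis 2.8: automates system monitoring and failure recovery (failover).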


【1】Start the Sentinel process

1) Edit the Sentinel configuration file

Add a sentinel.conf file to the Redis home directory and set the master it should monitor, as follows:

sentinel monitor mymaster 192.168.186.100 6379 1 

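Here, mymaster is the name of the monitored master; it is an arbitrary label that is bound to the address 192.168.186.100. 6379 is the master's port, and 1 is the quorum: the minimum number of Sentinels that must agree the master is down before it is marked objectively down.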
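To make the four fields of the directive concrete, here is a minimal Python sketch that splits the line into typed values. The names `MonitorDirective` and `parse_monitor` are made up for this illustration; they are not part of Redis or of any client library.

```python
from typing import NamedTuple


class MonitorDirective(NamedTuple):
    name: str    # arbitrary label for the monitored master
    host: str    # IP address of the master
    port: int    # port of the master
    quorum: int  # Sentinels that must agree before the master is objectively down


def parse_monitor(line: str) -> MonitorDirective:
    """Split a 'sentinel monitor <name> <ip> <port> <quorum>' line into typed fields."""
    parts = line.split()
    if len(parts) != 6 or parts[:2] != ["sentinel", "monitor"]:
        raise ValueError(f"not a sentinel monitor directive: {line!r}")
    name, host, port, quorum = parts[2:]
    return MonitorDirective(name, host, int(port), int(quorum))


directive = parse_monitor("sentinel monitor mymaster 192.168.186.100 6379 1")
print(directive)
```

With a single Sentinel, as in this walkthrough, a quorum of 1 is the only value that can ever be reached.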

(The full conf file is included at the end of this post.)

2) Start Sentinel:

[pacoson@localhost redis-4.0.8]$ redis-sentinel sentinel.conf 
3256:X 09 Mar 07:46:54.108 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
3256:X 09 Mar 07:46:54.108 # Redis version=4.0.8, bits=32, commit=00000000, modified=0, pid=3256, just started
3256:X 09 Mar 07:46:54.108 # Configuration loaded
3256:X 09 Mar 07:46:54.111 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
3256:X 09 Mar 07:46:54.111 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
3256:X 09 Mar 07:46:54.111 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
3256:X 09 Mar 07:46:54.113 # Warning: 32 bit instance detected but no memory limit set. Setting 3 GB maxmemory limit with 'noeviction' policy now.
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 4.0.8 (00000000/0) 32 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in sentinel mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 26379
 |    `-._   `._    /     _.-'    |     PID: 3256
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

3256:X 09 Mar 07:46:54.115 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
3256:X 09 Mar 07:46:54.142 # Sentinel ID is 285f3e0832faf2c4c94a7ce89c2011fe76beb797
3256:X 09 Mar 07:46:54.142 # +monitor master mymaster 192.168.186.100 6379 quorum 1
3256:X 09 Mar 07:46:54.147 * +slave slave 192.168.186.100:6380 192.168.186.100 6380 @ mymaster 192.168.186.100 6379
3256:X 09 Mar 07:46:54.151 * +slave slave 192.168.186.100:6381 192.168.186.100 6381 @ mymaster 192.168.186.100 6379
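【Notes】

1) The last four lines of the startup log are the important ones:

Sentinel ID: the unique ID of this Sentinel.

+monitor: the name, IP address, and port of the master being monitored, plus the quorum.

+slave: a newly discovered slave. Here Sentinel has found two of them, on ports 6380 and 6381.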
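If you want to pick these events out of the log programmatically, the sketch below parses one log line into its timestamp, event name, and details. The regular expression is inferred only from the log lines shown in this post (Redis 4.0.8), and `SentinelEvent` / `parse_event` are names invented for the example, so treat it as an assumption rather than a description of an official log format.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the log lines in this post, e.g.
# "3256:X 09 Mar 07:46:54.142 # +monitor master mymaster 192.168.186.100 6379 quorum 1"
LOG_LINE = re.compile(
    r"^(?P<pid>\d+):X "
    r"(?P<time>\d{2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[#*]) "
    r"(?P<event>[+-][a-z-]+) "
    r"(?P<details>.*)$"
)


class SentinelEvent(NamedTuple):
    time: str
    event: str
    details: str


def parse_event(line: str) -> Optional[SentinelEvent]:
    """Return the event on a log line, or None if the line is not an event line."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return SentinelEvent(match["time"], match["event"], match["details"])


startup_log = [
    "3256:X 09 Mar 07:46:54.142 # Sentinel ID is 285f3e0832faf2c4c94a7ce89c2011fe76beb797",
    "3256:X 09 Mar 07:46:54.142 # +monitor master mymaster 192.168.186.100 6379 quorum 1",
    "3256:X 09 Mar 07:46:54.147 * +slave slave 192.168.186.100:6380 192.168.186.100 6380 @ mymaster 192.168.186.100 6379",
]
for line in startup_log:
    print(parse_event(line))
```

Lines that carry no `+event` or `-event` token, such as the Sentinel ID line, come back as `None`.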


【3】Stop the master server

[pacoson@localhost redis-4.0.8]$ ps -ef |grep redis
pacoson   2112     1  0 06:23 ?        00:00:17 redis-server 192.168.186.100:6379
pacoson   2557  1931  0 06:25 pts/0    00:00:00 redis-cli -h 192.168.186.100
pacoson   2908  2533  0 07:23 pts/1    00:00:08 redis-server *:6380                                                        
pacoson   2913  2854  0 07:23 pts/5    00:00:08 redis-server *:6381                                                        
pacoson   2918  2662  0 07:23 pts/2    00:00:00 redis-cli -h 192.168.186.100 -p 6380
pacoson   2920  2793  0 07:24 pts/4    00:00:00 redis-cli -h 192.168.186.100 -p 6381
pacoson   3256  2733  1 07:46 pts/3    00:00:06 redis-sentinel *:26379 [sentinel]
pacoson   3314  3270  1 07:55 pts/7    00:00:00 grep redis
[pacoson@localhost redis-4.0.8]$ kill 2112
[pacoson@localhost redis-4.0.8]$ kill 2112
-bash: kill: (2112) - No such process
[pacoson@localhost redis-4.0.8]$ ps -ef |grep redis
pacoson   2908  2533  0 07:23 pts/1    00:00:08 redis-server *:6380                                                        
pacoson   2913  2854  0 07:23 pts/5    00:00:08 redis-server *:6381                                                        
pacoson   2918  2662  0 07:23 pts/2    00:00:00 redis-cli -h 192.168.186.100 -p 6380
pacoson   2920  2793  0 07:24 pts/4    00:00:00 redis-cli -h 192.168.186.100 -p 6381
pacoson   3256  2733  1 07:46 pts/3    00:00:07 redis-sentinel *:26379 [sentinel]
pacoson   3317  3270  0 07:56 pts/7    00:00:00 grep redis
[pacoson@localhost redis-4.0.8]$ 

【4】After about 30 seconds (configurable via down-after-milliseconds), the Sentinel process prints the following:

3256:X 09 Mar 07:46:54.115 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
3256:X 09 Mar 07:46:54.142 # Sentinel ID is 285f3e0832faf2c4c94a7ce89c2011fe76beb797
3256:X 09 Mar 07:46:54.142 # +monitor master mymaster 192.168.186.100 6379 quorum 1
3256:X 09 Mar 07:46:54.147 * +slave slave 192.168.186.100:6380 192.168.186.100 6380 @ mymaster 192.168.186.100 6379
3256:X 09 Mar 07:46:54.151 * +slave slave 192.168.186.100:6381 192.168.186.100 6381 @ mymaster 192.168.186.100 6379
// A: Sentinel's log after the master is stopped
3256:X 09 Mar 07:56:49.816 # +sdown master mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:49.816 # +odown master mymaster 192.168.186.100 6379 #quorum 1/1
3256:X 09 Mar 07:56:49.816 # +new-epoch 1
// B: Sentinel tries to pick a slave and promote it to master, i.e. it performs the failover
3256:X 09 Mar 07:56:49.816 # +try-failover master mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:49.841 # +vote-for-leader 285f3e0832faf2c4c94a7ce89c2011fe76beb797 1
3256:X 09 Mar 07:56:49.841 # +elected-leader master mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:49.841 # +failover-state-select-slave master mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:49.896 # +selected-slave slave 192.168.186.100:6380 192.168.186.100 6380 @ mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:49.896 * +failover-state-send-slaveof-noone slave 192.168.186.100:6380 192.168.186.100 6380 @ mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:49.973 * +failover-state-wait-promotion slave 192.168.186.100:6380 192.168.186.100 6380 @ mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:50.814 # +promoted-slave slave 192.168.186.100:6380 192.168.186.100 6380 @ mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:50.814 # +failover-state-reconf-slaves master mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:50.886 * +slave-reconf-sent slave 192.168.186.100:6381 192.168.186.100 6381 @ mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:51.685 * +slave-reconf-inprog slave 192.168.186.100:6381 192.168.186.100 6381 @ mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:52.754 * +slave-reconf-done slave 192.168.186.100:6381 192.168.186.100 6381 @ mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:52.817 # +failover-end master mymaster 192.168.186.100 6379
3256:X 09 Mar 07:56:52.817 # +switch-master mymaster 192.168.186.100 6379 192.168.186.100 6380
3256:X 09 Mar 07:56:52.817 * +slave slave 192.168.186.100:6381 192.168.186.100 6381 @ mymaster 192.168.186.100 6380
3256:X 09 Mar 07:56:52.818 * +slave slave 192.168.186.100:6379 192.168.186.100 6379 @ mymaster 192.168.186.100 6380
3256:X 09 Mar 07:57:22.857 # +sdown slave 192.168.186.100:6379 192.168.186.100 6379 @ mymaster 192.168.186.100 6380
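【Notes】

1) +sdown: this Sentinel subjectively considers the master down, because it has received no valid reply for the configured period.

2) +odown: the master is objectively down, i.e. at least quorum Sentinels agree that it is down (here 1/1).

3) Sentinel then starts the failover: it picks a slave and promotes it to master, logging the following:

+try-failover: Sentinel starts the failover.

+failover-end: Sentinel has finished the failover. The steps in between are involved and include electing a leader Sentinel and selecting the slave to promote.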
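The three decisions behind these log lines can be written as a simplified model. This is only a sketch of the rules as described in the sentinel.conf comments at the end of this post, not Sentinel's actual implementation; the function names and the `down_after_ms` parameter are hypothetical, and 30000 ms is the documented default for down-after-milliseconds.

```python
def is_subjectively_down(ms_since_last_valid_reply: int, down_after_ms: int = 30_000) -> bool:
    """+sdown: this Sentinel alone has seen no valid reply for longer than the limit."""
    return ms_since_last_valid_reply > down_after_ms


def is_objectively_down(sdown_votes: int, quorum: int) -> bool:
    """+odown: at least `quorum` Sentinels report the master as subjectively down."""
    return sdown_votes >= quorum


def can_start_failover(leader_votes: int, known_sentinels: int) -> bool:
    """A Sentinel must be elected leader by a majority of the known Sentinels."""
    return leader_votes > known_sentinels // 2


# The setup in this post: one Sentinel, quorum 1.
print(is_subjectively_down(31_000))   # master silent for 31 s
print(is_objectively_down(1, 1))      # matches "#quorum 1/1" in the log
print(can_start_failover(1, 1))       # the only Sentinel votes for itself
```

Note that the quorum only controls when the master is flagged as objectively down. Starting the failover still needs a majority, which is why a lone Sentinel works in a demo but gives no protection if the Sentinel itself fails.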

【4.1】Look at the last four lines of output:

3256:X 09 Mar 07:56:52.817 # +switch-master mymaster 192.168.186.100 6379 192.168.186.100 6380
3256:X 09 Mar 07:56:52.817 * +slave slave 192.168.186.100:6381 192.168.186.100 6381 @ mymaster 192.168.186.100 6380
3256:X 09 Mar 07:56:52.818 * +slave slave 192.168.186.100:6379 192.168.186.100 6379 @ mymaster 192.168.186.100 6380
3256:X 09 Mar 07:57:22.857 # +sdown slave 192.168.186.100:6379 192.168.186.100 6379 @ mymaster 192.168.186.100 6380
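【Notes】

1) +switch-master: the master has moved from port 6379 to port 6380; that is, the Redis instance on port 6380 has been promoted to master.

2) +slave: lists the two slaves of the new master, 6381 and 6379.

3) Why is 6379 still listed as a slave? Because a stopped instance may come back later; when it does, it will rejoin as a slave of the new master on 6380. Until then Sentinel marks it with +sdown, as the last line shows.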
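The part of the +switch-master line after the event name has five space-separated fields: the master name, the old IP and port, and the new IP and port. A small sketch that unpacks them (the names `MasterSwitch` and `parse_switch_master` are hypothetical, chosen for this example):

```python
from typing import NamedTuple, Tuple


class MasterSwitch(NamedTuple):
    name: str
    old: Tuple[str, int]  # address of the failed master
    new: Tuple[str, int]  # address of the promoted slave


def parse_switch_master(details: str) -> MasterSwitch:
    """Parse '<name> <old-ip> <old-port> <new-ip> <new-port>'."""
    name, old_ip, old_port, new_ip, new_port = details.split()
    return MasterSwitch(name, (old_ip, int(old_port)), (new_ip, int(new_port)))


switch = parse_switch_master("mymaster 192.168.186.100 6379 192.168.186.100 6380")
print(switch.old, "->", switch.new)
```

A client that tracks this event knows where to send writes after the failover without any manual reconfiguration.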

【5】After the failover, run info replication to check the replication state on 6380 and 6381:

192.168.186.100:6380> info replication
# Replication
role:master // this instance is now the master
connected_slaves:1
slave0:ip=192.168.186.100,port=6381,state=online,offset=89055,lag=0
master_replid:4327aef198c3c958e0d8a23414ccb69498cca6f2
master_replid2:28c0673df46b0bceda233e58a7a3cd9ae9d5b2e6
master_repl_offset:89200
second_repl_offset:43056
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:43
repl_backlog_histlen:89158
192.168.186.100:6380> 
192.168.186.100:6381> info replication
# Replication
role:slave // this instance is a slave
master_host:192.168.186.100
master_port:6380
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:90098
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:4327aef198c3c958e0d8a23414ccb69498cca6f2
master_replid2:28c0673df46b0bceda233e58a7a3cd9ae9d5b2e6
master_repl_offset:90098
second_repl_offset:43056
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:211
repl_backlog_histlen:89888
192.168.186.100:6381> 
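When checking several instances, it can help to turn the `info replication` reply into a dictionary. The sketch below assumes the input is the raw reply text only, without the redis-cli prompt lines or the `//` annotations added in this post; `parse_info` and `replica_addresses` are names invented for the example.

```python
from typing import Dict, List, Tuple


def parse_info(text: str) -> Dict[str, str]:
    """Turn 'key:value' lines into a dict, skipping blank lines and '# Section' headers."""
    fields: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def replica_addresses(fields: Dict[str, str]) -> List[Tuple[str, int]]:
    """Collect (ip, port) from the slave0, slave1, ... entries a master reports."""
    addresses = []
    for key, value in fields.items():
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            addresses.append((attrs["ip"], int(attrs["port"])))
    return addresses


sample = """# Replication
role:master
connected_slaves:1
slave0:ip=192.168.186.100,port=6381,state=online,offset=89055,lag=0
master_repl_offset:89200
"""
info = parse_info(sample)
print(info["role"], replica_addresses(info))
```

Keys such as `slave_repl_offset` are left alone because only `slave` followed directly by digits is treated as a replica entry.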

【6】Restart the instance on 6379 and check its replication info:

redis-server redis.conf

1) After it starts, the Sentinel process logs:

3256:X 09 Mar 08:09:55.694 # -sdown slave 192.168.186.100:6379 192.168.186.100 6379 @ mymaster 192.168.186.100 6380
3256:X 09 Mar 08:10:05.693 * +convert-to-slave slave 192.168.186.100:6379 192.168.186.100 6379 @ mymaster 192.168.186.100 6380
2) Connect a client to 6379 and print its replication info:

192.168.186.100:6379> info replication
# Replication
role:slave // this instance is now a slave
master_host:192.168.186.100
master_port:6380  // the master is on port 6380
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:103858
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:4327aef198c3c958e0d8a23414ccb69498cca6f2
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:103858
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:100141
repl_backlog_histlen:3718
192.168.186.100:6379> 


【7】With 6379 back up, check the replication info on 6380 again:

192.168.186.100:6380> info replication
# Replication
role:master // master
connected_slaves:2 // two slaves attached: 6381 and 6379
slave0:ip=192.168.186.100,port=6381,state=online,offset=110813,lag=0
slave1:ip=192.168.186.100,port=6379,state=online,offset=110813,lag=0
master_replid:4327aef198c3c958e0d8a23414ccb69498cca6f2
master_replid2:28c0673df46b0bceda233e58a7a3cd9ae9d5b2e6
master_repl_offset:110813
second_repl_offset:43056
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:43
repl_backlog_histlen:110771
192.168.186.100:6380> 

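【Summary】The steps above show what happens when the master in a master-slave deployment goes down: Sentinel promotes a slave (6380) to master, and when the old master (6379) comes back it rejoins as a slave of the new master.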
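The whole walkthrough boils down to one change of topology. As a rough mental model only (the `Topology` class and `fail_over` function are hypothetical and say nothing about how Sentinel stores its state), it can be sketched like this:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Topology:
    master: int                                      # port of the current master
    replicas: List[int] = field(default_factory=list)  # ports of its slaves


def fail_over(topology: Topology, promoted: int) -> Topology:
    """Promote one replica; the old master stays on record as a replica."""
    if promoted not in topology.replicas:
        raise ValueError(f"{promoted} is not a replica of {topology.master}")
    remaining = [port for port in topology.replicas if port != promoted]
    # The old master is kept in the list even while it is down, so that it
    # can be reconfigured as a slave as soon as it comes back.
    return Topology(master=promoted, replicas=remaining + [topology.master])


before = Topology(master=6379, replicas=[6380, 6381])
after = fail_over(before, promoted=6380)
print(after)
```

The result has 6380 as master with 6381 and 6379 as its slaves, which is the state reported by the final info replication output on 6380.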


【Appendix】The sentinel.conf file

# Example sentinel.conf

# *** IMPORTANT ***
#
# By default Sentinel will not be reachable from interfaces different than
# localhost, either use the 'bind' directive to bind to a list of network
# interfaces, or disable protected mode with "protected-mode no" by
# adding it to this configuration file.
#
# Before doing that MAKE SURE the instance is protected from the outside
# world via firewalling or other means.
#
# For example you may use one of the following:
#
# bind 127.0.0.1 192.168.1.1
#
# protected-mode no

# port <sentinel-port>
# The port that this sentinel instance will run on
port 26379

# sentinel announce-ip <ip>
# sentinel announce-port <port>
#
# The above two configuration directives are useful in environments where,
# because of NAT, Sentinel is reachable from outside via a non-local address.
#
# When announce-ip is provided, the Sentinel will claim the specified IP address
# in HELLO messages used to gossip its presence, instead of auto-detecting the
# local address as it usually does.
#
# Similarly when announce-port is provided and is valid and non-zero, Sentinel
# will announce the specified TCP port.
#
# The two options don't need to be used together, if only announce-ip is
# provided, the Sentinel will announce the specified IP and the server port
# as specified by the "port" option. If only announce-port is provided, the
# Sentinel will announce the auto-detected local IP and the specified port.
#
# Example:
#
# sentinel announce-ip 1.2.3.4

# dir <working-directory>
# Every long running process should have a well-defined working directory.
# For Redis Sentinel to chdir to /tmp at startup is the simplest thing
# for the process to don't interfere with administrative tasks such as
# unmounting filesystems.
dir "/tmp"

# sentinel monitor <master-name> <ip> <redis-port> <quorum>
#
# Tells Sentinel to monitor this master, and to consider it in O_DOWN
# (Objectively Down) state only if at least <quorum> sentinels agree.
#
# Note that whatever is the ODOWN quorum, a Sentinel will require to
# be elected by the majority of the known Sentinels in order to
# start a failover, so no failover can be performed in minority.
#
# Slaves are auto-discovered, so you don't need to specify slaves in
# any way. Sentinel itself will rewrite this configuration file adding
# the slaves using additional configuration options.
# Also note that the configuration file is rewritten when a
# slave is promoted to master.
#
# Note: master name should not include special characters or spaces.
# The valid charset is A-z 0-9 and the three characters ".-_".
sentinel myid 285f3e0832faf2c4c94a7ce89c2011fe76beb797

# sentinel auth-pass <master-name> <password>
#
# Set the password to use to authenticate with the master and slaves.
# Useful if there is a password set in the Redis instances to monitor.
#
# Note that the master password is also used for slaves, so it is not
# possible to set a different password in masters and slaves instances
# if you want to be able to monitor these instances with Sentinel.
#
# However you can have Redis instances without the authentication enabled
# mixed with Redis instances requiring the authentication (as long as the
# password set is the same for all the instances requiring the password) as
# the AUTH command will have no effect in Redis instances with authentication
# switched off.
#
# Example:
#
# sentinel auth-pass mymaster MySUPER--secret-0123passw0rd

# sentinel down-after-milliseconds <master-name> <milliseconds>
#
# Number of milliseconds the master (or any attached slave or sentinel) should
# be unreachable (as in, not acceptable reply to PING, continuously, for the
# specified period) in order to consider it in S_DOWN state (Subjectively
# Down).
#
# Default is 30 seconds.
# (highlighted in this post) the master monitored in this walkthrough
sentinel monitor mymaster 192.168.186.100 6379 1

# sentinel parallel-syncs <master-name> <numslaves>
#
# How many slaves we can reconfigure to point to the new slave simultaneously
# during the failover. Use a low number if you use the slaves to serve query
# to avoid that all the slaves will be unreachable at about the same
# time while performing the synchronization with the master.
sentinel config-epoch mymaster 0

# sentinel failover-timeout <master-name> <milliseconds>
#
# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
#   already tried against the same master by a given Sentinel, is two
#   times the failover timeout.
#
# - The time needed for a slave replicating to a wrong master according
#   to a Sentinel current configuration, to be forced to replicate
#   with the right master, is exactly the failover timeout (counting since
#   the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
#   did not produced any configuration change (SLAVEOF NO ONE yet not
#   acknowledged by the promoted slave).
#
# - The maximum time a failover in progress waits for all the slaves to be
#   reconfigured as slaves of the new master. However even after this time
#   the slaves will be reconfigured by the Sentinels anyway, but not with
#   the exact parallel-syncs progression as specified.
#
# Default is 3 minutes.
sentinel leader-epoch mymaster 0

# SCRIPTS EXECUTION
#
# sentinel notification-script and sentinel reconfig-script are used in order
# to configure scripts that are called to notify the system administrator
# or to reconfigure clients after a failover. The scripts are executed
# with the following rules for error handling:
#
# If script exits with "1" the execution is retried later (up to a maximum
# number of times currently set to 10).
#
# If script exits with "2" (or an higher value) the script execution is
# not retried.
#
# If script terminates because it receives a signal the behavior is the same
# as exit code 1.
#
# A script has a maximum running time of 60 seconds. After this limit is
# reached the script is terminated with a SIGKILL and the execution retried.

# NOTIFICATION SCRIPT
#
# sentinel notification-script <master-name> <script-path>
#
# Call the specified notification script for any sentinel event that is
# generated in the WARNING level (for instance -sdown, -odown, and so forth).
# This script should notify the system administrator via email, SMS, or any
# other messaging system, that there is something wrong with the monitored
# Redis systems.
#
# The script is called with just two arguments: the first is the event type
# and the second the event description.
#
# The script must exist and be executable in order for sentinel to start if
# this option is provided.
#
# Example:
#
# sentinel notification-script mymaster /var/redis/notify.sh

# CLIENTS RECONFIGURATION SCRIPT
#
# sentinel client-reconfig-script <master-name> <script-path>
#
# When the master changed because of a failover a script can be called in
# order to perform application-specific tasks to notify the clients that the
# configuration has changed and the master is at a different address.
#
# The following arguments are passed to the script:
#
# <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
#
# <state> is currently always "failover"
# <role> is either "leader" or "observer"
#
# The arguments from-ip, from-port, to-ip, to-port are used to communicate
# the old address of the master and the new address of the elected slave
# (now a master).
#
# This script should be resistant to multiple invocations.
#
# Example:
#
# sentinel client-reconfig-script mymaster /var/redis/reconfig.sh

# Generated by CONFIG REWRITE
maxclients 4064
maxmemory 3gb
sentinel known-slave mymaster 192.168.186.100 6381
sentinel known-slave mymaster 192.168.186.100 6380
sentinel current-epoch 0

