线上MYSQL同步报错故障处理总结
公司使用腾讯云数据库,今天在从库上面查询相关数据时候,显示没有任何记录,登录后
show slave status\G 查看到状态中报1032错误,这里把相关主从同步故障总结一下。
先上Master库:
mysql>show processlist; 查看下进程是否Sleep太多。发现很正常。
mysql> show master status; 也是正常的。
+-------------------+----------+--------------+-------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+-------------------------------+
| mysqld-bin.000001 | 3260 | | mysql,test,information_schema |
+-------------------+----------+--------------+-------------------------------+
1 row in set (0.00 sec)
解决方法:
方法一:忽略错误后,继续同步
适用于主从库数据相差不大,或者要求数据可以不完全统一的情况,数据要求不严格的情况
mysql> Stop slave ;
mysql> set global sql_slave_skip_counter=1;
mysql> start slave;
mysql> show slave status\G 查看:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
ok,现在主从同步状态正常了
方式二:指定跳过错误代码,继续同步
主键冲突、表已存在等错误代码如1062,1032,1060等,可以在mysql主配置文件指定略过此类异常并继续下条sql同步,这样也可以避免很多主从同步的异常中断
[mysqld]
slave-skip-errors = 1062,1032,1060
重新启动mysql
service mysqld restart
之后再用mysql> show slave status\G 查看:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
方式三:重新做主从,完全同步
适用于主从库数据相差较大,或者要求数据完全统一的情况 。
1.先进入主库,进行锁表,防止数据写入
mysql> flush tables with read lock; 注意:该处是锁定为只读状态,语句不区分大小写
2.在主库上面进行数据备份
[root@server01 mysql]#mysqldump -uroot -p -hlocalhost > mysql.bak.sql
注意:数据库备份一定要定期进行,可以用shell脚本或者python脚本,确保数据万无一失 。
3.查看master 状态
mysql> show master status;
+-------------------+----------+--------------+-------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+-------------------------------+
| mysqld-bin.000001 | 3260 | | mysql,test,information_schema |
+-------------------+----------+--------------+-------------------------------+
1 row in set (0.00 sec)
4.把mysql备份文件传到从库机器,进行数据恢复
[root@server01 mysql]# scp mysql.bak.sql [email protected]:/tmp/
5.停止从库的状态
mysql> stop slave;
6.然后到从库执行mysql命令,导入数据备份
mysql> source /tmp/mysql.bak.sql
7.设置从库同步,注意该处的同步点,就是主库show master status信息里的| File| Position两项
change master to master_host = '192.168.128.100', master_user = 'rsync', master_port=3306, master_password='', master_log_file = 'mysqld-bin.000001', master_log_pos=3260;
8.重新开启从库同步
mysql> start slave;
9.查看同步状态
mysql> show slave status\G 查看:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
好了,同步完成啦。
http://www.ilanni.com/?p=7992
MySQL出现同步延迟解决优化方法
1.主从复制的从库太多导致复制延迟
优化:建议从库数量3-5个为宜(具体看自己硬件配置)
2.从库硬件比主库硬件差
优化:提升硬件性能
3.慢SQL语句过多
优化:SQL语句执行时间太长,需要优化SQL语句(需要联系DBA开发共同商讨优化语句)
4.主从复制的设计问题
优化:主从复制单线程,可以通过多线程IO方案解决;另外MySQL5.6.3支持多线程IO复制。
5.主从库之间的网络延迟
优化:尽量链路短,提升端口带宽
6.主库读写压力大
优化:前端加buffer和缓存。主从延迟不同步(延迟的多少,只要不影响业务就没事)
7、业务设计缺陷导致延迟影响业务
优化:从库没有数据改读主库
前言
在发生故障切换后,经常遇到的问题就是同步报错,数据库很小的时候,在主库Mysqldump导出数据,再从库导入就处理好了,但如果线上的数据库都150G-200G,如果用单纯的这种方法,现总结了几种处理方法。
生产环境架构图
目前现网的架构,保存着两份数据,通过异步复制做的高可用集群,两台机器都提供对外服务。在发生故障时,切换到slave上,并将其变成master,坏掉的机器反向同步新的master,在处理故障时,遇到最多的就是主从报错。
异步半同步区别
异步复制
简单说就是master把binlog发送过去,不管slave是否接收完,也不管是否执行完,这一动作就结束了.
半同步复制
就是master把binlog发送过去,slave确认接收完,但不管它是否执行完,给master一个信号我这边收到了,这一动作就结束了。(5.5上正式应用。)
异步的劣势
当master上写操作繁忙时,当前POS点例如是10,而slave上IO_THREAD线程接收过来的是3,此时master宕机,会造成相差7个点未传送到slave上而数据丢失。
错误汇总
最常见的3种情况及处理方法
这3种情况是在HA切换时,由于是异步复制,且sync_binlog=0,会造成一小部分binlog没接收完导致同步报错。
第一种:在master上删除一条记录,而slave上找不到。
Last_SQL_Error: Could not execute Delete_rows event on table hcy.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000006, end_log_pos 254
解决方法:由于master要删除一条记录,而slave上找不到故报错,这种情况主上都将其删除了,那么从机可以直接跳过。可用命令:
stop slave; set global sql_slave_skip_counter=1; start slave;
如果这种情况很多,可用我写的一个脚本skip_error_replcation.sh,默认跳过10个错误(只针对这种情况才跳,其他情况输出错误结果,等待处理),这个脚本是参考maakit工具包的mk-slave-restart原理用shell写的,功能上定义了一些自己的东西,不是无论什么错误都一律跳过。)
第二种:主键重复。在slave已经有该记录,又在master上插入了同一条记录。
Last_SQL_Errno: 1062 Last_SQL_Error: Error 'Duplicate entry '192442' for key 'PRIMARY'' on query. Default database: 'xxxxxxxxxxxxxx'. Query: 'INSERT INTO xxxxxxxxxxx(playerId, `type`, `count`)VALUES( NAME_CONST('pPlayerId',629014986), NAME_CONST('pType',8), 0)'
解决方法:
在master上查看一下记录:
mysql> select * from hcy.t1 where id=192442; +--------+-----------+------+-------+-------+---------------+ | id | playerId | type | count | total | lastResetTime | +--------+-----------+------+-------+-------+---------------+ | 192442 | 629014986 | 8 | 0 | 0 | 0 | +--------+-----------+------+-------+-------+---------------+ row in set (0.00 sec)
在slave上查看记录
mysql> select * from hcy.t1 where id=192442; +--------+-----------+------+-------+-------+---------------+ | id | playerId | type | count | total | lastResetTime | +--------+-----------+------+-------+-------+---------------+ | 192442 | 629015414 | 8 | 0 | 0 | 0 | +--------+-----------+------+-------+-------+---------------+ row in set (0.00 sec)
在slaver上用desc hcy.t1; 先看下表结构:
mysql> desc hcy.t1; +-------+---------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------+---------+------+-----+---------+-------+ | id | int(11) | NO | PRI | 0 | | | name | char(4) | YES | | NULL | | +-------+---------+------+-----+---------+-------+
可以看到slave的记录与主库不一样,当然以主库的记录为准,所以要在slave库上把主键冲突的记录删除掉,
slave上面删除重复的主键
mysql> delete from t1 where id=192442; mysql> start slave; mysql> show slave status\G; .... Slave_IO_Running: Yes Slave_SQL_Running: Yes …… mysql> select * from t1 where id=192442;
在master上和slave上再分别确认一下。
如果一两条,手动删除没什么问题,但比较多的时候,手动删除的效率是灰常慢的,下面写了个脚本,这个脚本只适合主键冲突的情况。
#!/bin/bash #Delete duplicate records primary key conflict mysql=/usr/local/mysql-5.1.66/bin/mysql sock=/data/mysql-slave-3311/mysql.sock passwd=123456 while true do SQL_THREAD=`$mysql -uroot -p$passwd -S $sock -e 'show slave status\G' | egrep 'Slave_SQL_Running' | awk '{print $2}'` LAST_ERROR=`$mysql -uroot -p$passwd -S $sock -e 'show slave status\G' | egrep Last_Errno | awk '{print $2}'` duplicate=`$mysql -uroot -p$passwd -S $sock -e 'show slave status\G' | grep Last_Error | awk '/Duplicate entry/{print $5}' | awk -F "'" '{print $2}'` DATABASE=`$mysql -uroot -p$passwd -S $sock -e 'show slave status\G' | grep Last_Error | awk '{print $13}' | awk -F "'" '{print $2}'` TABLE=`$mysql -uroot -p$passwd -S $sock -e 'show slave status\G' | grep Last_Error | awk -F ":" '{print $4}' | awk -F "(" '{print $1}' | awk '{print $NF}'` $mysql -uroot -p$passwd -S $sock -e 'show slave status\G' | grep HA_ERR_FOUND_DUPP_KEY if [ $? -eq 1 ] then if [ "$SQL_THREAD" == No ] && [ "$LAST_ERROR" == 1062 ] then FILED=`$mysql -uroot -p$passwd -S $sock -Nse "desc $DATABASE.$TABLE" | grep PRI | awk '{print $1}'` $mysql -uroot -p$passwd -S $sock -e "delete from $DATABASE.$TABLE where $FILED=$duplicate" $mysql -uroot -p$passwd -S $sock -e "start slave sql_thread" else echo "====================== ok ========================" $mysql -uroot -p$passwd -S $sock -e 'show slave status\G' | egrep 'Slave_.*_Running' echo "====================== ok ========================" break fi fi done
第三种:在master上更新一条记录,而slave上找不到,丢失了数据。
Last_SQL_Error: Could not execute Update_rows event on table hcy.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000010, end_log_pos 794
解决方法:在master上,用mysqlbinlog 分析下出错的binlog日志在干什么。
/usr/local/mysql/bin/mysqlbinlog --no-defaults -v -v --base64-output=DECODE-ROWS mysql-bin.000010 | grep -A '10' 794 #120302 12:08:36 server id 22 end_log_pos 794 Update_rows: table id 33 flags: STMT_END_F ### UPDATE hcy.t1 ### WHERE ### @1=2 /* INT meta=0 nullable=0 is_null=0 */ ### @2='bbc' /* STRING(4) meta=65028 nullable=1 is_null=0 */ ### SET ### @1=2 /* INT meta=0 nullable=0 is_null=0 */ ### @2='BTV' /* STRING(4) meta=65028 nullable=1 is_null=0 */ # at 794 #120302 12:08:36 server id 22 end_log_pos 821 Xid = 60 COMMIT/*!*/; DELIMITER ; # End of log file ROLLBACK /* added by mysqlbinlog */; /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
在slave上,查找下更新后的那条记录,应该是不存在的。
mysql> select * from t1 where id=2; Empty set (0.00 sec)
然后再到master查看
mysql> select * from t1 where id=2; +----+------+ | id | name | +----+------+ | 2 | BTV | +----+------+
把丢失的数据在slave上填补,然后跳过报错即可。
mysql> insert into t1 values (2,'BTV'); mysql> select * from t1 where id=2; +----+------+ | id | name | +----+------+ | 2 | BTV | +----+------+ 1 row in set (0.00 sec) mysql> stop slave ;set global sql_slave_skip_counter=1;start slave; mysql> show slave status\G; Slave_IO_Running: Yes Slave_SQL_Running: Yes
正常同步了。如果有N多数据缺失,得用pt-table-checksum校验数据一致性,为什么slave库上会少数据呢?我总结了以下几种情况:
1、当人为设置set session sql_log_bin=0时,当前session操作是不记录到Binlog的。
2、就是slave没设置为read only,在slave库上有删除操作
3、slave读取master的binlog日志后,需要落地3个文件:relay log、relay log info、master info,这三个文件如果不及时落地,则主机crash后会导致数据的不一致
特殊的情况
第一种:slave的中继日志relay-bin损坏。
这种情况SLAVE在宕机,或者非法关机,例如电源故障、主板烧了等,造成中继日志损坏,同步停掉。
Last_SQL_Error: Error initializing relay log position: I/O error reading the header from the binary log Last_SQL_Error: Error initializing relay log position: Binlog has bad magic number; It's not a binary log file that can be used by this version of MySQL
解决方法:找到同步的binlog和POS点,然后重新做同步,这样就可以有新的中继日值了。
例子:
以执行写的binlog和POS点为准。
Relay_Master_Log_File: mysql-bin.000010 Exec_Master_Log_Pos: 821
mysql> stop slave; mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000010',MASTER_LOG_POS=821; mysql> start slave;
如果出现slave丢失数据太多,ibbackup(需要银子)该你登场了。
Ibbackup热备份工具,是付费的。xtrabackup是免费的,功能上一样。
Ibbackup备份期间不锁表,备份时开启一个事务(相当于做一个快照),然后会记录一个点,之后数据的更改保存在ibbackup_logfile文件里,恢复时把ibbackup_logfile 变化的数据再写入到ibdata里。
Ibbackup 只备份数据( ibdata、.ibd ),表结构.frm不备份。
下面一个演示例子:
备份:ibbackup /bak/etc/my_local.cnf /bak/etc/my_bak.cnf
恢复:ibbackup --apply-log /bak/etc/my_bak.cnf
[root@vm01 etc]# ibbackup /bak/etc/my_local.cnf /bak/etc/my_bak.cnf C ontents of /bak/etc/my_local.cnf: innodb_data_home_dir got value /usr/local/mysql/data innodb_data_file_path got value ibdata1:10M:autoextend datadir got value /usr/local/mysql/data innodb_log_group_home_dir got value /usr/local/mysql/data innodb_log_files_in_group got value 2 innodb_log_file_size got value 5242880 Contents of /bak/etc/my_bak.cnf: innodb_data_home_dir got value /bak/data innodb_data_file_path got value ibdata1:10M:autoextend datadir got value /bak/data innodb_log_group_home_dir got value /bak/data innodb_log_files_in_group got value 2 innodb_log_file_size got value 5242880 ibbackup: Found checkpoint at lsn 0 1636898 20302 16:47:44 ibbackup: Copying /usr/local/mysql/data/ibdata1 ibbackup: A copied database page was modified at 0 1636898 ibbackup: Scanned log up to lsn 0 1636898 ibbackup: Was able to parse the log up to lsn 0 1636898 ibbackup: Maximum page number for a log record 0 120302 16:47:46 ibbackup: Full backup completed! [root@vm01 etc]# [root@vm01 etc]# cd /bak/data/ [root@vm01 data]# ls ibbackup_logfile ibdata1
[root@vm01 data]# ibbackup --apply-log /bak/etc/my_bak.cnf contents of /bak/etc/my_bak.cnf: innodb_data_home_dir got value /bak/data innodb_data_file_path got value ibdata1:10M:autoextend datadir got value /bak/data innodb_log_group_home_dir got value /bak/data innodb_log_files_in_group got value 2 innodb_log_file_size got value 5242880 120302 16:48:38 ibbackup: ibbackup_logfile's creation parameters: ibbackup: start lsn 0 1636864, end lsn 0 1636898, ibbackup: start checkpoint 0 1636898 InnoDB: Starting an apply batch of log records to the database... InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 .....99 Setting log file size to 0 5242880 ibbackup: We were able to parse ibbackup_logfile up to ibbackup: lsn 0 1636898 ibbackup: Last MySQL binlog file position 0 1191, file name ./mysql-bin.000010 ibbackup: The first data file is '/bak/data/ibdata1' ibbackup: and the new created log files are at '/bak/data/' 120302 16:48:38 ibbackup: Full backup prepared for recovery successfully! [root@vm01 data]# ls ibbackup_logfile ibdata1 ib_logfile0 ib_logfile1
把ibdata1 ib_logfile0 ib_logfile1拷贝到从库上面,把.frm也拷贝过去,启动MYSQL后,做同步,那个点就是上面输出的:
第二种:人为操作产生多台slave存在重复server-id
这种情况同步会一直延时,永远也同步不完,error错误日志里一直出现上面两行信息。解决方法就是把server-id改成不一致即可。
Slave: received end packet from server, apparent master shutdown: Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000012' at postion 106
参考:http://www.cnblogs.com/xuanzhi201111/p/4566451.html
http://xstarcd.github.io/wiki/MySQL/online_mysqlrepl_error.html
MySQL错误代码中文说明
1005:创建表失败
1006:创建数据库失败
1007:数据库已存在,创建数据库失败
1008:数据库不存在,删除数据库失败
1009:不能删除数据库文件导致删除数据库失败
1010:不能删除数据目录导致删除数据库失败
1011:删除数据库文件失败
1012:不能读取系统表中的记录
1020:记录已被其他用户修改
1021:硬盘剩余空间不足,请加大硬盘可用空间
1022:关键字重复,更改记录失败
1023:关闭时发生错误
1024:读文件错误
1025:更改名字时发生错误
1026:写文件错误
1032:记录不存在
1036:数据表是只读的,不能对它进行修改
1037:系统内存不足,请重启数据库或重启服务器
1038:用于排序的内存不足,请增大排序缓冲区
1040:已到达数据库的最大连接数,请加大数据库可用连接数
1041:系统内存不足
1042:无效的主机名
1043:无效连接
1044:当前用户没有访问数据库的权限
1045:不能连接数据库,用户名或密码错误
1048:字段不能为空
1049:数据库不存在
1050:数据表已存在
1051:数据表不存在
1054:字段不存在
1065:无效的SQL语句,SQL语句为空
1081:不能建立Socket连接
1114:数据表已满,不能容纳任何记录
1116:打开的数据表太多
1129:数据库出现异常,请重启数据库
1130:连接数据库失败,没有连接数据库的权限
1133:数据库用户不存在
1141:当前用户无权访问数据库
1142:当前用户无权访问数据表
1143:当前用户无权访问数据表中的字段
1146:数据表不存在
1147:未定义用户对数据表的访问权限
1149:SQL语句语法错误
1158:网络错误,出现读错误,请检查网络连接状况
1159:网络错误,读超时,请检查网络连接状况
1160:网络错误,出现写错误,请检查网络连接状况
1161:网络错误,写超时,请检查网络连接状况
1062:字段值重复,入库失败
1169:字段值重复,更新记录失败
1177:打开数据表失败
1180:提交事务失败
1181:回滚事务失败
1203:当前用户和数据库建立的连接已到达数据库最大连接数,请增大可用数据库连接数或重启数据库
1205:加锁超时
1211:当前用户没有创建用户的权限
1216:外键约束检查失败,更新子表记录失败
1217:外键约束检查失败,删除或修改主表记录失败
1226:当前用户使用的资源已超过所允许的资源,请重启数据库或重启服务器
1227:权限不足,您无权进行此操作
1235:MySQL版本过低,不具有本功能
主从同步不一致实现邮件报警
方法一:
环境:由于我们采用腾讯云数据库做的主从,不能登录到云数据库上面,只能从其他一台云服务器上面进行监控。
定期验证MYSQL主从数据一致性在某些时候是一项很值得做的事情,google还提供了MYSQL补丁来自动实现该功能。但某些时候我们能够定期检查到复制的状态也足够了,这里借鉴网上一位网友自动检查mysql主从复制状态的脚本,该脚本实现功能:
1.通过mysql命令判断mysql从服务器三个主要的复制状态值是否正常,检测到有问题后,根据设定时间间隔值,再重复检测两遍,其中Seconds_Behind_Master值因为不能准确地描述复制延迟,所以在3600秒以下都判断为正常。
2.检测到有问题后记录日志并发送邮件通知状态值内容,问题持续则根据设定时间间隔值再发邮件通知,恢复正常也邮件通知。另外,无论正常与否,每天都发送一次邮件通知,由计划任务和时间相关的变量值决定。
该脚本可以监控主从复制状态,如果要验证主从MYSQL数据一致性,可以研究下google提供的MYSQL补丁。
第一步:首先安装mailx组件并配置好能够通过三方邮箱发送邮件
yum -y install mailx
然后编辑mailx的配置文件
vi /etc/mail.rc
在最末尾添加如下信息
set [email protected] smtp=smtp.qq.com
set [email protected] smtp-auth-password=xxxxxx
set smtp-auth=login
保存退出后测试邮件是否能够正常发送出去
echo"zabbix test mail" |mail -s "zabbix" [email protected]
[root@monitor scripts]# cat check_replication_status.sh #/bin/bash ############################################# # author zhaoyanan # date 2013/01/25 create # update 2013/01/28 Adding duplicate detection # # Execution: # touch /root/sh/mysql_slave_status.sh; chmod 700 /root/sh/mysql_slave_status.sh # vi /etc/crontab # 5,15,25,35,45,55 * * * * root /root/sh/mysql_slave_status.sh >> /root/sh/mysql_slave_status.log 2>&1 ############################################# ######### set variables ############ tmpdir=/tmp mysqlhost="x.x.x.x" mysqlport="3306" mysqlsocket="/var/lib/mysql/mysql.sock" mysqlbinpath="/usr/bin/" mysqluser="test" mysqlpw="xxxx" servername="123" normal_status="$servername slave status ok!" problem_status="$servername slave status problems!" problem_many_status="$servername slave status problems! (too many times)" returm_to_nomal="$servername slave status return to normal from the question!" repeat_alarm_time=12 # Repeat alarm time interval ( About *10 minute. value of 12, about two hours) failure_interval=10 # Interval after a problem is detected, the unit: seconds ( < 25 seconds) reporting_time=0855 # Must be notified of the time, even if normal. current_time=$(date +%H%M) # Current time export PATH="$mysqlbinpath":/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin ###### do date #################################### echo "" echo "`date` Start monitoring." echo "" #### function #### function mailto1() { echo "$servername slave status: $slaveiostatus $slavesqlstatus $slavebehind" | mail -s "$normal_status" [email protected] #/bin/mail -s "$normal_status" [email protected] < /tmp/normal_status.txt } function mailto2() { echo "$servername slave status: $slaveiostatus $lastioerror $slavesqlstatus $lastsqlerror $slavebehind " | mail -s "$problem_status" [email protected] #/bin/mail -s "$problem_status" [email protected] < /tmp/problem_status.txt } function mailto3() { echo "$servername slave status: $slaveiostatus $lastioerror $slavesqlstatus $lastsqlerror $slavebehind" | mail -s "$problem_many_status" [email protected] #/bin/mail -s "$problem_many_status" [email protected] < /tmp/problem_status.txt } function mailto4() { echo "$servername slave status: $slaveiostatus $slavesqlstatus $slavebehind" | mail -s "$returm_to_nomal" [email protected] #/bin/mail -s "$returm_to_nomal" [email protected] < /tmp/return_to_nomail.txt } function showstatus() { /usr/bin/mysql -S $mysqlsocket -u"$mysqluser" -h"$mysqlhost" -p"$mysqlpw" -e "show slave status\G" > "$tmpdir"/"$servername"_status.txt slaveiostatus=`cat "$tmpdir"/"$servername"_status.txt | grep "Slave_IO_Running" | sed 's/^[ \t]*//g'` lastioerror=`cat "$tmpdir"/"$servername"_status.txt | grep "Last_IO_Error" | sed 's/^[ \t]*//g'` slavesqlstatus=`cat "$tmpdir"/"$servername"_status.txt | grep "Slave_SQL_Running" | sed 's/^[ \t]*//g'` lastsqlerror=`cat "$tmpdir"/"$servername"_status.txt | grep "Last_SQL_Error" | sed 's/^[ \t]*//g'` slavebehind=`cat "$tmpdir"/"$servername"_status.txt | grep "Seconds_Behind_Master" | sed 's/^[ \t]*//g'` slaveiovalue=`cat "$tmpdir"/"$servername"_status.txt | grep "Slave_IO_Running" | sed 's/^[ \t]*//g' | awk -F ': ' '{print $2}'` slavesqlvalue=`cat "$tmpdir"/"$servername"_status.txt | grep "Slave_SQL_Running" | sed 's/^[ \t]*//g' | awk -F ': ' '{print $2}'` slavebehindvalue=`echo "$slavebehind" | awk -F ': ' '{print $2}' | grep '[0-9]'` } ### check #### test -f "$tmpdir"/"$servername"_m_value || echo "0" > "$tmpdir"/"$servername"_m_value test -f "$tmpdir"/"$servername"_n_value || echo "0" > "$tmpdir"/"$servername"_n_value m=`cat "$tmpdir"/"$servername"_m_value` n=`cat "$tmpdir"/"$servername"_n_value` if showstatus;sync;sleep 1 [ "$slaveiovalue" == "Yes" ] && [ "$slavesqlvalue" == "Yes" ] && [ "$slavebehindvalue" -le 3600 ];then echo "`date` First detected, $servername slave status ok!" m=0; echo $m > "$tmpdir"/"$servername"_m_value elif sleep $failure_interval showstatus;sync;sleep 1 [ "$slaveiovalue" == "Yes" ] && [ "$slavesqlvalue" == "Yes" ] && [ "$slavebehindvalue" -le 3600 ];then echo "`date` Second detection, $servername slave status ok!" m=0; echo $m > "$tmpdir"/"$servername"_m_value elif sleep $failure_interval showstatus;sync;sleep 1 [ "$slaveiovalue" == "Yes" ] && [ "$slavesqlvalue" == "Yes" ] && [ "$slavebehindvalue" -le 3600 ];then echo "`date` Third detection, $servername slave status ok!" m=0; echo $m > "$tmpdir"/"$servername"_m_value else echo "`date` After three detection, $servername slave problems!" m=$(($m+1)); echo $m > "$tmpdir"/"$servername"_m_value fi #### log and mail #### if [ "$reporting_time" -eq "$current_time" ] && [ "$m" -eq 0 ] && [ "$n" -eq 0 ]; then echo "mailto." mailto1 elif [ "$m" -eq 1 ] && [ "$n" -eq 0 ]; then echo "`date` $servername slave status problems! mailto2." echo "$servername slave status: " echo "$slaveiostatus" echo "$lastioerror" echo "$slavesqlstatus" echo "$lastsqlerror" echo "$slavebehind" mailto2 n=1; echo $n > "$tmpdir"/"$servername"_n_value elif [ "$m" -eq "$repeat_alarm_time" ] && [ "$n" -eq 1 ]; then mailto3 echo "`date` $servername slave status problems! too many times, mailto3." m=0; echo $m > "$tmpdir"/"$servername"_m_value elif [ "$m" -eq 0 ] && [ "$n" -eq 1 ]; then mailto4 echo "$servername slave status return to normal from the question. mailto4." n=0; echo $n > "$tmpdir"/"$servername"_n_value elif [ "$m" -ne 0 ] && [ "$n" -eq 1 ]; then echo "`date` $servername slave status problems! too many times, nomailto" fi [root@monitor scripts]#
收到报警邮件
参考:http://zhaoyanan.blog.51cto.com/411564/1127828
方法二:
下载并解压sendEmail
tar -zxvf sendEmail-v1.56.tar.gz cp sendEmail-v1.56/sendEmail /usr/local/bin/ 拷贝之后就能用了 发邮件命令: /usr/local/bin/sendEmail -f [email protected] -t [email protected] -s smtp.163.com -u "test" -xu zhang_peicheng -xp xxxxxxxxxxxxx -m "this is a test email" -a /tmp/123.txt 简要说明 : –f 发送方地址 –t 发给谁 –s 发送方smtp服务器 –u 主题 -xu 发送邮件账号 –xp 发送方邮箱的密码 -m 内容 –a 邮件附件
2.编写mysql主从监控脚本
# cat mysql_check_replication_status.sh #!/bin/bash SLAVE_IO_S=`/usr/bin/mysql -uroot -h 10.10.10.10 -p123456 -e "show slave status\G" | grep "Slave_IO_Running" | awk -F ':' '{print $2}'` SLAVE_SQL_S=`/usr/bin/mysql -uroot -h 10.10.10.10 -p123456 -e "show slave status\G" | grep "Slave_SQL_Running" | awk -F ':' '{print $2}'` date1=`date +%Y%m%d` if [ $SLAVE_IO_S == "Yes" ] && [ $SLAVE_SQL_S == "Yes" ];then echo "the mysql-master-slave-status is OK" else echo "the mysql-master-slave-status is falied" if [ ! -d /tmp/$data1 ];then mkdir -p /tmp/$data1 fi /usr/bin/mysql -uroot -h 10.10.10.10 -p123456 -e "show slave status\G" > /tmp/$data1/mysql-master-slave-status.txt /usr/local/bin/sendEmail -f [email protected] -t [email protected] -s smtp.exmail.qq.com -u "mysql-master-slave-status" -xu [email protected] -xp 123456 -m "the mysql-master-slave status is failed" -a /tmp/$data1/mysql-master-slave-status.txt fi
#!/bin/bash mysql_user="root" mysql_pass="123456" email_addr="[email protected]" mysql_status=`netstat -nl | awk 'NR>2{if ($4 ~ /.*:3306/) {print "Yes";exit 0}}'` if [ "$mysql_status" == "Yes" ];then slave_status=`mysql -u${mysql_user} -p${mysql_pass} -e"show slave status\G" | grep "Running" | awk '{if ($2 != "Yes") {print "No";exit 1}}'` if [ "$slave_status" == "No" ];then echo "slave is not working!" [ ! -f "/tmp/slave" ] && echo "Slave is not working!" | mail -s "Warn!MySQL Slave is not working" ${email_addr} touch /tmp/slave else echo "slave is working." [ -f "/tmp/slave" ] && rm -f /tmp/slave fi [ -f "/tmp/mysql_down" ] && rm -f /tmp/mysql_down else [ ! -f "/tmp/mysql_down" ] && echo "Mysql Server is down!" | mail -s "Warn!MySQL server is down!" ${email_addr} touch /tmp/mysql_down fi
监控检测redis主从状态的脚本
# cat check_replication_status_redis.sh #!/bin/bash STATUS=`/usr/local/bin/redis-cli -h 127.0.0.1 -p 16379 -a 123456 info | grep -v "^#" | grep 'master_link_status' | awk -F: '{print $2}'` DATE=`date +%Y%m%d` if [ $STATUS != 'down' ];then echo "The redis-master-slave-satus is ok" else echo "The redis-master-slave-status is down, please to check redis-slave service status !!!" if [ ! -d /tmp/$DATE ];then mkdir -p /tmp/$DATE fi /usr/local/bin/redis-cli -h 127.0.0.1 -p 16379 -a 123456 info > /tmp/$DATE/redis-master-slave-status.txt /usr/local/bin/sendEmail -f [email protected] -t [email protected] -s smtp.exmail.qq.com -u "redis-master-slave-status" -xu [email protected] -xp 123456 -m "The redis-master-slave status is down" -a /tmp/$DATE/redis-master-slave-status.txt fi