作为一名MySQL DBA,首要的任务是维持数据库的可用性和稳定性,在生产中,有时候一条慢SQL会拖垮整个系统的响应和体验,所以慢SQL治理至关重要。
首先我们采用pt-query-digest
来进行慢sql采集和分析
mysql> show global variables where variable_name in ('slow_query_log','long_query_time','slow_query_log_file','log_queries_not_using_indexes','log_throttle_queries_not_using_indexes','min_examined_row_limit','log_slow_admin_statements','log_slow_slave_statements','log_output');
+----------------------------------------+----------------------------+
| Variable_name | Value |
+----------------------------------------+----------------------------+
| log_output | FILE |
| log_queries_not_using_indexes | ON |
| log_slow_admin_statements | OFF |
| log_slow_slave_statements | ON |
| log_throttle_queries_not_using_indexes | 100 |
| long_query_time | 0.500000 |
| min_examined_row_limit | 0 |
| slow_query_log | ON |
| slow_query_log_file | slowquery_201908161156.log |
+----------------------------------------+----------------------------+
9 rows in set (0.01 sec)
该mysql版本为percona5.7.21版本,控制slow log的参数主要为这几个:
这些参数我们目前主要使用log_output
,slow_query_log
,slow_query_log_file
,min_examined_row_limit
,log_throttle_queries_not_using_indexes
这几个
mysql> set global log_output = 'FILE';
Query OK, 0 rows affected (0.00 sec)
mysql> set global slow_query_log = 1;
Query OK, 0 rows affected (0.00 sec)
mysql> set global min_examined_row_limit = 0;
Query OK, 0 rows affected (0.00 sec)
mysql> set global log_throttle_queries_not_using_indexes = 1;
Query OK, 0 rows affected (0.00 sec)
mysql> set global slow_query_log_file = 'slowquery_201908161203.log';
Query OK, 0 rows affected (0.00 sec)
设置参数后开始安装pt工具
我们这里直接安装个全家桶(percona-toolkit
)吧,首先安装依赖包:
yum install -y perl perl-IO-Socket-SSL perl-DBD-MySQL perl-Time-HiRes perl-Digest-MD5 perl-ExtUtils-MakeMaker
然后下载percona-toolkit:
percona-toolkit
或者下载这个https://download.csdn.net/download/gpf19950411/11568412
这里是cenos7环境,如果是其他环境另行下载
我们直接安装 rpm -ivh percona-toolkit-3.0.13-1.el7.x86_64.rpm
pt-query-digest [OPTIONS] [FILES] [DSN]
--create-review-table 当使用--review参数把分析结果输出到表中时,如果没有表就自动创建。
--create-history-table 当使用--history参数把分析结果输出到表中时,如果没有表就自动创建。
--filter 对输入的慢查询按指定的字符串进行匹配过滤后再进行分析
--limit 限制输出结果百分比或数量,默认值是20,即将最慢的20条语句输出,如果是50%则按总响应时间占比从大到小排序,输出到总和达到50%位置截止。
--host mysql服务器地址
--user mysql用户名
--password mysql用户密码
--history 将分析结果保存到表中,分析结果比较详细,下次再使用--history时,如果存在相同的语句,且查询所在的时间区间和历史表中的不同,则会记录到数据表中,可以通过查询同一CHECKSUM来比较某类型查询的历史变化。
--review 将分析结果保存到表中,这个分析只是对查询条件进行参数化,一个类型的查询一条记录,比较简单。当下次使用--review时,如果存在相同的语句分析,就不会记录到数据表中。
--output 分析结果输出类型,值可以是report(标准分析报告)、slowlog(Mysql slow log)、json、json-anon,一般使用report,以便于阅读。
--since 从什么时间开始分析,值为字符串,可以是指定的某个”yyyy-mm-dd [hh:mm:ss]”格式的时间点,也可以是简单的一个时间值:s(秒)、h(小时)、m(分钟)、d(天),如12h就表示从12小时前开始统计。
--until 截止时间,配合—since可以分析一段时间内的慢查询。
1、分析慢查询文件:
pt-query-digest slow-query.log > slow-query-report.log
2、分析最近1小时内的查询:
pt-query-digest --since=1h slow-query.log > slow-before_1h.log
3、分析指定时间范围内的查询:
pt-query-digest slow-query.log --since '2019-08-08 10:30:00' --until '2019-08-08 10:35:00' > slow-query-5min.log
4、针对某个用户的慢查询
pt-query-digest --filter '($event->{user} || "") =~ m/^dba/i' slow-query.log > slow-query-fordbauser.log
5、查询所有所有的全表扫描或full join的慢查询
pt-query-digest --filter '(($event->{Full_scan} || "") eq "yes") ||(($event->{Full_join} || "") eq "yes")' slow-query.log> slow_query-full.log
6、把查询保存到t_slowq_review表
pt-query-digest --user=dbms –password=admin --review h=localhost,D=test,t=t_slowq_review --create-review-table slow-query.log
7、把查询保存到t_slowq_details表
pt-query-digest --user=dbms –password=admin --history h=localhost,D=test,t=t_slowq_details --create-history-table slow-query.log
8、分析指含有select语句的慢查询
pt-query-digest --filter '$event->{fingerprint} =~ m/^select/i' slow-query.log> slow-query-justselect.log
9、通过tcpdump抓取mysql的tcp协议数据,然后再分析
tcpdump -s 65535 -x -nn -q -tttt -i any -c 1000 port 3306 > mysql.tcp3306.txt
pt-query-digest --type tcpdump mysql.tcp3306.txt> slow-query-3306.log
10、分析binlog
mysqlbinlog mysql-bin.000011 > mysql-bin.000011.sql
pt-query-digest --type=binlog mysql-bin.000011.sql > mysql-bin.000011.log
11.分析general log
pt-query-digest --type=genlog general_3306.log > generallog_3306.log
这里采用直接入库的方式,然后会采用脚本去切割,可以使用时间间隔切割,然后删除,尝试过since util,发现文本越大,每次的解析时间都会很长,不太适合生产部署,可以采用切割日志,如果需要保留本地的话,可以拷贝走
请注意要加charset,要不然sql中中文会乱码
pt-query-digest \
--user=user --password=password --port=port \
--review h=ip,D=dbname,t=t_slowq_review \
--history h=ip,D=dbname,t=t_slowq_details \
--no-report --limit=100% --charset=utf8 \
--filter="\$event->{Bytes}=length(\$event->{arg}) and \$event->{instanceid}=15 and \$event->{hostname}='idc-mysql18' and \$event->{client}=\$event->{ip}" \
slow-query.log
在生产中由于可能存在分库,所以sql的checksum值可能会一致,所以我们加了每个实例的id进去来标识不同的数据库,完全靠checksum值来标记可能混淆
CREATE TABLE `t_slowq_details` (
`instanceid_max` int(11) NOT NULL,
`hostname_max` varchar(64) NOT NULL,
`client_max` varchar(64) DEFAULT NULL,
`user_max` varchar(64) NOT NULL,
`db_max` varchar(64) DEFAULT NULL,
`checksum` char(32) NOT NULL,
`sample` longtext NOT NULL,
`ts_min` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`ts_max` datetime(6) NOT NULL,
`ts_cnt` float DEFAULT NULL,
`Query_time_sum` float DEFAULT NULL,
`Query_time_min` float DEFAULT NULL,
`Query_time_max` float DEFAULT NULL,
`Query_time_pct_95` float DEFAULT NULL,
`Query_time_stddev` float DEFAULT NULL,
`Query_time_median` float DEFAULT NULL,
`Lock_time_sum` float DEFAULT NULL,
`Lock_time_min` float DEFAULT NULL,
`Lock_time_max` float DEFAULT NULL,
`Lock_time_pct_95` float DEFAULT NULL,
`Lock_time_stddev` float DEFAULT NULL,
`Lock_time_median` float DEFAULT NULL,
`Rows_sent_sum` float DEFAULT NULL,
`Rows_sent_min` float DEFAULT NULL,
`Rows_sent_max` float DEFAULT NULL,
`Rows_sent_pct_95` float DEFAULT NULL,
`Rows_sent_stddev` float DEFAULT NULL,
`Rows_sent_median` float DEFAULT NULL,
`Rows_examined_sum` float DEFAULT NULL,
`Rows_examined_min` float DEFAULT NULL,
`Rows_examined_max` float DEFAULT NULL,
`Rows_examined_pct_95` float DEFAULT NULL,
`Rows_examined_stddev` float DEFAULT NULL,
`Rows_examined_median` float DEFAULT NULL,
`Rows_affected_sum` float DEFAULT NULL,
`Rows_affected_min` float DEFAULT NULL,
`Rows_affected_max` float DEFAULT NULL,
`Rows_affected_pct_95` float DEFAULT NULL,
`Rows_affected_stddev` float DEFAULT NULL,
`Rows_affected_median` float DEFAULT NULL,
`Rows_read_sum` float DEFAULT NULL,
`Rows_read_min` float DEFAULT NULL,
`Rows_read_max` float DEFAULT NULL,
`Rows_read_pct_95` float DEFAULT NULL,
`Rows_read_stddev` float DEFAULT NULL,
`Rows_read_median` float DEFAULT NULL,
`Merge_passes_sum` float DEFAULT NULL,
`Merge_passes_min` float DEFAULT NULL,
`Merge_passes_max` float DEFAULT NULL,
`Merge_passes_pct_95` float DEFAULT NULL,
`Merge_passes_stddev` float DEFAULT NULL,
`Merge_passes_median` float DEFAULT NULL,
`InnoDB_IO_r_ops_min` float DEFAULT NULL,
`InnoDB_IO_r_ops_max` float DEFAULT NULL,
`InnoDB_IO_r_ops_pct_95` float DEFAULT NULL,
`InnoDB_IO_r_ops_stddev` float DEFAULT NULL,
`InnoDB_IO_r_ops_median` float DEFAULT NULL,
`InnoDB_IO_r_bytes_min` float DEFAULT NULL,
`InnoDB_IO_r_bytes_max` float DEFAULT NULL,
`InnoDB_IO_r_bytes_pct_95` float DEFAULT NULL,
`InnoDB_IO_r_bytes_stddev` float DEFAULT NULL,
`InnoDB_IO_r_bytes_median` float DEFAULT NULL,
`InnoDB_IO_r_wait_min` float DEFAULT NULL,
`InnoDB_IO_r_wait_max` float DEFAULT NULL,
`InnoDB_IO_r_wait_pct_95` float DEFAULT NULL,
`InnoDB_IO_r_wait_stddev` float DEFAULT NULL,
`InnoDB_IO_r_wait_median` float DEFAULT NULL,
`InnoDB_rec_lock_wait_min` float DEFAULT NULL,
`InnoDB_rec_lock_wait_max` float DEFAULT NULL,
`InnoDB_rec_lock_wait_pct_95` float DEFAULT NULL,
`InnoDB_rec_lock_wait_stddev` float DEFAULT NULL,
`InnoDB_rec_lock_wait_median` float DEFAULT NULL,
`InnoDB_queue_wait_min` float DEFAULT NULL,
`InnoDB_queue_wait_max` float DEFAULT NULL,
`InnoDB_queue_wait_pct_95` float DEFAULT NULL,
`InnoDB_queue_wait_stddev` float DEFAULT NULL,
`InnoDB_queue_wait_median` float DEFAULT NULL,
`InnoDB_pages_distinct_min` float DEFAULT NULL,
`InnoDB_pages_distinct_max` float DEFAULT NULL,
`InnoDB_pages_distinct_pct_95` float DEFAULT NULL,
`InnoDB_pages_distinct_stddev` float DEFAULT NULL,
`InnoDB_pages_distinct_median` float DEFAULT NULL,
`QC_Hit_cnt` float DEFAULT NULL,
`QC_Hit_sum` float DEFAULT NULL,
`Full_scan_cnt` float DEFAULT NULL,
`Full_scan_sum` float DEFAULT NULL,
`Full_join_cnt` float DEFAULT NULL,
`Full_join_sum` float DEFAULT NULL,
`Tmp_table_cnt` float DEFAULT NULL,
`Tmp_table_sum` float DEFAULT NULL,
`Tmp_table_on_disk_cnt` float DEFAULT NULL,
`Tmp_table_on_disk_sum` float DEFAULT NULL,
`Filesort_cnt` float DEFAULT NULL,
`Filesort_sum` float DEFAULT NULL,
`Filesort_on_disk_cnt` float DEFAULT NULL,
`Filesort_on_disk_sum` float DEFAULT NULL,
`Bytes_sum` float DEFAULT NULL,
`Bytes_min` float DEFAULT NULL,
`Bytes_max` float DEFAULT NULL,
`Bytes_pct_95` float DEFAULT NULL,
`Bytes_stddev` float DEFAULT NULL,
`Bytes_median` float DEFAULT NULL,
PRIMARY KEY (`instanceid_max`,`checksum`,`ts_min`,`ts_max`) USING BTREE,
KEY `idx_hostname_max_ts_min` (`instanceid_max`,`ts_min`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(ts_min))
(PARTITION p201908 VALUES LESS THAN (1567267200) ENGINE = InnoDB,
PARTITION p201909 VALUES LESS THAN (1569859200) ENGINE = InnoDB,
PARTITION p201910 VALUES LESS THAN (1572537600) ENGINE = InnoDB,
PARTITION p201911 VALUES LESS THAN (1575129600) ENGINE = InnoDB,
PARTITION p201912 VALUES LESS THAN (1577808000) ENGINE = InnoDB) */
CREATE TABLE `t_slowq_review` (
`checksum` char(32) NOT NULL,
`fingerprint` longtext NOT NULL,
`sample` longtext NOT NULL,
`first_seen` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`last_seen` datetime(6) DEFAULT NULL,
`reviewed_by` varchar(20) DEFAULT NULL,
`reviewed_on` datetime(6) DEFAULT NULL,
`comments` longtext,
`reviewed_status` varchar(24) DEFAULT NULL,
PRIMARY KEY (`checksum`) USING BTREE,
KEY `idx_last_seen` (`last_seen`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
这个脚本只是提供一个思路,逻辑就是,会自动去切割slowlog,然后解析上传,切割后解析玩会删除前一个日志,如果需要保留本地日志可以把代码里面的删除改成拷贝走,其中我提出的instance_id是我们觉得生产中只有checksum没法和系统进行结合和日常使用,各位可以参考使用,并进行改造,其中有啥问题可以留言
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# create_time: 2019-08-08 11:20
import sys
import os
import socket
import time
from multiprocessing import Process
import subprocess
import pymysql as MySQLdb
DBMS_HOST = "" # 存储数据库的ip
DBMS_NAME = "" # 存储数据库的dbname
DBMS_PORT = # 存储数据库的端口
DBMS_PWD = "" # 存储数据库的账号密码
DBMS_USER = "" # 存储数据库的账号
DBA_HOST = "" # 被采集实例的ip,一般默认为 127.0.0.1
DBA_USER = "" # 被采集实例的 用户
DBA_PASSWORD = "" # 被采集实例 用户密码
# 一般数据实例上建立一个 dba管理账号来使用
class MySQLConnection(object):
def __init__(self, host, port, user, passwd, db=None, charset="utf8"):
self.host = host
self.port = port
self.user = user
self.passwd = passwd
self.db = db
self.charset = charset
self.error = ""
def Success(self):
return self.error == ""
def Connect(self):
self.error = ""
try:
self.conn = MySQLdb.connect(host=self.host, port=self.port, user=self.user, passwd=self.passwd,
charset=self.charset)
if self.db:
self.conn.select_db(self.db)
self.cursor = self.conn.cursor()
except Exception as e:
self.error = str(e)
def Close(self):
self.error = ""
try:
self.cursor.close()
self.conn.close()
except Exception as e:
self.error = str(e)
def Fileds(self):
res = []
try:
for r in self.cursor.description:
res.append(r[0])
except:
pass
return res
def Query(self, sql, params=None, rownum=[0], fetch=0):
self.error = ""
data = ()
try:
if params:
rownum[0] = self.cursor.execute(sql, params)
else:
rownum[0] = self.cursor.execute(sql)
if rownum[0] > 0:
if fetch == 0:
data = self.cursor.fetchall()
elif fetch == 1:
data = self.cursor.fetchone()
except Exception as e:
self.error = str(e)
return data
def Execute(self, sql, params=None):
self.error = ""
try:
if params:
self.cursor.execute(sql, params)
else:
self.cursor.execute(sql)
self.conn.commit()
except Exception as e:
self.error = str(e)
return False
return True
def RunCommand(c):
p = subprocess.Popen(c, stdout=subprocess.PIPE, stderr = subprocess.PIPE, shell = True)
(stdout, stderr) = p.communicate()
return (stdout.decode("utf-8"), stderr.decode("utf-8"))
def log_out(instance_id, start = time.time(), error = None):
"""
这里是用来记录慢sql解析情况的,会上报相关状态
"""
dbms_con = MySQLConnection(DBMS_HOST, DBMS_PORT, DBMS_USER, DBMS_PWD, DBMS_NAME)
dbms_con.Connect()
if dbms_con.error:
return
start_time = time.strftime('%Y-%m-%d %T', time.localtime(start))
dbms_con.Execute("replace into t_slowq_message(instance_id, error, start_time) values(%s, %s, %s)", (instance_id, error, start_time))
dbms_con.Close()
def GetSlowlogfile(instanceid,port):
"""
slowlog = datadir+slowlog
:param port:
:return:
"""
local_conn = MySQLConnection(host=DBA_HOST, port=port, user=DBA_USER, passwd=DBA_PASSWORD)
local_conn.Connect()
if not local_conn.error:
data = local_conn.Query("show global variables where variable_name in ('long_query_time', 'slow_query_log_file','datadir')")
if data:
res = {
data[0][0]: data[0][1],
data[1][0]: data[1][1],
data[2][0]: data[2][1],
}
local_conn.Close()
return res
local_conn.Close()
error = "get slowlog_file paremeters failed"
log_out(instance_id=instanceid, error=error)
return None
else:
log_out(instance_id=instanceid, error=str(local_conn.error))
local_conn.Close()
return None
def SwitchSlowlogfile(port):
local_conn = MySQLConnection(host=DBA_HOST, port=port, user=DBA_USER, passwd=DBA_PASSWORD)
local_conn.Connect()
if local_conn.error:
return False
tmp_log = "slowquery_%s.log" % (time.strftime('%Y%m%d%H%M', time.localtime(time.time())),)
res = local_conn.Execute("set global slow_query_log_file = %s", tmp_log)
return True
def CollectSlowlog(instanceid,port,hostname):
res = GetSlowlogfile(instanceid,port)
if res:
slowlog_time = eval(res["long_query_time"])
slowlog_file = res["slow_query_log_file"]
data_dir = res['datadir']
slow_log = data_dir+slowlog_file
if SwitchSlowlogfile(port):
try:
pt_query_digest = RunCommand("whereis pt-query-digest")[0].split()[1]
except:
pt_query_digest = "/usr/bin/pt-query-digest"
cmd = "%s --user=%s --password=%s --port=%s --review h=%s,D=%s,t=%s --history h=%s,D=%s,t=%s --no-report --limit=100%% --charset=utf8 " \
"--filter=\"\\$event->{Bytes} = length(\\$event->{arg}) and \\$event->{instanceid}=%s and \\$event->{hostname}='%s'" \
" and \$event->{client}=\$event->{ip}\" %s " % (
pt_query_digest, DBMS_USER, DBMS_PWD, DBMS_PORT, DBMS_HOST, DBMS_NAME,
"t_slowq_review", DBMS_HOST, DBMS_NAME, "t_slowq_details", instanceid, hostname,slow_log)
out, error = RunCommand(cmd)
if not error:
os.remove(slow_log)
log_out(instance_id=instanceid)
else:
error = "parse slowlog_file failed"
log_out(instance_id=instanceid,error=error)
else:
error = "switch slowlog_file failed"
log_out(instance_id=instanceid, error=error)
def Main():
"""
该部分可以自己进行改造,我这边的逻辑是我有一张记录实例的元数据库,来查询当前主机上面的实例,然后获取实例id和port,各位可以自行改造
"""
try:
hostname = socket.gethostname()
dbms_con = MySQLConnection(DBMS_HOST, DBMS_PORT, DBMS_USER, DBMS_PWD, DBMS_NAME)
dbms_con.Connect()
if dbms_con.error:
return
sql1 = "SELECT a.instance_id,a.port FROM t_info_instance AS a JOIN t_info_machine AS b ON a.m_id = b.m_id WHERE b.hostname = %s"
res1 = dbms_con.Query(sql1, (hostname,))
if res1 != None and len(res1) > 0:
for instance in res1:
instance_id = instance[0]
port = instance[1]
p = Process(target=CollectSlowlog,args=(instance_id,port,hostname))
p.start()
dbms_con.Close()
except Exception as e:
print (e)
if __name__ == "__main__":
Main()
另一张状态表表结构:
CREATE TABLE `t_slowq_message` (
`instance_id` int(11) NOT NULL,
`error` text,
`start_time` datetime NOT NULL,
`end_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`instance_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
有什么问题可以留言,提意见。。
后续可能会采用入es的方案,后续有时间更新。。