在我们的IT职业生涯中,能有一次百万级的数据库的优化经历是很难得的,如果你遇到了恭喜你,你的职业生涯将会更加完美,如果你遇到并解决了,那么一定足够你炫耀很多年。
这里我将要分享一次完美的百万级数据库优化经历,希望能给在IT行业的小伙伴一点启发和帮助。本篇分三步走:1、遇到问题;2、分析问题;3、解决问题;
我负责一个web项目,数据主表差不多500w的历史数据,最近老是听运营部门反馈系统卡顿,甚至发生过站点宕机的情况。
这里省略一万字,经历了各种排查,最终定性为db性能问题,进入下一步。
面对上面的问题,经过了各种常规慢查询、sql跟踪查询分析、连接数查询,始终无法解决问题,后来通过各种渠道的打听,得到了以下解决方法。
第一步,获取数据库表阻塞sql信息:
select * from (
SELECT spid,
blocked,
DB_NAME(sp.dbid) AS DBName,
program_name,
waitresource,
lastwaittype,
sp.loginame,
sp.hostname,
a.[Text] AS [TextData],
SUBSTRING(A.text, sp.stmt_start / 2,
(CASE WHEN sp.stmt_end = -1 THEN DATALENGTH(A.text) ELSE sp.stmt_end
END - sp.stmt_start) / 2) AS [current_cmd]
FROM sys.sysprocesses AS sp OUTER APPLY sys.dm_exec_sql_text (sp.sql_handle) AS A
) a join (select blocked as blocked2
from (select * from sys.sysprocesses where blocked>0 ) a
where not exists(select * from (select * from sys.sysprocesses where blocked>0 ) b
where a.blocked=spid)) b on a.blocked=b.blocked2
第二步,获取当前执行所有sql信息:
SELECT SPID = er.session_id,Status = ses.status,[Login] = ses.login_name,Host = ses.host_name,BlkBy = er.blocking_session_id,
DBName = DB_Name(er.database_id),CommandType = er.command,SQLStatement = st.text,ObjectName = OBJECT_NAME(st.objectid),
ElapsedMS = er.total_elapsed_time,CPUTime = er.cpu_time,IOReads = er.logical_reads + er.reads,IOWrites = er.writes,
LastWaitType = er.last_wait_type,StartTime = er.start_time,Protocol = con.net_transport,ConnectionWrites = con.num_writes,
ConnectionReads = con.num_reads,ClientAddress = con.client_net_address,Authentication = con.auth_scheme
FROM sys.dm_exec_requests er OUTER APPLY sys.dm_exec_sql_text(er.sql_handle) st
LEFT JOIN sys.dm_exec_sessions ses ON ses.session_id = er.session_id
LEFT JOIN sys.dm_exec_connections con ON con.session_id = ses.session_id
WHERE er.session_id > 50 ORDER BY er.blocking_session_id DESC,er.session_id
第三步,添加自动监控程序,每隔10s查询需要监控的数据库信息,然后把表阻塞和所有执行sql情况进行入库分析。
private void Process()
{
//获取itom监控数据
string strBlock = @"select * from (
SELECT spid,
blocked,
DB_NAME(sp.dbid) AS DBName,
program_name,
waitresource,
lastwaittype,
sp.loginame,
sp.hostname,
a.[Text] AS [TextData],
SUBSTRING(A.text, sp.stmt_start / 2,
(CASE WHEN sp.stmt_end = -1 THEN DATALENGTH(A.text) ELSE sp.stmt_end
END - sp.stmt_start) / 2) AS [current_cmd]
FROM sys.sysprocesses AS sp OUTER APPLY sys.dm_exec_sql_text (sp.sql_handle) AS A
) a join (select blocked as blocked2
from (select * from sys.sysprocesses where blocked>0 ) a
where not exists(select * from (select * from sys.sysprocesses where blocked>0 ) b
where a.blocked=spid)) b on a.blocked=b.blocked2";
string strBlockInfo = @"SELECT SPID = er.session_id,Status = ses.status,[Login] = ses.login_name,Host = ses.host_name,BlkBy = er.blocking_session_id,
DBName = DB_Name(er.database_id),CommandType = er.command,SQLStatement = st.text,ObjectName = OBJECT_NAME(st.objectid),
ElapsedMS = er.total_elapsed_time,CPUTime = er.cpu_time,IOReads = er.logical_reads + er.reads,IOWrites = er.writes,
LastWaitType = er.last_wait_type,StartTime = er.start_time,Protocol = con.net_transport,ConnectionWrites = con.num_writes,
ConnectionReads = con.num_reads,ClientAddress = con.client_net_address,Authentication = con.auth_scheme
FROM sys.dm_exec_requests er OUTER APPLY sys.dm_exec_sql_text(er.sql_handle) st
LEFT JOIN sys.dm_exec_sessions ses ON ses.session_id = er.session_id
LEFT JOIN sys.dm_exec_connections con ON con.session_id = ses.session_id
WHERE er.session_id > 50 ORDER BY er.blocking_session_id DESC,er.session_id";
DataTable dtBlock = SqlHelper.ExecuteDataTable(connectionTTItom, CommandType.Text, strBlock);
DataTable dtSqlInfo = SqlHelper.ExecuteDataTable(connectionTTItom, CommandType.Text, strBlockInfo);
//数据入库
if (dtBlock != null && dtBlock.Rows.Count > 0)
{
foreach (DataRow dr in dtBlock.Rows)
{
try
{
Insert_BlockedInfo(dr, "itom");
}
catch (Exception ex)
{
_logger.Info("Insert_BlockedInfo方法异常,"+ex.ToString());
}
}
}
if (dtSqlInfo != null && dtSqlInfo.Rows.Count > 0)
{
foreach (DataRow dr in dtSqlInfo.Rows)
{
try
{
Insert_SqlInfo(dr, "itom");
}
catch (Exception ex)
{
_logger.Info("Insert_SqlInfo方法异常," + ex.ToString());
}
}
}
}
有了以上的准备工作,有了所有的db执行情况数据,我们就很容易对db进行分析了。当业务反馈访问异常的时候,我们只需要打开监控数据,根据监控时间范围就能很容易得看到导致表阻塞的语句。
select top 200 * from dbo.Monitor_BlockedInfo where createtime>'xxxx' and createtime<'xxxx' order by id desc
除了查看当前阻塞语句,我们还可以根据当前时间点的并行sql数量来监控数据库高负载时间段,具体情况如下:
-- 先检查时间段内sql堆积数量
select convert(char(19),createtime,120),COUNT(1)
from dbo.Monitor_SqlInfo
where createtime>'2023-07-17 08:00:00' --and createtime<'2023-06-21 08:59:00'
group by convert(char(19),createtime,120)
order by COUNT(1) desc
最后,我们根据高负载时间点,查询当前时间点的所有sql执行情况,有了上面的数据,db的运行情况就被拿捏的稳稳的了。