在我们的IT职业生涯中,能有一次百万级的数据库的优化经历是很难得的,如果你遇到了恭喜你,你的职业生涯将会更加完美,如果你遇到并解决了,那么一定足够你炫耀很多年。
这里我将要分享一次完美的百万级数据库优化经历,希望能给在IT行业的小伙伴一点启发和帮助。本篇分三步走:1、遇到问题;2、分析问题;3、解决问题;
1、遇到问题
我负责一个web项目,数据主表差不多500w的历史数据,最近老是听运营部门反馈系统卡顿,甚至发生过站点宕机的情况。
这里省略一万字,经历了各种排查,最终定性为db性能问题,进入下一步。
2、分析问题
面对上面的问题,经过了各种常规慢查询、sql跟踪查询分析、连接数查询,始终无法解决问题,后来通过各种渠道的打听,得到了以下解决方法。
第一步,获取数据库表阻塞sql信息:
select * from ( SELECT spid, blocked, DB_NAME(sp.dbid) AS DBName, program_name, waitresource, lastwaittype, sp.loginame, sp.hostname, a.[Text] AS [TextData], SUBSTRING(A.text, sp.stmt_start / 2, (CASE WHEN sp.stmt_end = -1 THEN DATALENGTH(A.text) ELSE sp.stmt_end END - sp.stmt_start) / 2) AS [current_cmd] FROM sys.sysprocesses AS sp OUTER APPLY sys.dm_exec_sql_text (sp.sql_handle) AS A) a join (select blocked as blocked2from (select * from sys.sysprocesses where blocked>0 ) awhere not exists(select * from (select * from sys.sysprocesses where blocked>0 ) bwhere a.blocked=spid)) b on a.blocked=b.blocked2
第二步,获取当前执行所有sql信息:
SELECTSPID = er.session_id,Status = ses.status,[Login] = ses.login_name,Host = ses.host_name,BlkBy = er.blocking_session_id,DBName = DB_Name(er.database_id),CommandType = er.command,SQLStatement = st.text,ObjectName = OBJECT_NAME(st.objectid),ElapsedMS = er.total_elapsed_time,CPUTime = er.cpu_time,IOReads = er.logical_reads + er.reads,IOWrites = er.writes,LastWaitType = er.last_wait_type,StartTime = er.start_time,Protocol = con.net_transport,ConnectionWrites = con.num_writes,ConnectionReads = con.num_reads,ClientAddress = con.client_net_address,Authentication = con.auth_schemeFROM sys.dm_exec_requests erOUTER APPLY sys.dm_exec_sql_text(er.sql_handle) stLEFT JOIN sys.dm_exec_sessions sesON ses.session_id = er.session_idLEFT JOIN sys.dm_exec_connections conON con.session_id = ses.session_idWHERE er.session_id > 50ORDER BYer.blocking_session_id DESC,er.session_id
第三步,添加自动监控程序,每隔10s查询需要监控的数据库信息,然后把表阻塞和所有执行sql情况进行入库分析。
private void Process(){//获取itom监控数据string strBlock = @"select * from ( SELECT spid, blocked, DB_NAME(sp.dbid) AS DBName, program_name, waitresource, lastwaittype, sp.loginame, sp.hostname, a.[Text] AS [TextData], SUBSTRING(A.text, sp.stmt_start / 2, (CASE WHEN sp.stmt_end = -1 THEN DATALENGTH(A.text) ELSE sp.stmt_end END - sp.stmt_start) / 2) AS [current_cmd] FROM sys.sysprocesses AS sp OUTER APPLY sys.dm_exec_sql_text (sp.sql_handle) AS A) a join (select blocked as blocked2from (select * from sys.sysprocesses where blocked>0 ) awhere not exists(select * from (select * from sys.sysprocesses where blocked>0 ) bwhere a.blocked=spid)) b on a.blocked=b.blocked2";string strBlockInfo = @"SELECTSPID = er.session_id,Status = ses.status,[Login] = ses.login_name,Host = ses.host_name,BlkBy = er.blocking_session_id,DBName = DB_Name(er.database_id),CommandType = er.command,SQLStatement = st.text,ObjectName = OBJECT_NAME(st.objectid),ElapsedMS = er.total_elapsed_time,CPUTime = er.cpu_time,IOReads = er.logical_reads + er.reads,IOWrites = er.writes,LastWaitType = er.last_wait_type,StartTime = er.start_time,Protocol = con.net_transport,ConnectionWrites = con.num_writes,ConnectionReads = con.num_reads,ClientAddress = con.client_net_address,Authentication = con.auth_schemeFROM sys.dm_exec_requests erOUTER APPLY sys.dm_exec_sql_text(er.sql_handle) stLEFT JOIN sys.dm_exec_sessions sesON ses.session_id = er.session_idLEFT JOIN sys.dm_exec_connections conON con.session_id = ses.session_idWHERE er.session_id > 50ORDER BYer.blocking_session_id DESC,er.session_id";DataTable dtBlock = SqlHelper.ExecuteDataTable(connectionTTItom, CommandType.Text, strBlock);DataTable dtSqlInfo = SqlHelper.ExecuteDataTable(connectionTTItom, CommandType.Text, strBlockInfo);//数据入库if (dtBlock != null && dtBlock.Rows.Count > 0){foreach (DataRow dr in dtBlock.Rows){try{Insert_BlockedInfo(dr, "itom");}catch (Exception ex){_logger.Info("Insert_BlockedInfo方法异常,"+ex.ToString());}}}if (dtSqlInfo != null && dtSqlInfo.Rows.Count > 0){foreach (DataRow dr in dtSqlInfo.Rows){try{Insert_SqlInfo(dr, "itom");}catch (Exception ex){_logger.Info("Insert_SqlInfo方法异常," + ex.ToString());}}}}
3、解决问题
有了以上的准备工作,有了所有的db执行情况数据,我们就很容易对db进行分析了。当业务反馈访问异常的时候,我们只需要打开监控数据,根据监控时间范围就能很容易得看到导致表阻塞的语句。
select top 200 * from dbo.Monitor_BlockedInfo where createtime>’xxxx’ and createtime<'xxxx' order by id desc
除了查看当前阻塞语句,我们还可以根据当前时间点的并行sql数量来监控数据库高负载时间段,具体情况如下:
— 先检查时间段内sql堆积数量
select convert(char(19),createtime,120),COUNT(1)
from dbo.Monitor_SqlInfo
where createtime>’2023-07-17 08:00:00′ –and createtime<'2023-06-21 08:59:00'
group by convert(char(19),createtime,120)
order by COUNT(1) desc
最后,我们根据高负载时间点,查询当前时间点的所有sql执行情况,有了上面的数据,db的运行情况就被拿捏的稳稳的了。