When troubleshooting Transactional Replication performance break down the flow of data into 4 simultaneous conversation. Examine performance of each conversation. This will help verify where the bottleneck investigation should begin.
1) LogReader Reader thread is reading the Transaction Log via stored procedure sp_replcmds, a wrapper for xp_replcmds. It scans the transaction log for transactions marked for replication, skipping not replicated transactions.
2) LogReader Writer thread writes queued transaction from the Reader thread into the Distribution database using sp_MSadd_replcmds.
3) Distribution Reader thread executes sp_MSget_repl_commands query to retrieve pending commands from the Distribution database and storing in an internal queue.
4) Distribution Writer thread writing queue commands to the Subscriber via parameterized stored procedures prefixed with sp_MSupd..., sp_MSins..., sp_MSdel... to apply individual row changes to each article at the subscriber.
5) Log Reader and Distributor are also executing a “history” thread writing summary data to MSlogreader_history and MSdistribution_history system tables of the distribution database. |
Tools to assist with troubleshooting performance
SQL Profiler: There are multiple methods to examine in detail the Transaction Replication conversations. As with all application connecting to SQL Server, the individual commands being executed by Replication components can be captured using SQL Profiler tool. The Profile trace can be sorted or queried to find long running statements.
Examples of which ApplicationNames you will find:
Example of Replication Stored Procedure calls:
Activity Monitor: SQL Management Studio Activity Monitor, Performance Dashboard for SQL 2005, Performance Data Warehouse for SQL 2008, or system stored procedures help find blocking, disk bottlenecks, and statements causing high IO and high CPU. To help identity Replication components look for the LogReader or Distribution Agent Job Names as the Program_Name connecting to SQL Server. To see a list of the Replication Jobs, run the query below
SELECT name, description, enabled from MSDB..sysjobs
WHERE category_id>10 and category_id<20
Tracer Tokens: To gain a quick high-level overview of Replication performance use the Tracer Token feature in Replication Monitor. The features provides “Publisher to Distributor” and “Distributor to Subscriber” latency values. To use this feature, launch Replication Monitor. In the My Publishers list, drill in and click on the publication you wish to test. In the right pane, choose the Tracer Token tab and click the “Insert Tracer Token” button. There will be a list of subscriptions below. Monitor the “Publisher to Distributor” and “Distributor to Subscriber” values.
This method is only helpful in where times are somewhat real-time. If the topology is behind 3 days, then you will never see the tracer token make it to the subscriber in time, so the only thing you can see is if the Log Reader is keeping up or getting behind. That is still valuable information, but more trivial than if this is a problem where the subscriber is only 5 minutes behind and you can see the full flight of the token from start to finish.
If you’re seeing long time for “Publisher to Distributor” to populate, focus on the troubleshooting the LogReader performance. If LogReader performance is fast while the “Distributor to Subscriber” values are high, transactions are moving quickly from the Published database transaction log to the Distribution database but slow from the Distribution database to the Subscribers. The Distribution Agent should be the focus of the bottleneck investigation.
The following links provide sample code to measure latency over time.
Performance Monitor: You can also use Windows Performance Monitor's to track replication LogReader 'LogReader:Delivery Latency' counter or the replication Distribution 'Dist:Delivery Latency' counter. However, Windows Performance Monitor's Replication counter values are only updated after the Log Reader or Distribution Agent completes a processing phase and records performance stats. The counter values will report 0 commands/sec and transactions/sec while a large transaction consisting of millions of rows is being processed.
General Troubleshooting:
Adding the “-OUTPUT” parameter to the Log Reader breaks down the time for both Reader (Fetch time), and the Writer (Write time)conversations. In the sample output below the Reader-Thread took 55 minutes while the Writer-Thread needed only 3 minutes to processes the transaction indicating Log Reader Reader-Thread is the bottleneck.
Replication Log Reader Agent
http://msdn.microsoft.com/en-us/library/ms146878.aspx
******************** STATISTICS SINCE AGENT STARTED ***************************
Execution time (ms): 3401930
Work time (ms): 3401470
Distribute Repl Cmds Time(ms): 3354520
Fetch time(ms): 3343920 <-- read time 55 minutes
Repldone time(ms): 5520
Write time(ms): 140070 <-- write time 3 minutes
Num Trans: 8 Num Trans/Sec: 0.002352
Num Cmds: 477219 Num Cmds/Sec: 140.297871
*******************************************************************************
Starting with SQL Server builds below the Log Reader Agent stats are displayed ever 5 minutes and written to the MSlogreader_history table in the Distribution database.
• |
Cumulative Update 12 for SQL Server 2005 Service Pack 2 |
• |
Cumulative Update 5 for SQL Server 2008 |
• |
Cumulative Update 2 for SQL Server 2008 Service Pack 1 |
Possible Cause: large batch of replicated transaction
How large is the transaction Log for this published database? Has it grown to an unexpected size causing longer read times by the Log Reader thread.
DBCC SQLPERF(LOGSPACE)
The reader thread of the Log Reader executes sp_replcmds to pull commands from the transaction log. The stored procedures below can be executed in SQL Server Management Studio to track time required to pull next pending replicated transaction. How long do these command take to execute? How many commands were returned? Scanning a large transaction log for a single transaction containing a million+ row modifications may take 20+ minutes to complete. The execution time for this query approximates the startup time for the Log Reader to replicate this same transaction.
Display 1st pending transaction metadata such as publication_id, article id, xactid, and command_type. Is this command part of normal database operation or an unexpected large batch of commands?
sp_replcmds [ @maxtrans = ] maxtrans
http://msdn.microsoft.com/en-us/library/ms186983.aspx
example:
--Return all commands from the next pending transaction
sp_replcmds @maxtrans = 1
GO
--Return all commands for the next 500 transactions
sp_replcmds @maxtrans = 500
GO
sp_replflush
GO
Display 1st transactions metadata along with text of commands yet to replicate. Commands type = 1073741871 are pending Tracer Tokens.
sp_replshowcmds [ @maxtrans = ] maxtrans
http://msdn.microsoft.com/en-us/library/ms175114.aspx
example:
--Return all commands from the next pending transaction with CMD text
sp_replshowcmds @maxtrans = 1
GO
--Return all commands for the next 500 transactions with CMD text
sp_replshowcmds @maxtrans = 500
GO
sp_replflush
GO
Another technique to see summary of a batch of pending transactions/commands is to querysys.dm_repl_traninfo view. This displays summary data on replicated transactions in the log last read by sp_replcmds or sp_replshowcmds, including the “cmds_in_tran.” A high value may indicate a large transaction is being replicated.
sys.dm_repl_traninfo
http://msdn.microsoft.com/en-us/library/ms180036.aspx
sp_replcmds @maxtrans = 500 -- retrieve 500 pending transactions
GO
SELECT dbid, begin_lsn, commit_lsn, rows, cmds_in_tran
FROM sys.dm_repl_traninfo
GO
dbid begin_lsn commit_lsn rows cmds_in_tran
------ -------------------------------- -------------------------------- ----------- ------------
10 0000081A:0000013A:0001 0000081A:0000013A:0004 1 1
10 0000081A:00000140:0001 0000081A:00000140:0004 1 1
10 0000081A:00021385:0001 0000081E:00004EA2:0009 6874 6874
The following commands can be used to determine the number of non-replicated vs. replicated transactions. A transaction log with a high percentage of non-replicated transaction will cause latency as the Log Reader scans over transaction to be ignored. These can be database maintenance transactions such as an online reinidex or data modifications to table not being replicated. A 25gig transaction log may take up to 15 minutes to scan depending on the IO subsystem.
Use <published database>
GO
-- Total records in the log
SELECT count(*) FROM ::fn_dblog(NULL, NULL)
GO
-- Records marked for REPLICATION
SELECT count(*) FROM ::fn_dblog(NULL, NULL) WHERE Description='REPLICATE'
GO
If you catch the large transaction in the act of being processed by sp_replcmds, you should also see the long run time of that event in SQL Profiler trace data. Using a captured Profile trace run command below to query the results:
SELECT duration, starttime, endtime, textdata
FROM ::fn_trace_gettable('C:\PUBLISHERDB_sp_trace.trc', DEFAULT)
WHERE TEXTDATA LIKE '%SP_REPLCMDS%' AND EVENTCLASS=10
Duration StartTime EndTime TextData
388 2008-10-21 15:00:06.010 2008-10-21 15:00:06.010 exec sp_replcmds 500,0,-1,0x,5000,0,500000
393 2008-10-21 15:00:11.010 2008-10-21 15:00:11.010 exec sp_replcmds 500,0,-1,0x,5000,0,500000
397 2008-10-21 15:00:16.010 2008-10-21 15:00:16.010 exec sp_replcmds 500,0,-1,0x,5000,0,500000
66302488 2008-10-21 15:00:21.010 2008-10-21 15:01:27.290 exec sp_replcmds 500,0,-1,0x,5000,0,500000
413 2008-10-21 15:01:27.353 2008-10-21 15:01:27.353 exec sp_replcmds 500,0,-1,0x,66344,0,500000
Shown above, processing the large transaction causes sp_replcmds took 66 seconds to execute.
Resolution:
If Reader Latency is caused by large number of pending commands, waiting for the Log Reader to catch up may be the best short-term solution. Long-term options include replicating batches during non-peak time or as stored procedure executions.
Publishing Stored Procedure Execution in Transactional Replication
http://msdn.microsoft.com/en-us/library/ms152754(SQL.90).aspx
Another option to consider is configuring the Log Reader to break up the one large transaction into multiple batches. By committing large transaction in smaller batches we may be able to reduce the Log Reader overhead. However, performance improvement would be most noticeable when Distributing data to subscribers in multiple parallel STREAMS.
Replication Log Reader Agent
http://msdn.microsoft.com/en-us/library/ms146878.aspx
Depending on estimated time to read pending transactions and transfer them to the subscribers, it may be faster to mark all transactions as “replicated”, then reinitialize the subscribers with a new snapshot or via backup/restore. This is step should only be taken if the time to generated a new Replication Shapshot and deliver to subscriber is faster than waiting for individual pending commands to be replicated.
sp_repldone (Transact-SQL)
http://msdn.microsoft.com/en-us/library/ms173775.aspx
EXEC sp_repldone @xactid = NULL, @xact_segno = NULL, @numtrans = 0,@time = 0, @reset = 1
Possible Cause: large number of non-replication transactions:
This can be more challenging to identify. It helps to have a good knowledge of which articles are being replicated and what batch changes have been recently performed. The BINARY column in the Profile Trace represents the number of rows processed by sp_replcmds. Multiple executions with few rows usually indicates LogReader spent time skipping over non-replicated transactions such as large table performing an online reindex.
The output below from the LogReader log clearly shows many rows are being read while only a few are being replicated.
2008-11-11 23:41:35.527 The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately 500000 log records have been scanned in pass # 3, 142 of which were marked for replication, elapsed time 22535010 (ms).
Resolution:
This is unavoidable when large number of transaction for non-replicating table. If problem is caused by index maintenance log records, consider performing offline reindexing. Another option may be stopping and starting the online reindexing to create smaller batch of index update transactions.
Possible Cause: High number of VLFs:
A large number of Virtual Log Files (VLFs) can contributed to long running read times for the LogReader. To determine the number of VLFs, execute the following command and note the number of segments (rows) returned. Segment counts in 100K+ may be contributing to Log Reader Reader-Thread performance problems.
USE <published DB>
GO
DBCC LOGINFO
Resolution:
Shrink the transaction log to move extra VLFs. Set the database “growth” to value which allows for growing without creating high number of VLFs
907511 How to use the DBCC SHRINKFILE statement to shrink the transaction log file in SQL Server 2005
http://support.microsoft.com/default.aspx?scid=kb;EN-US;907511
Possible Cause: Slow Read IO:
Slow disk subsystem may be contribution to the long read times. Good overall measure of disk read performance is Windows Performance Monitor counter Avg Disk Sec/Read.
Resolution:
Read times in >15ms (.0015) may indicate an IO read bottleneck and should be investigated by the storage subsystem vendor.
Possible Cause: Slow Network IO:
For remote Distributors using a PULL Log Reader, the logread.exe is executing on the Distributor but Pulling rows from the Publishers transaction log. Unexpected network IO problem will contribute to the Reader-Thread latency.
Resolution:
Using Windows Performance monitor both the Publication and Distribution servers, check the Network Interface 'Output Queue Length' counter for the active NIC - the counter should consistently be under 2.
For more information please see the following MSDN topic titled 'Network Interface Object.'
http://msdn.microsoft.com/en-us/library/ms803962.aspx
Possible Cause: Blocking: High values in the Log Reader Agent -OUTPUT log for “Write time(ms):” indicate bottleneck writing commands to the Distribution database. Blocking may be caused by another SQL Replication Agent such as a the Distribution Cleanup Agent. Investigate blocking by the Log Reader Writer-Thread using SQL Server built in tools such as SQL Server Management Studio Activity Monitor, Performance Dashboard, Performance Database Warehouse, or stored procedures like sp_who.
For SQL 2005 download and install the Performance Dashboard. This provide a high-level look at the Distributor performance. Blocking, long running queries, high IO wait times are displayed in a graphical format. For SQL 2008, these reports are included in the Performance Data Warehouse.
SQL Server 2005 Performance Dashboard Reports
Resolution:
If the “head blocker” is the Distribution cleanup, consider stopping this Agent, allowing data to get replicated, then restart cleanup during off-peak hours. Blocking may indicate IO taking longer than expected to complete. See “High IO” for additional troubleshooting steps.
Possible Cause: High IO: Check the Profile trace collected on the Distributor for the Duration and CPU values for the executions of sp_MSadd_replcmds.
A high IO may indicate poor query plan. Using a captured Profile trace read the CPU and Duration for the completed events using the command below.
SELECT duration, starttime, endtime, textdata
FROM ::fn_trace_gettable('C:\DISTRIBUTOR_sp_trace.trc', DEFAULT)
WHERE TEXTDATA LIKE '%SP_MSadd_replcmds%' AND EVENTCLASS=10
Durations of 10x over the CPU times indicate resource contention. Running SQL Server Management Studio reports on the Distributor look for blocking of the Log Reader agent.
With SQL 2005 and 2008 system DMVs can also be used to retrieve the duration and CPU values for the Log Reader Writer-Thread. Low Duration and CPU but high Logical Reads may indicate a poor query plan caused by out of data table statistics. The command below will retrieve the query plan along with the execution statistics.
--dm_exec_query_stats for Log Reader Writer-Thread sp_MSadd_replcmds
-- by top total_worker_time
SELECT TOP 25
st.text, qp.query_plan,
(qs.total_logical_reads/qs.execution_count) as avg_logical_reads,
(qs.total_logical_writes/qs.execution_count) as avg_logical_writes,
(qs.total_physical_reads/qs.execution_count) as avg_phys_reads,
qs.*
FROM sys.dm_exec_query_stats as qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) as qp
WHERE st.text like 'CREATE PROCEDURE sp_MSadd_replcmds%'
ORDER BY qs.total_worker_time DESC
Go
Possible Cause: Slow WRITE IO:
Slow disk subsystem may be contribution to the long write times. Good overall measure of disk read performance is Windows Performance Monitor counter Avg Disk Sec/Write.
Resolution:
Write times in >20ms (.0020) may indicate an IO write bottleneck and should be investigated by the storage subsystem vendor.
Possible Cause: Network IO
Similar to situation above, you may want to investigate Network interface queue length should be confirmed.
Like the Log Reader, the Distribution Agent is 2 simultaneous conversation going. Delay in moving rows from the Distributor to the Subscriber may occur in either or both of these conversation. The Reader-Thread conversation is responsible for querying the Distribution database for the list of transactions to apply at the subscriber. It first queries the subscriber metadata table to find the last successfully replicated transactions. Next the Reader-Thread executes sp_MSget_repl_commands to begin reading rows from the Distribution database and populates this rows into an in-memory queue. As the rows get queued, the Distribution Agent Writer-Thread conversation begins writing these rows to the Subscriber. Like with the Log Reader, adding the Distribution Agent -OUPUT parameter will provide good overview of the Distribution Agent performance. As noted above, recent builds of SQL Server now automatically records this information in the Distribution Agent history table MSdistribution_history.
In this output the READS completes in 92ms at a rate of 565 cmds/sec while the WRITES are taking 14,000ms( 14 seconds ) and only processing 3.69 cmds/sec. This example clearly shows the WRITE are slower then reads.
************************ STATISTICS SINCE AGENT STARTED ***********************
Total Run Time (ms) : 18828 Total Work Time : 14110
Total Num Trans : 52 Num Trans/Sec : 3.69
Total Num Cmds : 52 Num Cmds/Sec : 3.69
Writer Thread Stats
Time Spent on Exec : 12063
Time Spent on Commits (ms): 422 Commits/Sec : 0.14
Time to Apply Cmds (ms) : 14110 Cmds/Sec : 3.69 <-- Write Thread
Time Cmd Queue Empty (ms) : 671 Empty Q Waits > 10ms: 2
Total Time Request Blk(ms): 671
Reader Thread Stats
Calls to Retrieve Cmds : 2
Time to Retrieve Cmds (ms): 92 Cmds/Sec : 565.22 <-- Reader Thread
Time Cmd Queue Full (ms) : 5486 Full Q Waits > 10ms : 3
*******************************************************************************
Possible Cause: Large Batch of Transactions
Latency problem are often caused when series of transactions are trying to move a large batch of commands to the Subscribers. The queries below show overall rows counts and index statistics for commands stored in the Distribution database.
--ROW COUNTS
--Look for high row counts (>1millions) indicate cleanup not running
-- or large pending transactions.
SELECT name, rowcnt, STATS_DATE (id, indid) as 'Last Update Stats'
FROM distribution.dbo.sysindexes
WHERE name IN('ucMSrepl_transactions', 'ucMSrepl_commands')
Are the row counts expected or do they now contain millions of rows? High rows counts (>1 million) may indicate a large transaction is being processes or cleanup procedure is not running.
When performance troubleshooting latency a review of pending commands by day by # commands may uncover helpful pattern. A breakdown of the commands being stored in the Distribution database can be retrieve by running the following quries.
-- Check the Time associated with those Transaction Counts into temp table
select t.publisher_database_id, t.xact_seqno,
max(t.entry_time) as EntryTime, count(c.xact_seqno) as CommandCount
into #results
FROM MSrepl_commands c with (nolock)
LEFT JOIN msrepl_transactions t with (nolock)
on t.publisher_database_id = c.publisher_database_id
and t.xact_seqno = c.xact_seqno
GROUP BY t.publisher_database_id, t.xact_seqno
-- Show each hour and number of commands per Day:
SELECT publisher_database_id
,datepart(year, EntryTime) as Year
,datepart(month, EntryTime) as Month
,datepart(day, EntryTime) as Day
,datepart(hh, EntryTime) as Hour
--,datepart(mi, EntryTime) as Minute
,sum(CommandCount) as CommandCountPerTimeUnit
FROM #results
GROUP BY publisher_database_id
,datepart(year, EntryTime)
,datepart(month, EntryTime)
,datepart(day, EntryTime)
,datepart(hh, EntryTime)
--,datepart(mi, EntryTime)
--order by publisher_database_id, sum(CommandCount) Desc
ORDER BY publisher_database_id, Month, Day, Hour
In the sample output below, a large batch of transactions were being replicated as result of table updates causing slowdown in the Distribution Agent.
publisher_database_id Year Month Day Hour CommandCountPerTimeUnit
--------------------- ----------- ----------- ----------- ----------- -----------------------
2 2009 5 14 10 132
2 2009 5 14 11 656
2 2009 5 14 12 880
2 2009 5 14 13 4379
2 2009 5 14 14 152
2 2009 5 14 15 1478
2 2009 5 14 20 161
2 2009 5 14 21 145
2 2009 5 15 6 1700
2 2009 5 15 7 3672
2 2009 5 15 8 6266
2 2009 5 15 9 329
2 2009 5 15 10 5678715
2 2009 5 15 11 5637959
2 2009 5 15 12 5281732
2 2009 5 15 13 5020950
2 2009 5 15 14 1252
2 2009 5 16 11 732
2 2009 5 16 12 178
2 2009 5 16 13 725
2 2009 5 16 14 186
2 2009 5 16 16 72
Resolution:
Large batch of transactions require heavy IO requirements by the Distribution Agent Reader-Thread on the distribution database. Fast disk subsystem with Transaction log and database on separate drives/LUNs may help improve IO performance. If this will be an ongoing pattern consider replicating the stored procedure EXECUTION instead of the RESULTS.
Publishing Stored Procedure Execution in Transactional Replication
http://msdn.microsoft.com/en-us/library/ms152754(SQL.90).aspx
Troubleshooting Reader Thread: Distribution Agent latency in the Reader Thread occurs when retrieving rows from the DistributionDB. Obtain Agent history or -OUTPUT and examine the Agent stats which are generated every 5 minutes.
************************ STATISTICS SINCE AGENT STARTED ***********************
03-22-2009 09:55:19
Total Run Time (ms) : 59189906 Total Work Time : 492741
Total Num Trans : 5 Num Trans/Sec : 0.01
Total Num Cmds : 5 Num Cmds/Sec : 0.01
Total Idle Time : 58660470
Writer Thread Stats
Total Number of Retries : 0
Time Spent on Exec : 0
Time Spent on Commits (ms): 16 Commits/Sec : 0.05
Time to Apply Cmds (ms) : 0 Cmds/Sec : 0.00
Time Cmd Queue Empty (ms) : 528717 Empty Q Waits > 10ms: 5
Total Time Request Blk(ms): 59189187
P2P Work Time (ms) : 0 P2P Cmds Skipped : 0
Reader Thread Stats
Calls to Retrieve Cmds : 11738
Time to Retrieve Cmds (ms): 492741 Cmds/Sec : 0.01
Time Cmd Queue Full (ms) : 0 Full Q Waits > 10ms : 0
*******************************************************************************
This -OUTPUT taken from the Distribution Agent log shows 100% of the duration was spent read rows from the Distribution database before the Agent was stopped.
The Agent Log also records time when events completed. In the example below, 4 minutes elapsed between time the Distribution Agent queries the Subscriber for the starting transaction to replicate and time it retrieved a batch of transactions from the Distribution database. The latency is caused by Reader Threads are waiting for CMDs to be returned from the Distribution DB.
--Example from Distribution Agent -OUTPUT showing delay in READER thread.
--Read for Subscription metadata completes
2009-05-16 01:26:49.229 OLE DB Distributor 'SQL380': {call sp_MSget_subscription_guid(11)}
--Four minutes later read of replication cmds completes.
2009-05-16 01:29:31.401 sp_MSget_repl_commands timestamp value is: 0x0x0000081a00000045000400000000
--Reader threads are waiting for CMDs to be returned from the Distribution DB.
To test the READER execute times, run sp_MSget_repl_commands from Query Window, look for Blocking, table scans, timeouts, etc.
1. From the -OUTPUT get the LSN for the long runing sp_MSget_repl_commands
2009-05-16 01:29:31.401 sp_MSget_repl_commands timestamp value is: 0x0x0000081a00000045000400000000
2. Get the Agent_ID from the same log
3. Execute sp_MSget_repl_commands query against Distribution DB
sp_MSget_repl_commands @agent_id=11, @last_xact_seqno=0x0000081a00000045000400000000
How long did this command take to execute? Was it blocked? Were they are large number of CMDs to be returned?
You can also retrieve the last successful transaction from the subscriber’s database using query below. The transaction_timestamp value contains the Log Sequence Number (LSN) used for the sp_MSget_repl_commands stored procedure.
USE <subscriber db>
--Retrieve last successful Transaction
-- Multiple rows per publication indicate parallel distribution streams
SELECT publisher,publication, distribution_agent,transaction_timestamp
FROM dbo.MSreplication_subscriptions
For even more details, examine the execution plan using query below.
USE distribution
GO
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SET STATISTICS PROFILE ON
SET STATISTICS IO ON
SET STATISTICS TIME ON
sp_MSget_repl_commands @agent_id=11, @last_xact_seqno=0x0000081a00000045000400000000
SET STATISTICS PROFILE OFF
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
In SQL 2005/2008 we can pull the CACHED execution plans from the DMVs.
--dm_exec_query_stats for sp_MSget_repl_commands
-- by top total_worker_time
SELECT TOP 25
st.text, qp.query_plan,
(qs.total_logical_reads/qs.execution_count) as avg_logical_reads,
(qs.total_logical_writes/qs.execution_count) as avg_logical_writes,
(qs.total_physical_reads/qs.execution_count) as avg_phys_reads,
qs.*
FROM sys.dm_exec_query_stats as qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) as qp
WHERE st.text like 'CREATE PROCEDURE sys.sp_MSget_repl_commands%'
ORDER BY qs.total_worker_time DESC
Go
Possible Cause: Table Statistics
Long running Reader-Thread query when transactions containing many small batches may also indicate Replication system tables need to have statistics updates. Table statistics can become out-of-date if AutoUpdateStatistics has been disable for the Distribution database.
--Updated Replication table statistics 'Updated' date
DBCC SHOW_STATISTICS(MSrepl_commands,ucMSrepl_commands)
DBCC SHOW_STATISTICS(MSrepl_transactions,ucMSrepl_transactions)
--Retrieve Distribution database settings
-- Look for IsAutoUpdateStatistics
sp_helpdb distribution
--or (returns 1)
select DATABASEPROPERTYEX('distribution','IsAutoUpdateStatistics')
Resolution:
--Updated Replication table statistics
UPDATE STATISTICS MSrepl_transactions
WITH FULLSCAN
UPDATE STATISTICS MSrepl_commands
WITH FULLSCAN
Resolution:
Using built-in SQL tools, verify Distribution Agent Reader-Thread is not being blocked at the Distribuiton database.
The Writer-Thread calls Replication created stored procedures to apply changes to the subscriber. These sp’s rely on a user table’s unique index (usually primary key) to locate records to be updated or deleted. Latency in the Writer thread is usually see in long-running execution time for these stored procedures.
sp_MSIns_articlename
sp_MSUpd_articlename
sp_MSDel_articlename
Possible Cause: Long execution times in the INS/UPD/DEL stores procedures used to apply transactions to the subscriber.
In the sample output below, the Distribution Agent latency is occurring in the WRITER thread of the Distribution Agent Latency
--dm_exec_query_stats for sp_MSupd (Subscriber Updates)
-- by top total_worker_time
SELECT TOP 250
st.text, qp.query_plan,
(qs.total_logical_reads/qs.execution_count) as avg_logical_reads,
(qs.total_logical_writes/qs.execution_count) as avg_logical_writes,
(qs.total_physical_reads/qs.execution_count) as avg_phys_reads,
qs.*
FROM sys.dm_exec_query_stats as qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) as qp
WHERE st.text like '%sp_MSupd_dboemployee%' -- sp_MSupd_dbo<table name>%'
ORDER BY qs.total_worker_time DESC
Go
If the stored procedures have already been flushed from cache or SQL Server has been restarted, a Profile trace on the Subscriber taken during the problem will provide the summary execution data. Once loaded by SQL Nexus or fn_ReadTrace the stored procedure below will summaries the stored procedures called the Distribution Agent.
-- measure ALL spids for performance
select hashid, count(*) as Num, sum(duration) as sumdur, sum(cpu) assumcpu, sum(reads) as sumreads, sum(writes) as sumwrites, avg(duration) asavgdur, avg(cpu) as avgcpu, avg(reads) as avgreads, avg(writes) asavgwrites
INTO ReadTrace.topCPUBatch
FROM ReadTrace.tblBatches
group by hashid
order by sumcpu desc
GO
-- Get the text for the top CPU batches
select b.*,a.NormText
FROM ReadTrace.tblUniqueBatches a
JOIN ReadTrace.topCPUBatch b
ON a.hashid=b.hashid
where NormText like '%SP_MSGET_REPL_COMMANDS%'
--OR NormText like '%SP_MSHELP_DISTRIBUTION_AGENTID%'
--OR NormText like '%SP_MSGET_SUBSCRIPTION_GUID%'
OR NormText like '%SP_MSins%'
OR NormText like '%SP_MSupd%'
OR NormText like '%SP_MSdel%'
order by sumcpu desc
Example Output
hashid Num sumdur sumcpu sumreads sumwrites avgdur avgcpu avgreads avgwrites NormText
-------------------- --- ------- ----------- ----------- ----------- ------- ------ ----------- ----------- -------------------------------
-5684440255897011116 69 966058 389 4071 0 14000 5 59 0 SP_MSGET_REPL_COMMANDS
-2908468862153496396 69 1466258 204 5175 0 21250 2 75 0 SP_MSHELP_DISTRIBUTION_AGENTID
-7220247869142477571 69 539588 31 1152 0 7820 0 16 0 SP_MSGET_SUBSCRIPTION_GUID
Possible Cause:
In the sample output below, the Distribution Agent latency is occurring in the WRITER thread of the Distribution Agent Latency
************************ STATISTICS SINCE AGENT STARTED ***********************
Total Run Time (ms) : 18828 Total Work Time : 14110
Total Num Trans : 52 Num Trans/Sec : 3.69
Total Num Cmds : 52 Num Cmds/Sec : 3.69
Writer Thread Stats
Time Spent on Exec : 12063
Time Spent on Commits (ms): 422 Commits/Sec : 0.14
Time to Apply Cmds (ms) : 14110 Cmds/Sec : 3.69
Time Cmd Queue Empty (ms) : 671 Empty Q Waits > 10ms: 2
Total Time Request Blk(ms): 671
Reader Thread Stats
Calls to Retrieve Cmds : 2
Time to Retrieve Cmds (ms): 92 Cmds/Sec : 565.22
Time Cmd Queue Full (ms) : 5486 Full Q Waits > 10ms : 3
*******************************************************************************
In this output the READS completes in 92ms at a rate of 565 cmds/sec while the WRITES are taking 14,000ms( 14 seconds ) and only processing 3.69 cmds/sec. This example clearly shows the WRITE are slower then reads.
Possible Cause: User Defined Triggers: High values in the Log Reader Agent -OUTPUT log for “Write time(ms):” indicate bottleneck writing commands to the Distribution database. Once possible cause are user defined triggers adding theFirst investigate blocking caused by other Replication Agent.
Possible Cause: Not Replicating SQL statements as “Parameters”
Obtain the create publication script and check ALL articles to see if the SQL statements are being sent as batch of 'parameters' statements. The status = 8 indicates 'parameters' batching is not enabled.
exec sp_addarticle @publication = ... @status = 24
or
select name, status from sysarticles
Article status can be updated on-demand using the following Replication stored procedure. This should be executed on all articles.
EXEC sp_changearticle @publication = N'<pub name>', @article = N'<article name>',
@property = 'status', @value = 'parameters'
Capture Profile RPC:COMPLETE events on the Subscriber. Look for the time difference between StartTime and EndTime and/or Duration. Statements should be completing around 100ms. Longer execution times for single row update/inserts may indicate blocking on the subscriber or user-defined triggers on the subscriber tables contributing to the high durations.
Look at the time difference between when on RPC:COMPLETES and the next PRC:STARTING. This is the "round trip" time back to the Distribution Agent to say that command completed and next commands is received. Times difference larger then 500ms may indicate network latency.
Also examine time from when READER thread finished reading a batch of commands until these commands were applied at the Subscriber. Long times between these 2 events also indicate long WRITER thread.
--reader thread completed -OUTPUT entry
2009-05-14 01:07:50.795 sp_MSget_repl_commands timestamp value is: 0x0x0000081a00000033000400000000
--writeer thread commit -OUTPUT entry
2009-05-14 01:13:50.011 Parameterized values for above command(s): {{N'ZZZZZZZZ', 1}, {N'XXXXXXXX', 2}, {N'YYYYYYYY', 3}}
When troubleshooting long Writer Thread look focus on the time to execute the UPDATE/INSERT/DELETEs on the Subscriber. Execution times can be found in either PSSDIAG/Profiler trace or from execution DMVs.
Execution time (ms): 3401930
Work time (ms): 3401470
Distribute Repl Cmds Time(ms): 3354520
Fetch time(ms): 3343920
Repldone time(ms): 5520
Write time(ms): 140070
Num Trans: 8 Num Trans/Sec: 0.002352
Num Cmds: 477219 Num Cmds/Sec: 140.297871
转:http://blogs.msdn.com/b/chrissk/archive/2009/05/25/transactional-replication-conversations.aspx