Occasionally in the replication newsgroups questions come up regarding transactions in transactional replication. People want to know what happens if a transaction on the publisher fails - does it still get applied at the subscriber? Or what if it runs on the publisher and fails on the subscriber - does it always/ever roll back? Then there are questions regarding transaction size. If an update statement affects, say, 100000 rows on the publisher, this will be logged as 100000 updates in the transaction log. In some such cases the log reader agent may not be able to cope, and even if the log reader succeeds, there may be similar problems for the distribution agent. Users are interested in the possible workarounds for these situations. There is only a little information in BOL to help clarify and solve these types of query, so I have tied my various findings together into this article in the hope that it helps out.
We start with a simple case. Suppose we add 4 records within a single transaction at the publisher (dbo.tTest below stands in for any published table with a primary key column pk):
set xact_abort on
begin tran
insert into dbo.tTest (pk) values (1)
insert into dbo.tTest (pk) values (2)
insert into dbo.tTest (pk) values (3)
insert into dbo.tTest (pk) values (4)
commit tran
The log reader agent reads the transaction log and adds these records to the distribution database in the MSrepl_commands and MSrepl_transactions tables. If we use sp_browsereplcmds to check the actual text of these inserts, we can see that the transactional nature of this batch is maintained:
xact_seqno | Command  | Command_Id
seq_aaa    | Command1 | 1
seq_aaa    | Command2 | 2
seq_aaa    | Command3 | 3
seq_aaa    | Command4 | 4
The constant xact_seqno identifies the transaction, and the incrementing command_id indicates the commands in the batch. So, transactions on the publisher are applied as transactions on the subscriber.
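For reference, the output above comes from running sp_browsereplcmds at the distributor. A minimal sketch (the distribution database is assumed to have the default name, and I've omitted the optional filtering parameters such as @xact_seqno_start, which you'd normally use to limit the output on a busy system):
use distribution
go
-- decodes and returns the commands stored in MSrepl_commands
exec sp_browsereplcmds
go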
What happens to the replication setup when a transaction fails at the publisher - does it still get replicated? Suppose we perform the same transaction as above, but "Command4" fails. On the publisher there will be an error message, the transaction is rolled back because of the xact_abort setting, and no rows are added. Using sp_browsereplcmds, none of these commands are visible, and they are not to be found on the subscriber - showing that only committed transactions are replicated to the subscriber.
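As a quick illustration (again assuming the hypothetical dbo.tTest with its primary key on pk), a duplicate value in the final insert fails the batch, xact_abort rolls back the whole transaction, and nothing gets marked for replication:
set xact_abort on
begin tran
insert into dbo.tTest (pk) values (11)
insert into dbo.tTest (pk) values (12)
insert into dbo.tTest (pk) values (13)
insert into dbo.tTest (pk) values (11) -- duplicate key error: xact_abort rolls back the whole transaction
commit tran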
What if the transaction completes at the publisher and subsequently fails at the subscriber? This could occur because we haven't been vigilant and someone has been allowed to edit the subscriber data (yes - it happens :)). In this case the distribution agent fails and raises an error message. The failure itself is expected, but the inbuilt error handler has a hard-coded exception command: "if @@trancount > 0 rollback tran" (which is actually independent of the xact_abort setting on the publisher), meaning that transactions failing at the subscriber are automatically rolled back. So, until the problem is fixed, the distribution agent will keep failing and retrying - it never applies just part of the transaction, and the ACID properties are maintained. Note that this is the default behaviour, which may be overridden (see later).
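The effect is roughly equivalent to the pattern below. This is purely an illustration of the behaviour, not the agent's actual code - the real commands are applied via the auto-generated sp_MSins_/sp_MSupd_/sp_MSdel_ procedures:
begin try
    begin tran
    -- all the commands belonging to one replicated transaction are applied here
    insert into dbo.tTest (pk) values (21)
    insert into dbo.tTest (pk) values (22)
    commit tran
end try
begin catch
    -- the hard-coded handler: undo any partially applied work
    if @@trancount > 0 rollback tran
    -- the error is then reported and the whole transaction is retried later
end catch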
The next thing to look at is the issue of large transactions. Remember that all commands are implicitly run as transactions, so although a TSQL statement might look innocuous at first glance - e.g. a simple update statement - if it updates 100000 rows, this will be logged as:
begin tran
update row 1
update row 2
update row 3
...
update row 100000
commit tran
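In other words, a statement as simple as the one below (again using the hypothetical dbo.tTest, this time assumed to have a status column) is one command to us, but one logged update per qualifying row as far as the log reader is concerned:
-- a single statement at the publisher...
update dbo.tTest
set status = 'processed'
where status = 'pending' -- ...but 100000 individual logged updates if 100000 rows qualify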
The log reader agent will try to read each of these commands and transfer them to the distribution database. During this process, what problems can occur?
Basically we might get some sort of timeout of the log reader agent. The resulting messages vary and include:
"The process could not execute 'sp_replcmds' on 'xxxxxx'."
"Status: 2, code: 0, text: 'Timeout expired'"
"A time out occurred while waiting for memory resources to execute the query."
"Agent 'xxx' is retrying after an error. 0 retries attempted. See agent job history in the Jobs folder for more details."
"The step was cancelled (stopped) as the result of a stop job request"
These errors all relate to the number of records marked for replication in the log which the log reader has to parse and process. There are a few agent parameters (set in the agent profile or on the agent's command line) which can be used to modify the process and avoid the error - a sketch showing how they are applied follows this list:
(a) increase the QueryTimeout. This is the usual solution. It won't decrease the latency of a big transaction, but it is often sufficient on its own to get things working again.
(b) change the MaxCmdsInTran. If the problem is that the transaction contains a huge number of commands, then it can be an advantage to chop it up into several smaller transactions. This will massively decrease latency and prevent the associated timeouts. However, it will cause a problem if there is a command within the transaction which fails. Remember that a transaction is implicitly rolled back on the subscriber when there is an error. If we use the MaxCmdsInTran parameter, the child batch which contains the problem row will error, causing the distribution agent to fail. However, any earlier child batches will already have been committed, breaking the ACID properties of our original transaction. Therefore there is a definite risk associated with using this parameter.
(c) change the ReadBatchSize. This is the maximum number of transactions read out of the transaction log of the publishing database "per processing cycle". As there is no supported way to configure the "processing cycle" (changing -LogScanThreshold is unsupported), we can equate this to saying "it is the number of transactions read per run of the log reader agent". The lowest this figure can go is 1, so if there is one huge transaction causing the problem, this setting will not help at all to remove a timeout. However, if the timeout is the result of a lot of small transactions being processed, setting this to a smaller figure than the default of 500 to temporarily clear the backlog, along with increasing the QueryTimeout, is known to remove the problem.
(d) change the ReadBatchThreshold. This is very similar to the -ReadBatchSize parameter above, but rather than relating to transactions it relates to commands, i.e. it is the number of replication commands to be read from the transaction log in a processing batch. Likewise, setting this to a small value along with increasing the QueryTimeout can help remove a backlog, especially if the problem is caused by large transactions.
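As a rough sketch of how these are applied (the server and database names below are placeholders and the values are purely illustrative), the parameters can be appended to the Log Reader Agent's "Run agent" job step command:
-Publisher [MYPUBLISHER] -PublisherDB [MyPublishedDB]
-Distributor [MYDISTRIBUTOR] -DistributorSecurityMode 1 -Continuous
-QueryTimeout 3600 -ReadBatchSize 100 -ReadBatchThreshold 1000 -MaxCmdsInTran 10000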
The distribution agent can hit similar timeouts when moving commands from the distribution database to the subscriber. Just like with the log reader agent we can use the QueryTimeout parameter, and increasing its value will be the first option to try. In the case of the distribution agent though we have 2 extra parameters: CommitBatchSize, which is "the number of transactions to be issued to the Subscriber before a COMMIT statement is issued. The default is 100.", and CommitBatchThreshold, which is "the number of replication commands to be issued to the Subscriber before a COMMIT statement is issued. The default is 1000." - both definitions coming from BOL.
These definitions are really quite confusing. Do they mean that the ACID properties are necessarily broken if we have a transaction containing more than 1000 commands? What is the "COMMIT" being referred to? Empirically I find that this is not an option to break up a transaction into a series of smaller transactions, so the word "COMMIT" in the BOL explanations is misleading and doesn't really refer to the original transaction.
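These parameters are specified in the same way as the log reader ones, for example appended to the Distribution Agent's job step command (server and database names below are placeholders, and the values of 2 are simply the ones used in the test that follows):
-Publisher [MYPUBLISHER] -PublisherDB [MyPublishedDB]
-Subscriber [MYSUBSCRIBER] -SubscriberDB [MySubscriberDB]
-Distributor [MYDISTRIBUTOR] -DistributorSecurityMode 1 -Continuous
-CommitBatchSize 2 -CommitBatchThreshold 2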
Consider an example where 4 rows are added at the publisher and the 4th row causes an error at the subscriber (again using the hypothetical dbo.tTest - say the subscriber already has a row with pk = 34):
begin tran
insert into dbo.tTest (pk) values (31)
insert into dbo.tTest (pk) values (32)
insert into dbo.tTest (pk) values (33)
insert into dbo.tTest (pk) values (34) -- succeeds at the publisher, but fails when replayed at the subscriber
commit tran
If we set CommitBatchSize and CommitBatchThreshold both to 2, what happens when the final insert fails at the subscriber? If the ACID properties were broken and replication had produced 2 separate transactions at the subscriber, the 2 rows from the first committed batch would remain there. In practice no rows at all are inserted at the subscriber. So, the logic of the original transaction is always preserved.
It seems to me that the usefulness of these 2 parameters relates to cases involving blocking issues at the subscriber. Smaller batches mean fewer locks are held at the subscriber for each separate batch. If there were no subscriber access other than that of the distribution agent, the transaction would overall take longer to apply if we used small batches. However, in the presence of other processes needing to perform reads or updates of committed data at the subscriber (i.e. I'm not referring to dirty reads), blocking issues might mean that smaller batches will overall work more quickly. Clearly you'd need to really understand the usage profile of the subscriber to determine if these parameters are going to improve throughput and concurrency. In practice I've never used these parameters, but I suppose I know they're there if required.
Hopefully this article helps clarify how transactions are applied in transactional replication, and how this knowledge can be used to optimise the system.