databasesql server

What if you encounter Error number: 20598 with SQL Server replication ?

Today on one of the sql server I encountered an interesting scenario, one of the dbas accidentally deleted data from subscriber and then when he realized he turned around and delete the data from publisher. Crap ! is the diplomatic term to use here.  That too it was a huge table and part of a 50 table article. The moment he did that transactional replication stopped throwing the Error number: 20598. Which means the distributor is not able to deliver the delete statement to the subscriber since it cannot find the row.  Once this happens all the subsequent transactions just pile up. You got few ways to handle it:

1. Reinitialize the subscription (after giving the dba a kick in the you know what)
2. If replication must continue while you identify the source of the error, specify the parameter -SkipErrors 20598 for the Distribution Agent and either fix the subscriber table.
3. Manually insert the deleted rows to subscriber table, so the deletes can flow through
4. If all of the above can’t be done, then delete the offending commands from the MSrepl_commands table. (again give the dba a kick in the you know what !)

The last option should be done with care since you are messing with replication commands that the distributor has pending. The table distribution.dbo.MSrepl_commands table has all the commands the distributor has send and going to send to subscriber. If you delete more than what you need now you would have data inconsistency between publisher and subscriber.  In this post I am going to show you how you can identify the offending command and delete them from the MSrepl_commands table.

First lets find the errors that the distributor has encountered . You can find this either from the Replication monitor or by querying dbo.MSrepl_errors.  From the error_text column you should be able to identify the xact_seqno and command_id . Get the list of the xact_seqno and command_id that is causing the issue.

Second lets see what those commands are . For this execute the sp_browsereplcmds passing the xact_seqno. This will give you the list of all the commands that the distributor is trying to execute. Confirm the once you need to delete.
eg: sp_browsereplcmds @xact_seqno_start = ‘0x0000002A000000D3000700000000’

Third  delete the rows from the MSrepl_commands table so that replication can continue with other pending ones. Be very cautious on what rows you delete. You do not want to delete rows that are not causing the issue.
eg:  delete from dbo.MSrepl_commands
where xact_seqno = 0x0000002A000000D3000700000000 and command_id in (1,2)

Once you delete the offending rows from the MSrepl_commands, replication should start sending the commands to subscriber. You the replication monitor to see whether the undistributed commands queue is decreasing.

Use this approach at your own risk.