We recently had a production release in which I had to add two new subscribers to the replication set. We were moving onto new hardware, but we couldn’t remove the two older subscribers yet: they had to stay running for a few more weeks in case we had issues with the two new database servers in the new data center. The strategy was that, should we encounter issues in the new data center, we would simply re-route our applications to the two subscribers in the old data center.
After adding the two new subscribers, which brought us to six subscribers in total, I started seeing errors in the replication snapshot agent. The error was as follows:
The replication agent has not logged a progress message in 10 minutes. This might indicate an unresponsive agent or high system activity. Verify that records are being replicated to the destination and that connections to the Subscriber, Publisher, and Distributor are still active.
The only article I could find from Microsoft on this was http://technet.microsoft.com/en-us/library/ms152484.aspx. It only repeated what the message already says and didn’t detail a proper workaround. One thing to note about our replication topology is that the same server acts as the OLTP server, publisher and distributor, and the snapshot agent failures normally happened at busy periods of the day. Even so, I still couldn’t nail down the problem. I decided to add some form of verbose logging to the snapshot agent job to see what was going on, but by the time I got round to doing this, I was told that we could remove the two extra subscribers. After they were removed, the errors stopped. It was also difficult to recreate the problem in the test environment, because we had a maximum of two subscribers there, and the issue only started after the two extra subscribers were added.
I also looked at changing the heartbeat interval, but this didn’t resolve the problem; it merely changed the error message to reflect the new interval.
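For reference, the heartbeat interval is a Distributor property, and a change of this kind can be sketched in T-SQL with sp_changedistributor_property. The value of 20 minutes below is an assumption for illustration only, not the value I actually tried and not a recommendation.

```sql
-- Run at the Distributor.
-- heartbeat_interval is the maximum number of minutes an agent can go
-- without logging a progress message (the default is 10).
-- The value 20 is a hypothetical example.
USE master;
GO

EXEC sp_changedistributor_property
    @property = N'heartbeat_interval',
    @value = 20;
GO
```

A change like this only moves the threshold at which the warning is raised, which matches what I saw: the same message, quoting a different number of minutes.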
It’s a weird one, but one for which I hope Microsoft will be able to provide a more accurate workaround.