SQL 2008 replication error ” retrying command or the process is running and waiting for a response from the server “
Posted by rahmanagoro on August 26, 2010
I was working with SQL 2008 replication and on checking replication monitor, I noticed that on the distribution agent for one of the databases, it had a retrying command status. I further looked into it to see what the problem was, here I can see that the distribution agent had become stucked whilst trying to send commands to the subscriber. The first thing that came to my mind was, “what is happening with the subscriber ? ”
use master GO sp_who2 active
I was able to see that there is a blocking spid at the moment, which is coming from my distributor. So what must have happened here is that, as a snapshot was being delivered to the subscriber, objects such as stored procedures and functions were also being delivered as well, thereby causing a deadlock to occur.
To confirm this, I ran the code below to check for replication errors.
select * from distribution..MSrepl_errors
I had entries like
1092 2010-08-01 00:35:16.560 0 0 MSSQL_ENG HYT00 Query timeout expired 0x0001262200000210000100000002 1 0
1092 2010-08-01 00:35:16.557 0 1 MSSQL_ENG if @@trancount > 0 rollback tran 0x0001262200000210000100000002 1 0
On the subscriber, I confirmed the locking scenario by looking at the wait type for the replication SPID, and it was a LCK_M_S wait type, which confirmed it was a deadlocking situation.
I also ran a deadlocking detail script as well.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED SELECT owt.session_id AS waiting_session_id, DB_NAME(tls.resource_database_id) AS database_name, owt.wait_duration_ms, -- owt.waiting_task_address, owt.wait_type, -- tls.resource_associated_entity_id, -- tls.resource_description AS local_resource_description, tls.resource_type, (CASE tls.resource_type WHEN 'OBJECT' THEN OBJECT_NAME(tls.resource_associated_entity_id, tls.resource_database_id) WHEN 'DATABASE' THEN DB_NAME(tls.resource_database_id) ELSE (SELECT OBJECT_NAME(pat.[object_id], tls.resource_database_id) FROM sys.partitions pat WHERE pat.hobt_id = tls.resource_associated_entity_id) END ) AS object_name, tls.request_mode, tls.request_type, tls.request_session_id, owt.blocking_session_id, owt.resource_description AS blocking_resource_description, (SELECT SUBSTRING(est.[text], ers.statement_start_offset/2 + 1, (CASE WHEN ers.statement_end_offset = -1 THEN LEN(CONVERT(nvarchar(max), est.[text])) * 2 ELSE ers.statement_end_offset END - ers.statement_start_offset ) / 2) FROM sys.dm_exec_sql_text(ers.[sql_handle]) AS est) AS waiting_query_text, CASE WHEN owt.blocking_session_id > 0 THEN ( SELECT est.[text] FROM sys.sysprocesses AS sp CROSS APPLY sys.dm_exec_sql_text(sp.[sql_handle]) as est WHERE sp.spid = owt.blocking_session_id) ELSE NULL END AS blocking_query_text, qp.query_plan AS waiting_query_plan FROM sys.dm_tran_locks AS tls INNER JOIN sys.dm_os_waiting_tasks owt ON tls.lock_owner_address = owt.resource_address INNER JOIN sys.dm_exec_requests ers ON tls.request_request_id = ers.request_id AND owt.session_id = ers.session_id OUTER APPLY sys.dm_exec_query_plan(ers.[plan_handle]) AS qp GO
and the other spid only had a begin tran statement showing from my dbcc inputbuffer statement.
I killed this SPID which had begin tran as the sql statement which it was running, and the replication commands were delivered and replication was working once more. The problem was fixed.