XDC Namespace create/update events are not getting replicated/applied

Hi,

We have an ongoing issue with our current clusters (1.17.1) and XDC, which manifests in the following ways:

ClusterA to ClusterB replication is broken only for namespace-related events; all workflow events are flowing:

  1. A new namespace X created in ClusterA (with ClusterA active) is not replicated to ClusterB, so ClusterB does not have the namespace.
  2. Events for namespace X workflows are getting replicated to ClusterB, and ClusterB returns a "namespace does not exist" error.

ClusterB to ClusterA seems to be half working:

  1. A new namespace Y created in ClusterB (with ClusterB active) is replicated to ClusterA.
  2. When namespace Y is failed over from ClusterB to ClusterA, ClusterB still shows ClusterB as the active cluster while ClusterA shows ClusterA as active, because the change confirmation was never processed by ClusterB.

There are also no leads in the logs as to what might be causing this issue.

Could someone help us get through this without having to zero the clusters?

Regards,
Andrey

With debug logging enabled, our primary cluster is receiving the updates when we update the namespace configuration:

{"level":"debug","ts":"2022-08-19T11:51:09.848Z","msg":"Successfully fetched namespace replication tasks","service":"worker","component":"replicator","component":"replication-task-processor","xdc-source-cluster":"cdt-westeurope-01-secondary","counter":1,"logging-call-at":"namespace_replication_message_processor.go:162"}

whereas our secondary cluster is not detecting any events from the primary when we update a namespace on the primary, and the counter stays at zero:

{"level":"debug","ts":"2022-08-19T11:54:19.786Z","msg":"Successfully fetched namespace replication tasks","service":"worker","component":"replicator","component":"replication-task-processor","xdc-source-cluster":"cdt-westeurope-01-primary","counter":0,"logging-call-at":"namespace_replication_message_processor.go:162"}

We are able to work around the issue by generating a number of messages and bumping the message ID beyond what Temporal considers as already requested.
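
One way to generate such messages is to issue repeated namespace updates, since each update to a global namespace enqueues a namespace replication task. A minimal Go SDK sketch (the frontend address, namespace name, and iteration count are placeholders, not our exact procedure):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	namespacepb "go.temporal.io/api/namespace/v1"
	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder: frontend of the cluster whose queue table was wiped.
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalf("unable to create client: %v", err)
	}
	defer c.Close()

	svc := c.WorkflowService()
	ctx := context.Background()

	// Each update produces a namespace replication task; repeat until the
	// message ID moves past the ack level the peer cluster is stuck on.
	for i := 0; i < 60; i++ {
		_, err := svc.UpdateNamespace(ctx, &workflowservice.UpdateNamespaceRequest{
			Namespace: "X",
			UpdateInfo: &namespacepb.UpdateNamespaceInfo{
				Description: fmt.Sprintf("bump replication message id %d", i),
			},
		})
		if err != nil {
			log.Fatalf("update %d failed: %v", i, err)
		}
		time.Sleep(200 * time.Millisecond)
	}
}
```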

I suspect there should always be at least one message kept in the queue for the ID to iterate from. The messages being cleaned up resulted in the queue counter going back to zero, while the queue_metadata counter remained intact and set to 52 for the secondary cluster. The queue table goes empty if we kill all the history servers, but the queue_metadata table is never cleaned up, leaving the counter set to the value last requested by the cluster peer. If the message ID is computed the way I suspect it is, then when the queue table gets emptied the metadata should also follow, or the queue table should be preserved; otherwise the problem is likely to repeat.
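
To make the suspicion concrete, here is a tiny self-contained sketch of the interaction I believe is happening (an illustration of the logic only, not the actual server code):

```go
package main

import "fmt"

type message struct{ id int64 }

// fetchAfter mimics "give me namespace replication messages with id > ackLevel",
// which is roughly what the peer cluster asks for on each poll.
func fetchAfter(ackLevel int64, queue []message) []message {
	var out []message
	for _, m := range queue {
		if m.id > ackLevel {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// queue_metadata still says the peer has acked up to 52 ...
	const staleAckLevel = 52
	// ... but the queue table was emptied and new messages start from 1 again.
	freshQueue := []message{{id: 1}, {id: 2}, {id: 3}}

	// Nothing is ever returned until new message IDs pass the stale ack level.
	fmt.Println("fetched:", len(fetchAfter(staleAckLevel, freshQueue))) // fetched: 0
}
```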

@tihomir - let us know if you would like us to raise an issue on GitHub for this.

@Andrey_Dubnik sorry for the late response, I am checking with the server team to get some input on this. Also looking into how the message ID is calculated and will get back on that too.

No problem at all. We just like to help make the product even better :slight_smile:

As a mitigation we have introduced PDBs into the system, so cluster maintenance should have less impact on Temporal's stability. Still, if for any reason all the history servers are killed at the same time, XDC breaks due to the issue described above.

The workaround is to update the queue_metadata table record and reset the counter, but it requires monitoring for the specific condition of the queue being empty while the queue_metadata counter is defined and set to a value > 0.
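
A rough sketch of such a check (hypothetical, assuming SQL persistence; the DSN and the queue_type constant are assumptions to verify against your schema, and the ack level itself lives inside the serialized queue_metadata blob, so here we only alert on the empty-queue + existing-metadata combination):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN for the Temporal persistence database.
	db, err := sql.Open("mysql", "temporal:temporal@tcp(db:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumption: queue_type 1 is the namespace replication queue.
	const namespaceReplicationQueueType = 1

	var queued int
	if err := db.QueryRow(
		"SELECT COUNT(*) FROM queue WHERE queue_type = ?", namespaceReplicationQueueType,
	).Scan(&queued); err != nil {
		log.Fatal(err)
	}

	var metadataRows int
	if err := db.QueryRow(
		"SELECT COUNT(*) FROM queue_metadata WHERE queue_type = ?", namespaceReplicationQueueType,
	).Scan(&metadataRows); err != nil {
		log.Fatal(err)
	}

	// Broken combination described above: queue wiped but metadata still present.
	if queued == 0 && metadataRows > 0 {
		fmt.Println("ALERT: namespace replication queue empty but queue_metadata still set")
	}
}
```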

@tihomir did you get anything back from the server team by chance?

Hi @Andrey_Dubnik, we are still looking into it.