XDC Namespace create/update events are not getting replicated/applied

Hi,

We have an ongoing issue with our current clusters (1.17.1) and XDC, which manifests in the following ways:

ClusterA to ClusterB replication is broken only for namespace-related events; all workflow events are flowing:

  1. A new namespace X created in ClusterA (with ClusterA active) is not replicated to ClusterB, so ClusterB does not have the namespace.
  2. Events for namespace X workflows are getting replicated to ClusterB, and ClusterB returns a "namespace does not exist" error.

ClusterB to ClusterA seems to be half working:

  1. A new namespace Y created in ClusterB (with ClusterB active) is replicated to ClusterA.
  2. When namespace Y is failed over from ClusterB to ClusterA, ClusterB still shows ClusterB as the active cluster while ClusterA shows ClusterA as active, because the change confirmation was never processed by ClusterB.

There are also no leads in the logs as to what might be causing this issue.

Could someone help us get through this without having to zero the clusters?

Regards,
Andrey

With debug logging enabled, our primary cluster is receiving the updates when we update the namespace configuration:

{"level":"debug","ts":"2022-08-19T11:51:09.848Z","msg":"Successfully fetched namespace replication tasks","service":"worker","component":"replicator","component":"replication-task-processor","xdc-source-cluster":"cdt-westeurope-01-secondary","counter":1,"logging-call-at":"namespace_replication_message_processor.go:162"}

whereas our secondary cluster is not detecting any events from the primary when we update a namespace on the primary, and the counter stays at zero:

{"level":"debug","ts":"2022-08-19T11:54:19.786Z","msg":"Successfully fetched namespace replication tasks","service":"worker","component":"replicator","component":"replication-task-processor","xdc-source-cluster":"cdt-westeurope-01-primary","counter":0,"logging-call-at":"namespace_replication_message_processor.go:162"}

We are able to work around the issue by generating a number of messages and bumping the message ID beyond what Temporal considers as already requested.
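
One way to generate such messages is to issue repeated namespace updates, since each update to a global namespace enqueues a namespace replication task. A minimal Go SDK sketch (the frontend address, namespace name, and iteration count are placeholders, not our exact procedure):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	namespacepb "go.temporal.io/api/namespace/v1"
	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder: frontend of the cluster whose queue table was wiped.
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalf("unable to create client: %v", err)
	}
	defer c.Close()

	svc := c.WorkflowService()
	ctx := context.Background()

	// Each update produces a namespace replication task; repeat until the
	// message ID moves past the ack level the peer cluster is stuck on.
	for i := 0; i < 60; i++ {
		_, err := svc.UpdateNamespace(ctx, &workflowservice.UpdateNamespaceRequest{
			Namespace: "X",
			UpdateInfo: &namespacepb.UpdateNamespaceInfo{
				Description: fmt.Sprintf("bump replication message id %d", i),
			},
		})
		if err != nil {
			log.Fatalf("update %d failed: %v", i, err)
		}
		time.Sleep(200 * time.Millisecond)
	}
}
```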

I suspect there should always be at least one message kept in the queue for the ID to iterate from. The messages being cleaned up resulted in the queue counter going back to zero, while the queue_metadata counter remained intact and set to 52 for the secondary cluster. The queue table goes empty if we kill all the history servers, but the queue_metadata table is never cleaned up, leaving the counter set to the value last requested by the cluster peer. If the message ID is computed the way I suspect it is, then when the queue table gets emptied the metadata should also follow, or the queue table should be preserved; otherwise the problem is likely to repeat.
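
To make the suspicion concrete, here is a tiny self-contained sketch of the interaction I believe is happening (an illustration of the logic only, not the actual server code):

```go
package main

import "fmt"

type message struct{ id int64 }

// fetchAfter mimics "give me namespace replication messages with id > ackLevel",
// which is roughly what the peer cluster asks for on each poll.
func fetchAfter(ackLevel int64, queue []message) []message {
	var out []message
	for _, m := range queue {
		if m.id > ackLevel {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	// queue_metadata still says the peer has acked up to 52 ...
	const staleAckLevel = 52
	// ... but the queue table was emptied and new messages start from 1 again.
	freshQueue := []message{{id: 1}, {id: 2}, {id: 3}}

	// Nothing is ever returned until new message IDs pass the stale ack level.
	fmt.Println("fetched:", len(fetchAfter(staleAckLevel, freshQueue))) // fetched: 0
}
```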

@tihomir - let us know if you would like us to raise an issue on GitHub for this.

@Andrey_Dubnik sorry for the late response, I am checking with the server team to get some input on this. Also looking into how the message ID is calculated and will get back on that too.

No problem at all. We just like to help make the product even better :slight_smile:

As a mitigation we have introduced PDBs into the system, so cluster maintenance should have less impact on Temporal's stability. Still, if for any reason all the history servers are killed at the same time, XDC breaks due to the issue described above.

The workaround is to update the queue_metadata table record and reset the counter, but it requires monitoring for the specific condition of the queue being empty while the queue_metadata counter is defined and set to a value > 0.
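
A rough sketch of such a check (hypothetical, assuming SQL persistence; the DSN and the queue_type constant are assumptions to verify against your schema, and the ack level itself lives inside the serialized queue_metadata blob, so here we only alert on the empty-queue + existing-metadata combination):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN for the Temporal persistence database.
	db, err := sql.Open("mysql", "temporal:temporal@tcp(db:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumption: queue_type 1 is the namespace replication queue.
	const namespaceReplicationQueueType = 1

	var queued int
	if err := db.QueryRow(
		"SELECT COUNT(*) FROM queue WHERE queue_type = ?", namespaceReplicationQueueType,
	).Scan(&queued); err != nil {
		log.Fatal(err)
	}

	var metadataRows int
	if err := db.QueryRow(
		"SELECT COUNT(*) FROM queue_metadata WHERE queue_type = ?", namespaceReplicationQueueType,
	).Scan(&metadataRows); err != nil {
		log.Fatal(err)
	}

	// Broken combination described above: queue wiped but metadata still present.
	if queued == 0 && metadataRows > 0 {
		fmt.Println("ALERT: namespace replication queue empty but queue_metadata still set")
	}
}
```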

@tihomir did you get anything back from the server team by chance?

Hi @Andrey_Dubnik, we are still looking into it.