Multi-Cluster Replication Performance Tuning

Hi there,

I am trying to set up multi-cluster replication following the docs.
After enabling the cluster connections and upserting the namespace configuration for the namespace, replication starts, but at regular intervals I am seeing spikes in replication_tasks_lag and replication_latency. Does that mean that replication cannot keep up with the rate at which workflows/activities are created in the active cluster? If so, what are my main options?

In another post I found that history.ReplicationTaskFetcherAggregationInterval, history.ReplicatorTaskBatchSize and history.ReplicationTaskProcessorHostQPS are the main options for controlling the lag. Is this correct? I tried going from 2 s to 500 ms for the aggregation interval, and I doubled the host QPS from 1500 to 3000, but did not notice an obvious improvement. Is it safe to increase these values more aggressively? Is there a risk that replication requests to the active cluster impact actual user requests?
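For context, this is roughly how I am setting those knobs in the server's dynamic config file (the file path is deployment-specific and the batch-size value is just an illustration; the key names are the ones mentioned above):

```yaml
# dynamicconfig/production.yaml -- path depends on your deployment
history.ReplicationTaskFetcherAggregationInterval:
  - value: 500ms    # down from the 2s default, so tasks are fetched more often
history.ReplicatorTaskBatchSize:
  - value: 50       # illustrative value; number of tasks pulled per fetch
history.ReplicationTaskProcessorHostQPS:
  - value: 3000     # doubled from my previous 1500; per-host processing cap
```

Happy to be corrected if any of these are the wrong lever.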

Also, is there anything else needed for an initial replication of a given namespace? This post seems to imply that sleeping workflows or workflows waiting for signals won't be replicated right away. I am a bit surprised by this, but at the same time there seems to be a ForceReplicationWorkflow in the Temporal codebase. Do I need to execute this when starting the replication? If so, is there a tctl or temporal admin command for it, or do I somehow start the workflow directly?
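From reading the codebase, my best (untested) guess at starting it would be something like the following — the workflow type `force-replication`, the `temporal-system` namespace and the `default-worker-tq` task queue are inferred from the server code, and the timeout and input namespace are placeholders:

```shell
# Untested sketch: start the force-replication system workflow via tctl.
# "my-namespace" is the namespace I want to backfill; names may be off.
tctl --namespace temporal-system workflow start \
  --taskqueue default-worker-tq \
  --workflow_type force-replication \
  --execution_timeout 7200 \
  --input '{"Namespace": "my-namespace"}'
```

Is this the intended way to trigger it, or is there a dedicated admin command?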

Sorry, lots of questions, but I seem to be missing some pieces of the puzzle here. Thanks in advance for any help.

–Hardy

Turns out that one bottleneck was that the additional load on history persistence was exceeding the history.persistenceMaxQPS limit, and the underlying Cassandra database also had trouble keeping up with the increased read load. Increasing the limit and scaling the database improved the situation.
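For anyone who hits the same wall: the persistence limit is also a dynamic-config knob. Something like the following worked for me (the value shown is illustrative — check what your cluster currently runs with before raising it, and only after the database has headroom):

```yaml
# Raised only after scaling Cassandra; the right value depends on your cluster.
history.persistenceMaxQPS:
  - value: 9000
```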