Multi-Cluster Replication Performance Tuning

Hi there,

I am trying to set up multi-cluster replication following the docs.
After enabling the cluster connections and upserting the namespace configuration for the namespace, replication starts, but at regular intervals I am seeing spikes in replication_tasks_lag and replication_latency. Does that mean that replication cannot keep up with the rate at which workflows/activities are created in the active cluster? If so, what are my main options?

In another post I found that history.ReplicationTaskFetcherAggregationInterval, history.ReplicatorTaskBatchSize and history.ReplicationTaskProcessorHostQPS are the main options for controlling the lag. Is this correct? I tried going from 2 s to 500 ms for the aggregation interval, and I doubled the host QPS from 1500 to 3000, but did not notice an obvious improvement. Is it safe to increase these values more aggressively? Is there a risk that replication requests to the active cluster impact actual user requests?
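For context, this is roughly how I am setting those knobs in the server's dynamic config file (the file path is deployment-specific and the batch-size value is just an illustration; the key names are the ones mentioned above):

```yaml
# dynamicconfig/production.yaml -- path depends on your deployment
history.ReplicationTaskFetcherAggregationInterval:
  - value: 500ms    # down from the 2s default, so tasks are fetched more often
history.ReplicatorTaskBatchSize:
  - value: 50       # illustrative value; number of tasks pulled per fetch
history.ReplicationTaskProcessorHostQPS:
  - value: 3000     # doubled from my previous 1500; per-host processing cap
```

Happy to be corrected if any of these are the wrong lever.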

Also, is there anything else needed for an initial replication of a given namespace? This post seems to imply that sleeping workflows or workflows waiting for signals won't be replicated right away. I am a bit surprised by this, but at the same time there seems to be a ForceReplicationWorkflow in the Temporal codebase. Do I need to execute this when starting the replication? If so, is there a tctl or temporal admin command for it, or do I somehow start the workflow directly?
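From reading the codebase, my best (untested) guess at starting it would be something like the following — the workflow type `force-replication`, the `temporal-system` namespace and the `default-worker-tq` task queue are inferred from the server code, and the timeout and input namespace are placeholders:

```shell
# Untested sketch: start the force-replication system workflow via tctl.
# "my-namespace" is the namespace I want to backfill; names may be off.
tctl --namespace temporal-system workflow start \
  --taskqueue default-worker-tq \
  --workflow_type force-replication \
  --execution_timeout 7200 \
  --input '{"Namespace": "my-namespace"}'
```

Is this the intended way to trigger it, or is there a dedicated admin command?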

Sorry, lots of questions, but I seem to be missing some pieces of the puzzle here. Thanks in advance for any help.

–Hardy

Turns out that one bottleneck was that the additional load on history persistence was exceeding the history.persistenceMaxQPS limit, and the underlying Cassandra database also had trouble keeping up with the increased read load. Increasing the limit and scaling the database improved the situation.
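For anyone who hits the same wall: the persistence limit is also a dynamic-config knob. Something like the following worked for me (the value shown is illustrative — check what your cluster currently runs with before raising it, and only after the database has headroom):

```yaml
# Raised only after scaling Cassandra; the right value depends on your cluster.
history.persistenceMaxQPS:
  - value: 9000
```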