Hi there,
I am trying to set up multi-cluster replication following the docs.
After enabling the cluster connections and upserting the replication configuration for the namespace, replication starts, but at regular intervals I am seeing spikes in replication_tasks_lag and replication_latency. Does that mean that replication cannot keep up with the rate at which workflows/activities are created in the active cluster? If so, what are my main options?
In another post I found that history.ReplicationTaskFetcherAggregationInterval, history.ReplicatorTaskBatchSize and history.ReplicationTaskProcessorHostQPS are the main options to control the lag. Is this correct? I tried reducing the aggregation interval from 2 s to 500 ms and doubling the host QPS from 1500 to 3000, but I did not notice an obvious improvement. Is it safe to change these values more aggressively? Is there a risk that replication requests to the active cluster have an impact on actual user requests?
Also, is anything more needed for the initial replication of a given namespace? This post seems to imply that sleeping workflows or workflows waiting for signals won’t be directly replicated. I am a bit surprised by this, but at the same time there seems to be a ForceReplicationWorkflow in the Temporal codebase. Do I need to execute this when starting the replication? If so, is there a tctl or temporal admin command for this, or do I somehow start the workflow directly?
Sorry, lots of questions, but I seem to be missing some pieces of the puzzle here. Thanks in advance for any help.
–Hardy