Hi Team, we recently set up a multi-cluster environment and replicated data from one cluster to another with failover. After the failover, we observed high CPU usage, and on further investigation we saw a high number of SignalWorkflowExecution requests failing with serviceerror_NotFound. We are also seeing resource exhausted errors with cause BusyWorkflow. We want to understand if replication under the hood uses Signals in any way? If not, can you please help us understand what the cause could be, or give us any pointers on where to look further?
We want to understand if replication under the hood uses Signals in any way?
No, it doesn't. Shards on the passive cluster poll the active cluster for replication tasks, but the Signal API is not used.
Do you have client and SDK worker metrics configured? If you do, check which pods/containers see a high rate of the temporal_request_failure metric for the operation SignalWorkflowExecution; that could give you an idea of who is sending the signals. If you have an LB/proxy configured between your clients/workers and your frontend services, I would look at its logs to see whether it records the IP address of the caller for SignalWorkflowExecution API calls where the service returns the not_found gRPC code in the response.
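If you don't have dashboards for this yet, here is a rough sketch of the metrics check in Python. It assumes you can grab the Prometheus text-format output from each pod's metrics endpoint yourself, and that the counter is exposed as temporal_request_failure (or with a _total suffix) with an operation label; the exact name and labels depend on your SDK and exporter setup, so adjust as needed. The function name and the pod names are made up for illustration.

```python
import re
from collections import defaultdict

# One Prometheus text-format sample line: name{labels} value [timestamp]
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}\s+(\S+)')
# One label pair inside the braces: key="value"
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def signal_failures_by_pod(scrapes, metric="temporal_request_failure",
                           operation="SignalWorkflowExecution"):
    """Rank pods by failed requests for one operation.

    `scrapes` maps a pod name (hypothetical, supplied by you) to the raw
    text scraped from that pod's metrics endpoint.
    The metric and label names are assumptions; verify them in your setup.
    """
    # Some exporters append "_total" to counters, so accept both spellings.
    names = {metric, metric + "_total"}
    totals = defaultdict(float)
    for pod, text in scrapes.items():
        for raw in text.splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and HELP/TYPE comments
            match = _SAMPLE.match(line)
            if not match or match.group(1) not in names:
                continue
            labels = dict(_LABEL.findall(match.group(2)))
            if labels.get("operation") == operation:
                totals[pod] += float(match.group(3))
    # Highest failure count first: the likely source of the signals.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Pass it something like {"worker-a": scraped_text_a, "worker-b": scraped_text_b} and the pod at the top of the returned list is the first place I would look. These are cumulative counters, so compare two scrapes taken a few minutes apart if you want a rate rather than a lifetime total.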