Resource exhausted BusyWorkflow

Hello, I am looking for some advice or hints to figure out what the following errors indicate we need to increase. This is a Temporal 1.21.2 cluster running in EKS, using Aurora PostgreSQL as the backend.
This metric

sum(rate(service_errors_resource_exhausted[2m])) by (operation,resource_exhausted_cause,service_name)

is showing a high error rate for:

{operation="AddActivityTask", resource_exhausted_cause="BusyWorkflow", service_name="matching"}
{operation="AddWorkflowTask", resource_exhausted_cause="BusyWorkflow", service_name="matching"}
{operation="QueryWorkflow", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="RecordActivityTaskStarted", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="RecordChildExecutionCompleted", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="RecordWorkflowTaskStarted", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="ScheduleWorkflowTask", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="StartWorkflowExecution", resource_exhausted_cause="BusyWorkflow", service_name="history"}

and at the same time, matching is throwing errors such as:

{"msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metric_client.go:104"}                           
{"msg":"history client encountered error","service":"matching","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:104"} 

and history is logging:

{"level":"info","ts":"2023-07-20T20:58:18.964Z","msg":"history client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:104"}                                          
{"level":"info","ts":"2023-07-20T20:58:19.566Z","msg":"matching client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:219"}

There are no errors in the frontend.
Dynamic config only overrides the following:

    history.rps:
      - value: 4000

    matching.rps:
      - value: 3000

    matching.persistenceMaxQPS:
      - value: 4000

Frontend, history, and matching have 2 replicas each, and all of them are well within their CPU and memory capacity (20-30%). The Aurora database is an r6i.2xlarge at around 45% CPU usage.

Hello @vcardenas, were you able to solve these issues? If so, what was the issue and how did you solve it? Thanks

Seeing a similar phenomenon

BusyWorkflow

A resource exhausted error with cause BusyWorkflow means the workflow lock could not be acquired in time (500ms). Each update to a workflow execution is done under a per-execution lock.

Can you please check the latency of the HistoryCacheGetOrCreate operation:

histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

and see if the latencies align with your resource exhausted graph.
Typically this can happen if you schedule a very large number of activities/child workflows at once, and/or a large number of activities that all heartbeat at a high rate.
If that's the case, it's recommended to start a smaller number of activities/child workflows per single workflow execution, or to spread them across multiple workflow executions if possible.
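
For illustration, here is a rough Go SDK sketch of that idea. The ProcessItem activity, the batch size of 10, and the timeouts are placeholders to adapt to your workload: instead of scheduling every activity up front, the workflow fans out one bounded batch and waits for it to finish before starting the next, so fewer concurrent task completions contend for the same execution's lock.

package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// ProcessItem is a placeholder activity; substitute your own.
func ProcessItem(ctx context.Context, item string) (string, error) {
	return item, nil
}

// BatchedWorkflow schedules activities in bounded batches instead of all at
// once, so fewer concurrent completions hit the same workflow lock.
func BatchedWorkflow(ctx workflow.Context, items []string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
		// A heartbeat timeout also lets the SDK throttle how often heartbeat
		// calls are actually sent to the server.
		HeartbeatTimeout: time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	const batchSize = 10 // placeholder; tune for your workload

	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}

		// Fan out one bounded batch, then wait for it before starting the next.
		futures := make([]workflow.Future, 0, end-start)
		for _, item := range items[start:end] {
			futures = append(futures, workflow.ExecuteActivity(ctx, ProcessItem, item))
		}
		for _, f := range futures {
			var result string
			if err := f.Get(ctx, &result); err != nil {
				return err
			}
		}
	}
	return nil
}

The same batching approach applies if you are fanning out child workflows with workflow.ExecuteChildWorkflow.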

We are also seeing "msg":"history client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted".
We are creating only one child workflow per parent workflow. What other factors can we look at to reduce workflow mutable state cache lock latencies?