Resource exhausted BusyWorkflow

Hello, I am looking for some advice or hints to figure out what the following errors indicate we need to increase. This is a Temporal 1.21.2 cluster running in EKS, using Aurora PostgreSQL as the backend.
This metric

sum(rate(service_errors_resource_exhausted[2m])) by (operation,resource_exhausted_cause,service_name)

is showing a high error rate for:

{operation="AddActivityTask", resource_exhausted_cause="BusyWorkflow", service_name="matching"}
{operation="AddWorkflowTask", resource_exhausted_cause="BusyWorkflow", service_name="matching"}
{operation="QueryWorkflow", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="RecordActivityTaskStarted", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="RecordChildExecutionCompleted", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="RecordWorkflowTaskStarted", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="ScheduleWorkflowTask", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="StartWorkflowExecution", resource_exhausted_cause="BusyWorkflow", service_name="history"}

and at the same time, matching is throwing errors such as:

{"msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metric_client.go:104"}                           
{"msg":"history client encountered error","service":"matching","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:104"} 

and history is logging:

{"level":"info","ts":"2023-07-20T20:58:18.964Z","msg":"history client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:104"}                                          
{"level":"info","ts":"2023-07-20T20:58:19.566Z","msg":"matching client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:219"}

There are no errors in the frontend.
Dynamic config only overrides the following:

    history.rps:
      - value: 4000

    matching.rps:
      - value: 3000

    matching.persistenceMaxQPS:
      - value: 4000

Frontend, history, and matching have 2 replicas each, and all of them are well within their CPU and memory capacity (20-30%). The Aurora database is an r6i.2xlarge at around 45% CPU usage.

Hello @vcardenas, were you able to solve these issues? If so, what was the issue and how did you solve it? Thanks

Seeing a similar phenomenon

BusyWorkflow

A resource exhausted error with cause BusyWorkflow means the workflow lock could not be acquired in time (500ms). Each update to a workflow execution is done under a per-execution lock.

Can you please check the latency of the HistoryCacheGetOrCreate operation:

histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

and see if the latencies align with your resource exhausted graph.
Typically this can happen if you schedule a very large number of activities/child workflows at once, and/or a large number of activities that all heartbeat at a high rate.
If that's the case, it's recommended to start a smaller number of activities/child workflows per single workflow execution, or to spread them across multiple workflow executions if possible.
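
For illustration, here is a rough Go SDK sketch of that idea. The ProcessItem activity, the batch size of 10, and the timeouts are placeholders to adapt to your workload: instead of scheduling every activity up front, the workflow fans out one bounded batch and waits for it to finish before starting the next, so fewer concurrent task completions contend for the same execution's lock.

package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// ProcessItem is a placeholder activity; substitute your own.
func ProcessItem(ctx context.Context, item string) (string, error) {
	return item, nil
}

// BatchedWorkflow schedules activities in bounded batches instead of all at
// once, so fewer concurrent completions hit the same workflow lock.
func BatchedWorkflow(ctx workflow.Context, items []string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
		// A heartbeat timeout also lets the SDK throttle how often heartbeat
		// calls are actually sent to the server.
		HeartbeatTimeout: time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	const batchSize = 10 // placeholder; tune for your workload

	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}

		// Fan out one bounded batch, then wait for it before starting the next.
		futures := make([]workflow.Future, 0, end-start)
		for _, item := range items[start:end] {
			futures = append(futures, workflow.ExecuteActivity(ctx, ProcessItem, item))
		}
		for _, f := range futures {
			var result string
			if err := f.Get(ctx, &result); err != nil {
				return err
			}
		}
	}
	return nil
}

The same batching approach applies if you are fanning out child workflows with workflow.ExecuteChildWorkflow.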

We are also seeing "msg":"history client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted".
We are creating only one child workflow per parent workflow. What other factors can we look at to reduce workflow mutable state cache lock latencies?