Hello, I am looking for advice or hints on what the following errors indicate we need to increase. This is a Temporal 1.21.2 cluster running in EKS, using Aurora PostgreSQL as the backend.
This metric
sum(rate(service_errors_resource_exhausted[2m])) by (operation,resource_exhausted_cause,service_name)
is showing a high error rate for
{operation="AddActivityTask", resource_exhausted_cause="BusyWorkflow", service_name="matching"}
{operation="AddWorkflowTask", resource_exhausted_cause="BusyWorkflow", service_name="matching"}
{operation="QueryWorkflow", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="RecordActivityTaskStarted", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="RecordChildExecutionCompleted", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="RecordWorkflowTaskStarted", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="ScheduleWorkflowTask", resource_exhausted_cause="BusyWorkflow", service_name="history"}
{operation="StartWorkflowExecution", resource_exhausted_cause="BusyWorkflow", service_name="history"}
At the same time, matching is logging these errors:
{"msg":"history client encountered error","service":"matching","error":"Activity task already started.","service-error-type":"serviceerror.TaskAlreadyStarted","logging-call-at":"metric_client.go:104"}
{"msg":"history client encountered error","service":"matching","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:104"}
and history is logging these:
{"level":"info","ts":"2023-07-20T20:58:18.964Z","msg":"history client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:104"}
{"level":"info","ts":"2023-07-20T20:58:19.566Z","msg":"matching client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:219"}
There are no errors in the frontend.
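For anyone who wants to compare volumes, here is a minimal sketch of how log lines like the ones above could be tallied by service and error type. It assumes the logs are available as one JSON object per line; the function name `tally_errors` and the `sample` list are hypothetical and only for illustration.

```python
import json
from collections import Counter


def tally_errors(lines):
    """Count log entries per (service, service-error-type, error) triple."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        key = (
            entry.get("service"),
            entry.get("service-error-type"),
            entry.get("error"),
        )
        counts[key] += 1
    return counts


# Two of the matching log lines quoted above, used as sample input.
sample = [
    '{"msg":"history client encountered error","service":"matching",'
    '"error":"Activity task already started.",'
    '"service-error-type":"serviceerror.TaskAlreadyStarted",'
    '"logging-call-at":"metric_client.go:104"}',
    '{"msg":"history client encountered error","service":"matching",'
    '"error":"Workflow is busy.",'
    '"service-error-type":"serviceerror.ResourceExhausted",'
    '"logging-call-at":"metric_client.go:104"}',
]

for key, n in tally_errors(sample).most_common():
    print(n, key)
```

Grouping on the triple keeps "Workflow is busy." separate from "Activity task already started.", so the two can be compared per service.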
Dynamic config only overrides the following:
history.rps:
- value: 4000
matching.rps:
- value: 3000
matching.persistenceMaxQPS:
- value: 4000
The frontend, history, and matching services have 2 replicas each, and all of them are well within their CPU and memory capacity (20-30% usage). The Aurora database is an r6i.2xlarge with around 45% CPU usage.