Service rate limit exceeded

I deployed the Temporal cluster in k8s, with two instances of each service. When I tried to test concurrent performance with maru, I encountered an error.
The frontend service logs are:

{"level":"info","ts":"2022-12-03T02:33:52.531Z","msg":"matching client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:217"}

The history service logs are:

{"level":"error","ts":"2022-12-05T02:27:29.087Z","msg":"Fail to process task","shard-id":461,"address":"10.244.1.112:17234","component":"visibility-queue-processor","wf-namespace-id":"2dc8a5ed-aa72-4801-8045-16b95767bd56","wf-id":"basic-workflow-2e24fcdf-c3e9-4629-8c16-21205d1e6f46-0-184-1","wf-run-id":"64d8fb96-9b04-4d74-9baf-7bfd21cbc8a9","queue-task-id":5243799,"queue-task-visibility-timestamp":"2022-12-05T02:27:24.182Z","queue-task-type":"VisibilityStartExecution","queue-task":{"NamespaceID":"2dc8a5ed-aa72-4801-8045-16b95767bd56","WorkflowID":"basic-workflow-2e24fcdf-c3e9-4629-8c16-21205d1e6f46-0-184-1","RunID":"64d8fb96-9b04-4d74-9baf-7bfd21cbc8a9","VisibilityTimestamp":"2022-12-05T02:27:24.182222572Z","TaskID":5243799,"Version":0},"wf-history-event-id":0,"error":"shard status unknown","lifecycle":"ProcessingFailed","logging-call-at":"lazy_logger.go:68","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/log.(*lazyLogger).Error\n\t/home/builder/temporal/common/log/lazy_logger.go:68\ngo.temporal.io/server/service/history/queues.(*executableImpl).HandleErr\n\t/home/builder/temporal/service/history/queues/executable.go:289\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:226\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:170\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:171\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:235\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:211"}
{"level":"error","ts":"2022-12-05T02:27:29.087Z","msg":"Operation failed with internal error.","error":"GetVisibilityTasks operation failed. Select failed. Error: context deadline exceeded","metric-scope":20,"logging-call-at":"persistenceMetricClients.go:1461","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1461\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:439\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:366\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:370\ngo.temporal.io/server/service/history.(*visibilityQueueProcessorImpl).readTasks\n\t/home/builder/temporal/service/history/visibilityQueueProcessor.go:336\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/home/builder/temporal/service/history/queueAckMgr.go:123\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/home/builder/temporal/service/history/queueProcessorBase.go:237\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/home/builder/temporal/service/history/queueProcessorBase.go:196"}
{"level":"error","ts":"2022-12-05T02:27:29.087Z","msg":"Processor unable to retrieve tasks","shard-id":415,"address":"10.244.1.112:17234","component":"visibility-queue-processor","error":"context deadline exceeded","logging-call-at":"queueProcessorBase.go:239","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_l

What is causing this problem, and how do I solve it?

frontend service - “service rate limit exceeded”

Try increasing the dynamic config frontend.namespaceRPS (default 2400), for example:

frontend.namespaceRPS:
- value: <your_rps_limit>
  constraints: {}
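
If you deploy with the official Helm chart, overrides like this usually go into the dynamic config file the chart mounts for the server pods; a minimal sketch, assuming the chart exposes it under server.dynamicConfig in values.yaml (verify against your chart version, and the value below is only illustrative):

server:
  dynamicConfig:
    frontend.namespaceRPS:
    - value: 4800
      constraints: {}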

For the history and matching service RPS limits:

history.rps (default 3000)
matching.rps (default 1200)
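
These can be raised the same way in dynamic config; a sketch with placeholder values (not recommendations):

history.rps:
- value: <your_history_rps_limit>
  constraints: {}
matching.rps:
- value: <your_matching_rps_limit>
  constraints: {}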

@tihomir

Should frontend.rps be specified together with frontend.namespaceRPS?

Also, I see an error like this:

{"level":"info","ts":"2023-06-22T15:32:32.358Z","msg":"history client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metricClient.go:638"}

Does it indicate that the frontend service cannot call the history service due to rate limit?

I have increased the RPS parameters a lot but still get plenty of “service rate limit exceeded” errors after a few minutes. Restarting the history pods solves the problem temporarily.

Should frontend.rps be specified together with frontend.namespaceRPS?

No, both have a default value (2400) if not specified.

Does it indicate that the frontend service cannot call the history service due to rate limit?

This is the RPS limit for the history service, which receives requests from the frontend and matching services.

I have increased the RPS parameters a lot but still get plenty of “service rate limit exceeded” errors after a few minutes. Restarting the history pods solves the problem temporarily.

I would check which operation(s) are causing you to hit the history.rps limit:


sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

There is only one type of service_errors_resource_exhausted: activity retry - DescribeWorkflowExecution, cause: Unspecified.

It's a dev-zone cluster without much traffic. It's really confusing how the RPS limit is reached almost immediately.

frontend.namespaceRPS:
  - value: 24000
    constraints: {}
history.namespaceRPS:
  - value: 24000
    constraints: {}
frontend.rps:
  - value: 100000
    constraints: {}
history.rps:
  - value: 100000
    constraints: {}

Does this look right?

{"message":"8 RESOURCE_EXHAUSTED: service rate limit exceeded. method: describeWorkflow, req: xxx"}

In the UI, some workflows (not all) cannot be loaded and fail with the error above.

There is only one type of service_errors_resource_exhausted: activity retry - DescribeWorkflowExecution, cause: Unspecified.

Can you share the graph for mentioned Grafana query please?

Does this look right?

I don't think history.namespaceRPS is a dynamic config property. You can set history.rps, which is per history host (default 3000).
Also, just FYI, you should not set frontend.namespaceRPS to be greater than frontend.rps.
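
Putting that together, a corrected dynamic config could look like the sketch below, reusing the values already mentioned in this thread and keeping frontend.namespaceRPS at or below frontend.rps:

frontend.rps:
  - value: 100000
    constraints: {}
frontend.namespaceRPS:
  - value: 24000
    constraints: {}
history.rps:
  - value: 100000
    constraints: {}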

But in the frontend service logs, the errors keep happening. I captured the last 5 minutes of logs for your information.

I will remove history.namespaceRPS.
frontend.rps is 100k and frontend.namespaceRPS is 24k.

BTW, we are still on the old Temporal UI. The behavior is that some workflows can be opened and some cannot due to this service rate limit issue. Also, sometimes the task queue page is empty, which was solved by adding more worker service nodes.

Thanks for sharing, which server version are you running?


tctl adm cl d | jq .serverVersion

v1.16.2

It could be that our EKS cluster runs more than one Temporal cluster, as described in Prevent incorrect service discovery with multiple Temporal clusters · Issue #1234 · temporalio/temporal · GitHub.

After switching ports, the problem is gone.

Hi Jerry, I'm also performing load testing on our Temporal cluster deployed in k8s and trying to fine-tune resource allocation (CPU/memory) and the Temporal dynamic configuration.

dynamic_config.yaml

    frontend.rps:
    - value: 100000
      constraints: {}
    frontend.namespaceRPS:
    - value: 50000
      constraints: {}
    frontend.namespaceCount:
    - value: 50000
      constraints: {}
    history.rps:
    - value: 100000
      constraints: {}
    matching.rps:
    - value: 100000
      constraints: {}
    matching.numTaskqueueReadPartitions:
    - value: 5
      constraints: {}
    matching.numTaskqueueWritePartitions:
    - value: 5
      constraints: {}

Temporal service resources:

      resources:
        requests:
          memory: 4Gi
          cpu: 1
        limits:
          memory: 4Gi
          cpu: 1

I still see history resource-exhausted errors (with the BusyWorkflow cause) and deadline-exceeded errors in the metrics, and the following errors in the logs:

{"level":"error","ts":"2023-07-07T11:19:52.452Z","msg":"Fail to process task","shard-id":296,"address":"172.16.61.209:7234","component":"transfer-queue-processor","wf-namespace-id":"833e2f1f-25dc-4682-ae0a-2031f8882387","wf-id":"basic-workflow-36-0-1-225","wf-run-id":"8d5e06a3-b149-4d4a-996e-466b0c49fef9","queue-task-id":36700283,"queue-task-visibility-timestamp":"2023-07-07T11:19:50.124Z","queue-task-type":"TransferActivityTask","queue-task":{"NamespaceID":"833e2f1f-25dc-4682-ae0a-2031f8882387","WorkflowID":"basic-workflow-36-0-1-225","RunID":"8d5e06a3-b149-4d4a-996e-466b0c49fef9","VisibilityTimestamp":"2023-07-07T11:19:50.124869485Z","TaskID":36700283,"TaskQueue":"temporal-basic-act","ScheduledEventID":17,"Version":0},"wf-history-event-id":17,"error":"context deadline exceeded","lifecycle":"ProcessingFailed","logging-call-at":"lazy_logger.go:68","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/log.(*lazyLogger).Error\n\t/home/builder/temporal/common/log/lazy_logger.go:68\ngo.temporal.io/server/service/history/queues.(*executableImpl).HandleErr\n\t/home/builder/temporal/service/history/queues/executable.go:344\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:224\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:119\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:145\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:120\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:233\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:211"}

Which ports did you switch?

Can you share more details about your current configuration?

Thanks

The problem we had was that another team deployed Temporal on the same EKS cluster as ours, so their nodes were recognized as part of our cluster as well. From that moment, we experienced a lot of rate-limit issues. After switching the membership ports (i.e., FRONTEND_MEMBERSHIP_PORT and HISTORY_MEMBERSHIP_PORT, set via environment variables) to different values, the issue was immediately gone.
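
For reference, a minimal sketch of what that change looks like in the k8s manifests, assuming the membership ports are set through the environment variables named above (the port numbers are arbitrary examples, and the matching and worker services would likely need analogous variables):

# frontend deployment (sketch)
env:
  - name: FRONTEND_MEMBERSHIP_PORT
    value: "7933"   # example non-default port
# history deployment (sketch)
env:
  - name: HISTORY_MEMBERSHIP_PORT
    value: "7935"   # example non-default port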