Service rate limit exceeded

I deployed the Temporal cluster in k8s, with two instances of each service. When I tried to test concurrent performance with maru, I encountered an error.
The frontend service logs are:

{"level":"info","ts":"2022-12-03T02:33:52.531Z","msg":"matching client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:217"}

The history service logs are:

{"level":"error","ts":"2022-12-05T02:27:29.087Z","msg":"Fail to process task","shard-id":461,"address":"10.244.1.112:17234","component":"visibility-queue-processor","wf-namespace-id":"2dc8a5ed-aa72-4801-8045-16b95767bd56","wf-id":"basic-workflow-2e24fcdf-c3e9-4629-8c16-21205d1e6f46-0-184-1","wf-run-id":"64d8fb96-9b04-4d74-9baf-7bfd21cbc8a9","queue-task-id":5243799,"queue-task-visibility-timestamp":"2022-12-05T02:27:24.182Z","queue-task-type":"VisibilityStartExecution","queue-task":{"NamespaceID":"2dc8a5ed-aa72-4801-8045-16b95767bd56","WorkflowID":"basic-workflow-2e24fcdf-c3e9-4629-8c16-21205d1e6f46-0-184-1","RunID":"64d8fb96-9b04-4d74-9baf-7bfd21cbc8a9","VisibilityTimestamp":"2022-12-05T02:27:24.182222572Z","TaskID":5243799,"Version":0},"wf-history-event-id":0,"error":"shard status unknown","lifecycle":"ProcessingFailed","logging-call-at":"lazy_logger.go:68","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/log.(*lazyLogger).Error\n\t/home/builder/temporal/common/log/lazy_logger.go:68\ngo.temporal.io/server/service/history/queues.(*executableImpl).HandleErr\n\t/home/builder/temporal/service/history/queues/executable.go:289\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:226\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:170\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:171\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:235\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:211"}
{"level":"error","ts":"2022-12-05T02:27:29.087Z","msg":"Operation failed with internal error.","error":"GetVisibilityTasks operation failed. Select failed. Error: context deadline exceeded","metric-scope":20,"logging-call-at":"persistenceMetricClients.go:1461","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1461\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:439\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:366\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:370\ngo.temporal.io/server/service/history.(*visibilityQueueProcessorImpl).readTasks\n\t/home/builder/temporal/service/history/visibilityQueueProcessor.go:336\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/home/builder/temporal/service/history/queueAckMgr.go:123\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/home/builder/temporal/service/history/queueProcessorBase.go:237\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/home/builder/temporal/service/history/queueProcessorBase.go:196"}
{"level":"error","ts":"2022-12-05T02:27:29.087Z","msg":"Processor unable to retrieve tasks","shard-id":415,"address":"10.244.1.112:17234","component":"visibility-queue-processor","error":"context deadline exceeded","logging-call-at":"queueProcessorBase.go:239","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_l

What is causing this problem, and how do I solve it?

frontend service - “service rate limit exceeded”

Try increasing the dynamic config frontend.namespaceRPS (default 2400), for example:

frontend.namespaceRPS:
- value: <your_rps_limit>
  constraints: {}
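
If you deploy with the official Helm chart, overrides like this usually go into the dynamic config file the chart mounts for the server pods; a minimal sketch, assuming the chart exposes it under server.dynamicConfig in values.yaml (verify against your chart version, and the value below is only illustrative):

server:
  dynamicConfig:
    frontend.namespaceRPS:
    - value: 4800
      constraints: {}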

For the history and matching service RPS limits:

history.rps (default 3000)
matching.rps (default 1200)
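
These can be raised the same way in dynamic config; a sketch with placeholder values (not recommendations):

history.rps:
- value: <your_history_rps_limit>
  constraints: {}
matching.rps:
- value: <your_matching_rps_limit>
  constraints: {}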

@tihomir

Should frontend.rps be specified together with frontend.namespaceRPS?

Also, I see an error like this:

{"level":"info","ts":"2023-06-22T15:32:32.358Z","msg":"history client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metricClient.go:638"}

Does it indicate that the frontend service cannot call the history service due to rate limit?

I have increased the RPS parameters a lot but still get plenty of “service rate limit exceeded” errors after a few minutes. Restarting the history pods solves the problem temporarily.

Should frontend.rps be specified together with frontend.namespaceRPS?

No, both have a default value (2400) if not specified.

Does it indicate that the frontend service cannot call the history service due to rate limit?

This is the RPS limit for the history service, which receives requests from the frontend and matching services.

I have increased the RPS parameters a lot but still get plenty of “service rate limit exceeded” errors after a few minutes. Restarting the history pods solves the problem temporarily.

I would check which operation(s) are causing you to hit the history.rps limit:


sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

There is only one type of service_errors_resource_exhausted: activity retry - DescribeWorkflowExecution, cause: Unspecified.

It's a dev-zone cluster without much traffic. It's really confusing how the RPS limit is reached almost immediately.

frontend.namespaceRPS:
  - value: 24000
    constraints: {}
history.namespaceRPS:
  - value: 24000
    constraints: {}
frontend.rps:
  - value: 100000
    constraints: {}
history.rps:
  - value: 100000
    constraints: {}

Does this look right?

{"message":"8 RESOURCE_EXHAUSTED: service rate limit exceeded. method: describeWorkflow, req: xxx"}

In the UI, some workflows (not all) cannot be loaded and fail with the error above.

There is only one type of service_errors_resource_exhausted: activity retry - DescribeWorkflowExecution, cause: Unspecified.

Can you share the graph for mentioned Grafana query please?

Does this look right?

I don't think history.namespaceRPS is a dynamic config property. You can set history.rps, which is per history host (default 3000).
Also, just FYI, you should not set frontend.namespaceRPS to be greater than frontend.rps.
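
Putting that together, a corrected dynamic config could look like the sketch below, reusing the values already mentioned in this thread and keeping frontend.namespaceRPS at or below frontend.rps:

frontend.rps:
  - value: 100000
    constraints: {}
frontend.namespaceRPS:
  - value: 24000
    constraints: {}
history.rps:
  - value: 100000
    constraints: {}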

But in the frontend service logs, the errors keep happening. I captured the last 5 minutes of logs for your information.

I will remove history.namespaceRPS.
frontend.rps is 100k and frontend.namespaceRPS is 24k.

BTW, we are still on the old Temporal UI. The behavior is that some workflows can be opened and some cannot due to this service rate limit issue. Also, sometimes the task queue page is empty, which was solved by adding more worker service nodes.

Thanks for sharing, which server version are you running?


tctl adm cl d | jq .serverVersion

v1.16.2

It could be that our EKS cluster runs more than one Temporal cluster, as described in Prevent incorrect service discovery with multiple Temporal clusters · Issue #1234 · temporalio/temporal · GitHub.

After switching ports, the problem is gone.

Hi Jerry, I'm also performing load testing on our Temporal cluster deployed in k8s and trying to fine-tune resource allocation (CPU/memory) and the Temporal dynamic configuration.

dynamic_config.yaml

    frontend.rps:
    - value: 100000
      constraints: {}
    frontend.namespaceRPS:
    - value: 50000
      constraints: {}
    frontend.namespaceCount:
    - value: 50000
      constraints: {}
    history.rps:
    - value: 100000
      constraints: {}
    matching.rps:
    - value: 100000
      constraints: {}
    matching.numTaskqueueReadPartitions:
    - value: 5
      constraints: {}
    matching.numTaskqueueWritePartitions:
    - value: 5
      constraints: {}

Temporal service resources:

      resources:
        requests:
          memory: 4Gi
          cpu: 1
        limits:
          memory: 4Gi
          cpu: 1

I still see history resource-exhausted errors (with the BusyWorkflow cause) and deadline-exceeded errors in the metrics, and the following errors in the logs:

{"level":"error","ts":"2023-07-07T11:19:52.452Z","msg":"Fail to process task","shard-id":296,"address":"172.16.61.209:7234","component":"transfer-queue-processor","wf-namespace-id":"833e2f1f-25dc-4682-ae0a-2031f8882387","wf-id":"basic-workflow-36-0-1-225","wf-run-id":"8d5e06a3-b149-4d4a-996e-466b0c49fef9","queue-task-id":36700283,"queue-task-visibility-timestamp":"2023-07-07T11:19:50.124Z","queue-task-type":"TransferActivityTask","queue-task":{"NamespaceID":"833e2f1f-25dc-4682-ae0a-2031f8882387","WorkflowID":"basic-workflow-36-0-1-225","RunID":"8d5e06a3-b149-4d4a-996e-466b0c49fef9","VisibilityTimestamp":"2023-07-07T11:19:50.124869485Z","TaskID":36700283,"TaskQueue":"temporal-basic-act","ScheduledEventID":17,"Version":0},"wf-history-event-id":17,"error":"context deadline exceeded","lifecycle":"ProcessingFailed","logging-call-at":"lazy_logger.go:68","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/log.(*lazyLogger).Error\n\t/home/builder/temporal/common/log/lazy_logger.go:68\ngo.temporal.io/server/service/history/queues.(*executableImpl).HandleErr\n\t/home/builder/temporal/service/history/queues/executable.go:344\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:224\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:119\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:145\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:120\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:233\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:211"}

Which ports did you switch?

Can you share more details about your current configuration?

Thanks

The problem we had was that another team deployed Temporal on the same EKS cluster as ours, so their nodes were recognized as part of our cluster as well. From that moment, we experienced a lot of rate-limit issues. After switching the membership ports (i.e., FRONTEND_MEMBERSHIP_PORT and HISTORY_MEMBERSHIP_PORT, set via environment variables) to different values, the issue was immediately gone.
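
For reference, a minimal sketch of what that change looks like in the k8s manifests, assuming the membership ports are set through the environment variables named above (the port numbers are arbitrary examples, and the matching and worker services would likely need analogous variables):

# frontend deployment (sketch)
env:
  - name: FRONTEND_MEMBERSHIP_PORT
    value: "7933"   # example non-default port
# history deployment (sketch)
env:
  - name: HISTORY_MEMBERSHIP_PORT
    value: "7935"   # example non-default port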