RESOURCE_EXHAUSTED: namespace rate limit exceeded

Hi,

We are seeing a high number of occurrences of the exception below, thrown in one of the activities invoked in our workflow, while load testing.

nested exception is io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded; nested exception is io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded

Based on what we understand from the documentation, we have increased the values below in our dynamic config. However, that doesn't really solve the rate limit problem.

I would highly appreciate some guidance on this.
Thanks

# MatchingRPS, default is 1200
matching.rps:
- value: 76800
  constraints: {}
# HistoryRPS, default is 3000
history.rps:
- value: 76800
  constraints: {}
# FrontendRPS, default is 2400
frontend.rps:
- value: 76800
  constraints: {}
# FrontendMaxNamespaceRPSPerInstance, default is 2400
frontend.namespaceRPS:
- value: 76800
  constraints: {}
# FrontendMaxNamespaceCountPerInstance, default is 1200
frontend.namespaceCount:
- value: 9600
  constraints: {}

Our cluster looks like this:


History: 
- 20 replicas
- resource requests: 2 CPU, 4G Mem
- resource limits: 4 CPU, 6G Mem

Matching:
- 8 replicas
- resource requests: 2 CPU, 1G Mem
- resource limits: 4 CPU, 1G Mem

Worker:
- 8 replicas
- resource requests: 1 CPU, 1G Mem
- resource limits: 1 CPU, 1G Mem

Frontend:
- 12 replicas
- resource requests: 2 CPU, 2G Mem
- resource limits: 4 CPU, 2G Mem

RESOURCE_EXHAUSTED: namespace rate limit exceeded

The namespace rate limit is a rate limit per namespace, per frontend instance. Are you configuring frontend.namespaceRPS on each of your frontend hosts? Yes, the default for this dynamic config is 2400.

Do you have server metrics enabled? If you do, can you try the following query in Grafana:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)

Thank you for your reply Tihomir.

I did increase the frontend.namespaceRPS. Below is a screenshot of the metric.
frontend.namespaceRPS: [screenshot]

Thanks for the info. Just to confirm, did you set the same dynamic config for all of your frontend pods?

Regarding the ListWorkflowExecutions operation, do you have client code that calls the ListWorkflowExecutions API a lot? How many running executions do you have:

tctl wf count -q "ExecutionStatus='Running'"

Could you also check your visibility latencies, please:

histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))

Thank you for your quick reply Tihomir.
Our use case for Temporal involves querying currently running workflows extensively.
I do not have the number of running executions from the time the load test was run, but it would be in the tens of thousands when the client code called the get workflow / query workflow operations.

During the same run, we also noticed from client SDK metrics that we had all workflows in cache. Would there be a latency impact if we query a workflow from worker 1 when it is in worker 2's cache?

@TSlaoui Would you elaborate on your use case? Why do you need to query extensively? It is rarely a good pattern for one workflow to query another.

@maxim -

Thank you for your reply. We are not querying from one workflow to another.

This is our use case (we are using Java SDK):

  1. Create workflow client stub
  2. Start workflow by invoking the stub’s workflow method.
    • Inside the method is an infinite loop, waiting for an exit condition to be satisfied.
    • The client stub implements query methods and a signal method.
  3. When we receive some input, we use workflowId to get the client stub and then signal the execution.
    • Based on the signal received, we invoke different activities.
  4. Periodically, we use workflowID to get the client stub and then query the execution.
  5. Steps 3 and 4 are repeated several times until the exit condition from step 2 is satisfied (we check a flag that might change as a result of different computations based on the signals).

These steps are sequential in the life of a workflow execution, meaning that if we start workflow in Worker (client) Process 1, we could signal it and/or query it in the following steps from another Worker process, but always in that order of action.

With this sequence, we are experiencing high latency in the query method, and we are also experiencing the namespace rate limit exceeded exception mentioned at the start of this thread.

We start 20 workflows per second for a duration of 20 minutes. Based on the sequence of actions from steps 1 to 5, a workflow's end-to-end latency would be roughly 30 seconds. At the end of the test we would have 24,000 workflows (started and completed), and we could have hundreds or thousands of executions running at a time.
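For reference, the client-side interaction for steps 1 to 4 looks roughly like the sketch below (simplified; MyWorkflow, the task queue name, and the method names onInput/getState are placeholders for our actual ones):

  import io.temporal.client.WorkflowClient;
  import io.temporal.client.WorkflowOptions;

  void driveFlow(WorkflowClient client, String workflowId, String request) {
    // Steps 1 and 2: create a typed stub and start the workflow.
    MyWorkflow stub = client.newWorkflowStub(
        MyWorkflow.class,
        WorkflowOptions.newBuilder()
            .setTaskQueue("my-task-queue")      // placeholder task queue name
            .setWorkflowId(workflowId)
            .build());
    WorkflowClient.start(stub::startFlow);

    // Step 3: later, possibly from another worker/client process, signal by workflowId.
    MyWorkflow handle = client.newWorkflowStub(MyWorkflow.class, workflowId);
    handle.onInput(request);                    // @SignalMethod (placeholder name)

    // Step 4: periodically query the same execution by workflowId.
    String state = handle.getState();           // @QueryMethod (placeholder name)
  }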

Inside the method is an infinite loop, waiting for an exit condition to be satisfied.

Can you show this code please? You should most likely use Workflow.await in this scenario.

@tihomir - yes that is correct

The workflow method looks like the code below.

Eventually, after multiple signals, the completed flag will be true and the workflow method will return.

  public void startFlow() {
    initStuff();
    while (true) {
      // Block until a signal has enqueued a request or the exit condition is satisfied.
      Workflow.await(() -> !queue.isEmpty() || completed);
      if (completed) {
        return;
      }
      // Each dequeued request drives one or more activity invocations.
      var request = queue.remove();
      doStuff(request);
    }
  }
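The fields and handlers backing this method look roughly like the following (simplified; names are placeholders for our actual ones, and the types are java.util.Queue/ArrayDeque):

  private final Queue<String> queue = new ArrayDeque<>();
  private boolean completed = false;            // set as a result of the computations driven by signals (see step 5)
  private String currentState = "STARTED";

  public void onInput(String request) {         // the @SignalMethod from the workflow interface
    queue.add(request);                         // wakes up Workflow.await(...) in startFlow
  }

  public String getState() {                    // one of the @QueryMethods; fast and side-effect free
    return currentState;
  }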

after multiple signals

How many is multiple? I understand we have similar code in the HelloSignal sample, but if we are talking about a large number of signals (assuming doStuff is an activity invocation), your workflow code would need to call continueAsNew in order not to reach a very high # of history events (50K max).
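Not saying you necessarily need it here, but as a rough illustration (the counter and the threshold below are made up, and any accumulated state would normally be passed as arguments to the new run), the loop could continue-as-new after a bounded number of handled signals:

  public void startFlow() {
    initStuff();
    int handled = 0;                            // illustrative counter
    while (true) {
      Workflow.await(() -> !queue.isEmpty() || completed);
      if (completed) {
        return;
      }
      var request = queue.remove();
      doStuff(request);
      handled++;
      // Illustrative threshold: start a fresh run before history grows too large,
      // once there is no pending work left to drain.
      if (handled >= 1000 && queue.isEmpty()) {
        Workflow.continueAsNew();               // same workflow type, fresh event history
      }
    }
  }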

Can you check the history length for some of your workflows in test? Via tctl:

tctl wf desc -w <wfid> -r <runid> | jq .workflowExecutionInfo.historyLength

@tihomir

your workflow code would need to call continueAsNew in order not to reach a very high # of history events (50K max)

The number of events in this test is 105, with 11 activities and 9 signals.

(assuming doStuff is an activity invocation)

That is correct; based on the current state and the input from the signal, we invoke activities.

These steps are sequential in the life of a workflow execution, meaning that if we start workflow in Worker (client) Process 1, we could signal it and/or query it in the following steps from another Worker process, but always in that order of action.

So you always signal and then query? From your workflow description it is not clear how the large number of signals fits into your application.

@maxim - we will signal the workflow multiple times and query multiple times after those signals.

Two of the activities invoke external services asynchronously, and the listeners for their responses signal the workflow as well, by getting a client stub from the workflow ID.

This is how we end up with said number of events, activities and signals.
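The listener simply resolves the stub by workflow ID and signals it, roughly like this (names are placeholders, and client is an existing WorkflowClient):

  // Called when the external service responds asynchronously.
  void onExternalResponse(String workflowId, String response) {
    MyWorkflow handle = client.newWorkflowStub(MyWorkflow.class, workflowId);
    handle.onInput(response);                   // same @SignalMethod as the other signals
  }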

Could you check all service requests going through the frontend service for the namespace you're running the workflows on:

sum(rate(service_requests{service_name="frontend", namespace="<ns_name>"}[2m])) by (operation)

I think we need to find out which operation(s) are contributing to reaching the namespace rate limit.

Also, are you setting frontend.rps in dynamic config (default 2400)? This is the overall frontend RPS limit.

Could you check all service requests going through the frontend service for the namespace you're running the workflows on: sum(rate(service_requests{service_name="frontend", namespace="<ns_name>"}[2m])) by (operation)

frontend: [screenshot]

Also, are you setting frontend.rps in dynamic config (default 2400)? This is the overall frontend RPS limit.

Yes, below is from our dynamic config.

frontend.rps:
- value: 76800
  constraints: {}
frontend.namespaceRPS:
- value: 76800
  constraints: {}

Service requests for history were roughly the same as for frontend, and matching was roughly double at times (I can only include one image per post as I am still new to the forum).

We had increased the matching and history RPS before the same load test.

matching.rps:
- value: 76800
  constraints: {}
history.rps:
- value: 76800
  constraints: {}