We are experiencing a high number of the exception below being thrown by one of the activities invoked in our workflow while load testing.
nested exception is io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded; nested exception is io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded
Based on what we understand from the documentation, we have increased the values below in our dynamic config. However, that doesn't really solve the rate limit problem.
I would highly appreciate some guidance on this.
Thanks
The namespace rate limit is applied per namespace on each frontend host. Are you configuring frontend.namespaceRPS on each of your frontend hosts? Yes, the default for this dynamic config is 2400.
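For reference, an override of this setting goes in the server's dynamic config YAML, roughly like the following sketch (the value and the namespace constraint shown are only examples, not a recommendation):

frontend.namespaceRPS:
  - value: 4800
    constraints:
      namespace: "your-namespace"

Since the limit is enforced independently on each frontend host, every frontend pod needs to pick up the same override.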
Do you have server metrics enabled? If you do, can you try the following query in Grafana:
sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
Thanks for the info. Just to confirm, did you set the same dynamic config for all of your frontend pods?
Regarding the ListWorkflowExecutions operation, do you have some client code that calls the ListWorkflowExecutions API a lot? How many running executions do you have:
tctl wf count -q "ExecutionStatus='Running'"
Could we also check your visibility latencies, please:
histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))
Thank you for your quick reply Tihomir.
Our use of Temporal involves querying currently running workflows extensively.
I do not have the number of running executions from the time the load test was run, but it would have been in the tens of thousands when the client code called get workflows/query workflow.
During the same run, we also noticed from the client SDK metrics that we had all workflows in cache. Would there be a latency impact if we query a workflow from worker 1 while it is cached on worker 2?
Thank you for your reply. We are not querying from one workflow to another.
This is our use case (we are using the Java SDK); a simplified sketch follows the list:
1. Create a workflow client stub.
2. Start the workflow by invoking the stub's workflow method. Inside the method is an infinite loop, waiting for an exit condition to be satisfied. The stub also exposes query methods and a signal method.
3. When we receive some input, we use the workflowId to get the client stub and then signal the execution. Based on the signal received, we invoke different activities.
4. Periodically, we use the workflowId to get the client stub and then query the execution.
5. Steps 3 and 4 are repeated several times until the exit condition from step 2 is satisfied (we check a flag that might change as a result of different computations based on the signals).
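A minimal sketch of this flow, assuming a recent Java SDK; the interface, task queue, and IDs below are illustrative rather than our actual code:

import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.workflow.QueryMethod;
import io.temporal.workflow.SignalMethod;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

@WorkflowInterface
interface LoadTestWorkflow {
    @WorkflowMethod
    void run();                 // step 2: loops until the exit condition is met

    @SignalMethod
    void onInput(String input); // step 3: signal handler

    @QueryMethod
    String currentState();      // step 4: query handler
}

public class Client {
    public static void main(String[] args) {
        WorkflowClient client =
                WorkflowClient.newInstance(WorkflowServiceStubs.newLocalServiceStubs());

        // Steps 1 and 2: create a stub and start the execution.
        LoadTestWorkflow wf = client.newWorkflowStub(
                LoadTestWorkflow.class,
                WorkflowOptions.newBuilder()
                        .setTaskQueue("load-test-queue")
                        .setWorkflowId("order-123")
                        .build());
        WorkflowClient.start(wf::run);

        // Steps 3 and 4: later, possibly from another worker/client process,
        // look the execution up by workflowId, then signal and query it.
        // Both calls go through the frontend service and count against its limits.
        LoadTestWorkflow handle = client.newWorkflowStub(LoadTestWorkflow.class, "order-123");
        handle.onInput("some input");
        System.out.println(handle.currentState());
    }
}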
These steps are sequential in the life of a workflow execution, meaning that if we start workflow in Worker (client) Process 1, we could signal it and/or query it in the following steps from another Worker process, but always in that order of action.
With this sequence, we are experiencing high latency in the query method, and we are also hitting the namespace rate limit exceeded exception mentioned at the start of this thread.
We start 20 workflows per second for a duration of 20 minutes. Based on the sequence of actions from steps 1 to 5, a workflow's end-to-end latency would be roughly 30 seconds. At the end of the test we would have 24,000 workflows started and completed (20 per second × 1,200 seconds), and we could have hundreds or thousands of executions running at any given time.
How many is multiple? I understand we have similar code in the HelloSignal sample, but if we are talking about a large number of signals (and assuming doStuff is an activity invocation), your workflow code would need to call continueAsNew in order not to reach a very high number of history events (50K max).
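As a rough illustration of that pattern, reusing the illustrative interface from the sketch above (the signal threshold is made up):

import io.temporal.workflow.Workflow;

public class LoadTestWorkflowImpl implements LoadTestWorkflow {
    private static final int MAX_SIGNALS_PER_RUN = 1000; // hypothetical threshold
    private int signalsProcessed;
    private boolean exitConditionMet;

    @Override
    public void run() {
        // Wait until either the business exit condition is met or this run has
        // accumulated enough signals that it should roll over.
        Workflow.await(() -> exitConditionMet || signalsProcessed >= MAX_SIGNALS_PER_RUN);
        if (!exitConditionMet) {
            // Starts a fresh run with empty history; real code would carry the
            // accumulated state forward as workflow arguments.
            Workflow.continueAsNew();
        }
    }

    @Override
    public void onInput(String input) {
        signalsProcessed++;
        // ... react to the signal, e.g. schedule activities ...
    }

    @Override
    public String currentState() {
        return "signalsProcessed=" + signalsProcessed;
    }
}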
Can you check the history length for some of your workflows in test? Via tctl:
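For example, something along the lines of tctl wf describe -w <workflow_id> should include a HistoryLength field in its output (exact flags depend on your tctl version).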
These steps are sequential in the life of a workflow execution, meaning that if we start workflow in Worker (client) Process 1, we could signal it and/or query it in the following steps from another Worker process, but always in that order of action.
So you always signal then query? From your workflow description it is not clear how the large number of signals is part of your application.
@maxim - we will signal the workflow multiple times and query multiple times after those signals.
Two of the activities invoke external services asynchronously, and the listeners for their responses also signal the workflow, by getting a client stub from the workflow ID.
This is how we end up with the number of events, activities, and signals mentioned above.
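In the Java SDK this can be done with, for example, an untyped stub (the handler and variable names here are illustrative, not our exact identifiers):

import io.temporal.client.WorkflowStub;

// In the listener that receives the external service's response:
WorkflowStub stub = client.newUntypedWorkflowStub(workflowId);
stub.signal("onInput", responsePayload); // signal name matches the workflow's @SignalMethod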
Could you check all service requests going through the frontend service for the namespace you're running the workflows on: sum(rate(service_requests{service_name="frontend", namespace="<ns_name>"}[2m])) by (operation)
Service requests for history were roughly the same as for frontend, and matching was roughly double at times (I can only include one image per post as I am still new to the forum).
We had increased the RPS limits for matching and history before the same load test.
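For reference, these host-level overrides are expressed in the same dynamic config YAML format as above; a sketch with placeholder values, not our actual numbers:

history.rps:
  - value: 6000
    constraints: {}
matching.rps:
  - value: 2400
    constraints: {}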