Namespace Rate Limit Exceeded

Hi,

I’m encountering the RESOURCE_EXHAUSTED: namespace rate limit exceeded error during a load test.

I’m currently testing with a 24 TPS load and have set frontend.globalNamespaceRPS to 50. The namespace rate limit is set higher than the actual load, but the rate limit exceeded error still occurs. I have even tried increasing it to 500, yet the error persists.

From the metrics, RPS limit exceeded happens for {operation="PollWorkflowExecutionHistory", resource_exhausted_cause="RpsLimit"}.

Are there any metrics that measure the actual RPS, so that I can determine the correct value for the namespace RPS? How can the required RPS limit be so much higher than the actual load?

I would appreciate some guidance on this.

The resource exhausted metric is service_errors_resource_exhausted, for example:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

For the overall frontend service RPS, use the service_requests metric, for example:

sum(rate(service_requests{namespace="my_namespace_name"}[1m])) by (namespace)
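To pull these numbers programmatically rather than from a dashboard, the two queries above can be sent to the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable at a hypothetical http://localhost:9090 and that the namespace is called my_namespace_name; it only builds the request URLs and does not send anything:

```python
from urllib.parse import urlencode

# Hypothetical Prometheus address (an assumption); replace with your own.
PROMETHEUS_BASE = "http://localhost:9090"


def instant_query_url(promql: str, base: str = PROMETHEUS_BASE) -> str:
    """Build a URL for Prometheus' instant-query endpoint (/api/v1/query)."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Rate of rejected requests, broken down by operation and cause.
exhausted = (
    "sum(rate(service_errors_resource_exhausted{}[1m])) "
    "by (operation, resource_exhausted_cause)"
)

# Overall frontend request rate for one namespace (name is a placeholder).
requests_by_ns = (
    'sum(rate(service_requests{namespace="my_namespace_name"}[1m])) '
    "by (namespace)"
)

if __name__ == "__main__":
    print(instant_query_url(exhausted))
    print(instant_query_url(requests_by_ns))
```

Fetching the printed URLs with any HTTP client returns the current values as JSON, which can then be compared against the configured limit.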

"have set frontend.globalNamespaceRPS to 50"

This setting controls the RPS for the entire cluster, meaning the limit is distributed across all frontend service instances. How many frontend instances do you deploy?

If you are looking for a per-frontend-host, per-namespace setting, use frontend.namespaceRPS.
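To see how a cluster-wide limit of 50 can still reject requests during a 24 TPS test, it helps to work through the arithmetic. A rough sketch, assuming the global limit is split evenly across frontend hosts and that each workflow generates several frontend calls (polls, task completions, history fetches); the host count and calls-per-workflow figures below are made-up illustrations, not measured values:

```python
def per_host_limit(global_rps: float, frontend_hosts: int) -> float:
    """Share of the cluster-wide limit each frontend host enforces,
    assuming an even split across hosts (an assumption of this sketch)."""
    if frontend_hosts <= 0:
        raise ValueError("frontend_hosts must be positive")
    return global_rps / frontend_hosts


def estimated_frontend_rps(workflow_tps: float, calls_per_workflow: float) -> float:
    """Rough frontend request rate: every workflow results in several API calls,
    so the request rate is a multiple of the workflow start rate."""
    return workflow_tps * calls_per_workflow


if __name__ == "__main__":
    global_rps = 50          # frontend.globalNamespaceRPS from the question
    hosts = 2                # hypothetical number of frontend hosts
    calls_per_workflow = 5   # hypothetical; measure with service_requests

    limit = per_host_limit(global_rps, hosts)
    load = estimated_frontend_rps(24, calls_per_workflow)
    print(f"per-host limit: {limit} rps")
    print(f"estimated cluster-wide request rate: {load} rps")
    print("over limit" if load > global_rps else "within limit")
```

The point of the sketch is that the limit applies to frontend API requests, not to workflow starts, so the value to compare against is the measured service_requests rate rather than the test's TPS.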

"From the metrics, RPS limit exceeded happens for {operation="PollWorkflowExecutionHistory", resource_exhausted_cause="RpsLimit"}"

This is a long-poll API. It is called either by your clients that synchronously wait for execution completion, or by your SDK workers that need to fetch event history in order to replay workflow executions.
On the worker side (SDK metrics), I would look at the temporal_sticky_cache_miss metric to see if it is high. If so, you might need to increase your worker cache size, if possible (that is, if your SDK worker memory usage is not already too high), to reduce how often this API has to be called.