RESOURCE_EXHAUSTED: Too many outstanding requests to the service

Hello,

We are load testing a simple workflow with one activity (which prints to the console) against a single Aurora MySQL database to find the limits of the system. At high QPS, I'm starting to see the errors below in the logs and want to find out how I can scale further. The database is at 50% CPU, so it definitely has plenty of headroom left.

Stack trace:
{"@timestamp":"2021-05-14T08:28:39.799+00:00","trace_id":"9mgL46qlHwuaJkkHfpSe_w==","span_id":"asaGgrywBI4tBd1pQCy27w==","level":"WARN","level_value":30000,"logger_name":"io.temporal.internal.common.GrpcRetryer","thread_name":"pool-16-thread-53","exception":{"class":"io.grpc.StatusRuntimeException","message":"RESOURCE_EXHAUSTED: Too many outstanding requests to the service."},"fingerprint":"5bf31268","message":"Retrying after failure"}
! io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Too many outstanding requests to the service.
! at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
! at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
! at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
! at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.startWorkflowExecution(WorkflowServiceGrpc.java:2614)
! at io.temporal.internal.external.GenericWorkflowClientExternalImpl.lambda$start$0(GenericWorkflowClientExternalImpl.java:88)
! at io.temporal.internal.common.GrpcRetryer.retryWithResult(GrpcRetryer.java:97)
! at io.temporal.internal.external.GenericWorkflowClientExternalImpl.start(GenericWorkflowClientExternalImpl.java:81)
! at io.temporal.internal.sync.WorkflowStubImpl.startWithOptions(WorkflowStubImpl.java:155)
! at io.temporal.internal.sync.WorkflowStubImpl.start(WorkflowStubImpl.java:268)
! at io.temporal.internal.sync.WorkflowInvocationHandler$StartWorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:245)
! at io.temporal.internal.sync.WorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:181)
! at com.sun.proxy.$Proxy130.execute(Unknown Source)
! at io.temporal.internal.sync.WorkflowClientInternal.start(WorkflowClientInternal.java:218)
! at io.temporal.client.WorkflowClient.start(WorkflowClient.java:238)
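The `GrpcRetryer` frame in the trace shows the Java SDK retrying the failed `startWorkflowExecution` call with backoff ("Retrying after failure" in the log). As background, here is a minimal stdlib-only sketch of that exponential-backoff retry pattern; this is an illustration, not the SDK's actual implementation, and names such as `retryWithBackoff`, `maxAttempts`, and `initialBackoffMs` are my own:

```java
import java.util.concurrent.Callable;

public class BackoffRetry {
    // Retry a call up to maxAttempts times, sleeping between attempts
    // with an exponentially growing backoff.
    public static <T> T retryWithBackoff(Callable<T> call, int maxAttempts,
                                         long initialBackoffMs, double coefficient)
            throws Exception {
        long backoffMs = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // e.g. a RESOURCE_EXHAUSTED StatusRuntimeException
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                    backoffMs = (long) (backoffMs * coefficient); // exponential growth
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulate a call that fails twice with RESOURCE_EXHAUSTED, then succeeds.
        int[] calls = {0};
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) throw new RuntimeException("RESOURCE_EXHAUSTED");
            return "started";
        }, 5, 10, 2.0);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point is that these warnings are transient from the client's perspective: the SDK keeps retrying, so the calls eventually succeed, but the frontend is telling you it is shedding load.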

How many shards have you configured for this cluster? How many nodes of each role do you have? What is the CPU utilization of those nodes?

How many shards have you configured for this cluster?

512

How many nodes of each role do you have?

temporal web: 6 pods
temporal frontend: 12 pods
temporal history: 30 pods
temporal matching: 9 pods
temporal worker: 9 pods

What is the CPU utilization of those nodes?

Hard to say exactly, since we are running on top of Kubernetes. When we first started the load test, we were breaching the CPU core limit. After increasing the history service to 30 pods, we now only briefly exceed the 4-core CPU limit, not as badly as before.

Try increasing the following dynamic config values:

frontend.rps
frontend.namespaceRPS
frontend.namespaceCount
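For reference, these keys go in the server's dynamic config YAML file. A sketch of what that could look like; the values below are illustrative placeholders, not tuned recommendations, and the comments paraphrase my understanding of each key:

```yaml
# Dynamic config sketch (illustrative values only)
frontend.rps:
  - value: 4800            # per-frontend-host request rate limit
    constraints: {}
frontend.namespaceRPS:
  - value: 4800            # per-namespace request rate limit
    constraints: {}
frontend.namespaceCount:
  - value: 2400            # per-namespace concurrent request/poller limit
    constraints: {}
```

After changing the file, make sure your deployment actually mounts it into the frontend pods so the new limits take effect.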

Thanks @Wenquan_Xing! Quick question: what are the default values for:

frontend.rps
frontend.namespaceRPS
frontend.namespaceCount