RESOURCE_EXHAUSTED: Too many outstanding requests to the service

Hello,

We are load testing a simple workflow with one activity (which prints to the console) against a single Aurora MySQL database to find the limits of the system. At high QPS, I'm starting to see the errors below in the logs and want to find out how I can scale further. The database is at 50% CPU, so it definitely has plenty of headroom left.

Stack trace:
{"@timestamp":"2021-05-14T08:28:39.799+00:00","trace_id":"9mgL46qlHwuaJkkHfpSe_w==","span_id":"asaGgrywBI4tBd1pQCy27w==","level":"WARN","level_value":30000,"logger_name":"io.temporal.internal.common.GrpcRetryer","thread_name":"pool-16-thread-53","exception":{"class":"io.grpc.StatusRuntimeException","message":"RESOURCE_EXHAUSTED: Too many outstanding requests to the service."},"fingerprint":"5bf31268","message":"Retrying after failure"}
! io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Too many outstanding requests to the service.
! at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
! at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
! at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
! at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.startWorkflowExecution(WorkflowServiceGrpc.java:2614)
! at io.temporal.internal.external.GenericWorkflowClientExternalImpl.lambda$start$0(GenericWorkflowClientExternalImpl.java:88)
! at io.temporal.internal.common.GrpcRetryer.retryWithResult(GrpcRetryer.java:97)
! at io.temporal.internal.external.GenericWorkflowClientExternalImpl.start(GenericWorkflowClientExternalImpl.java:81)
! at io.temporal.internal.sync.WorkflowStubImpl.startWithOptions(WorkflowStubImpl.java:155)
! at io.temporal.internal.sync.WorkflowStubImpl.start(WorkflowStubImpl.java:268)
! at io.temporal.internal.sync.WorkflowInvocationHandler$StartWorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:245)
! at io.temporal.internal.sync.WorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:181)
! at com.sun.proxy.$Proxy130.execute(Unknown Source)
! at io.temporal.internal.sync.WorkflowClientInternal.start(WorkflowClientInternal.java:218)
! at io.temporal.client.WorkflowClient.start(WorkflowClient.java:238)
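The `GrpcRetryer` frame in the trace shows the Java SDK retrying the failed `startWorkflowExecution` call with backoff ("Retrying after failure" in the log). As background, here is a minimal stdlib-only sketch of that exponential-backoff retry pattern; this is an illustration, not the SDK's actual implementation, and names such as `retryWithBackoff`, `maxAttempts`, and `initialBackoffMs` are my own:

```java
import java.util.concurrent.Callable;

public class BackoffRetry {
    // Retry a call up to maxAttempts times, sleeping between attempts
    // with an exponentially growing backoff.
    public static <T> T retryWithBackoff(Callable<T> call, int maxAttempts,
                                         long initialBackoffMs, double coefficient)
            throws Exception {
        long backoffMs = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // e.g. a RESOURCE_EXHAUSTED StatusRuntimeException
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                    backoffMs = (long) (backoffMs * coefficient); // exponential growth
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulate a call that fails twice with RESOURCE_EXHAUSTED, then succeeds.
        int[] calls = {0};
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) throw new RuntimeException("RESOURCE_EXHAUSTED");
            return "started";
        }, 5, 10, 2.0);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point is that these warnings are transient from the client's perspective: the SDK keeps retrying, so the calls eventually succeed, but the frontend is telling you it is shedding load.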

How many shards have you configured for this cluster? How many nodes of each role do you have? What is the CPU utilization of those nodes?

How many shards have you configured for this cluster?

512

How many nodes of each role do you have?

temporal web: 6 pods
temporal frontend: 12 pods
temporal history: 30 pods
temporal matching: 9 pods
temporal worker: 9 pods

What is the CPU utilization of those nodes?

Hard to say exactly, since we are running on top of Kubernetes. When we first started the load test, we were breaching the CPU core limit. After increasing the history service to 30 pods, we now only briefly exceed the 4-core CPU limit, not as badly as before.

Try increasing the following dynamic config values:

frontend.rps
frontend.namespaceRPS
frontend.namespaceCount
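For reference, these keys go in the server's dynamic config YAML file. A sketch of what that could look like; the values below are illustrative placeholders, not tuned recommendations, and the comments paraphrase my understanding of each key:

```yaml
# Dynamic config sketch (illustrative values only)
frontend.rps:
  - value: 4800            # per-frontend-host request rate limit
    constraints: {}
frontend.namespaceRPS:
  - value: 4800            # per-namespace request rate limit
    constraints: {}
frontend.namespaceCount:
  - value: 2400            # per-namespace concurrent request/poller limit
    constraints: {}
```

After changing the file, make sure your deployment actually mounts it into the frontend pods so the new limits take effect.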

Thanks @Wenquan_Xing! Quick question: what are the default values for:

frontend.rps
frontend.namespaceRPS
frontend.namespaceCount