Is there a problem if workflowCacheSize > maxWorkflowThreadCount to deal with high latency for GetWorkflowExecution

JamesTran · March 15, 2025, 9:43am

In PROD, during peak periods when there are a lot of tasks to do, we noticed the latency for “GetWorkflowExecution” jumped very high compared to 20-50ms outside peak periods.

We recently increased shard count from 512 to 8196 and increased our MySQL specs from 16C to 32C but the latency spikes didn’t improve much.

On the worker side, we’ve already reached the upper limit for active thread count, and we cannot increase the max concurrent settings for activity and workflow tasks any further without triggering alarms at the infra layer.

However, we noticed that memory usage is still very low. Hence, we tried to increase workflowCacheSize to 3x maxWorkflowThreadCount and we saw StateTransition/second improved.

We are planning to increase workflowCacheSize further to help reduce the load on the DB for GetWorkflowExecution. I just want to double-check whether there are any serious downsides to this approach, since the official documentation says:

workflowCacheSize should be ≤ maxWorkflowThreadCount. Each Workflow has at least one Workflow thread.

In addition, if you have some ideas on why GetWorkflowExecution latency could spike so high despite having more shards + better specs OR how to reduce GetWorkflowExecution latency, I’d really love to hear them too!

tihomir · March 16, 2025, 10:40pm

why GetWorkflowExecution latency could spike so high

from server metrics can you show your service_requests metric by operation and see what operations spike during this time?
check also service_errors_resource_exhausted by operation and resource_exhausted_cause at those times

from worker metrics, look at temporal_request_failure and temporal_long_request_failure metrics by operation and status_code during spikes too

since official documentation said that

yeah, keep cache size <= max thread count, and if you need to increase them, increase both (keep them at the same value)
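
A minimal sketch of that rule, assuming the Java SDK. `WorkerSizing` is a hypothetical helper made up for illustration, not part of the SDK, and 1200 is an arbitrary example value. It only checks that the cache size does not exceed the thread count and produces a matched pair when both are raised together. The comment at the end shows where the two values would normally be passed to `WorkerFactoryOptions`.

```java
/** Hypothetical helper (not part of the Temporal SDK) that keeps the two settings in step. */
final class WorkerSizing {
    final int workflowCacheSize;
    final int maxWorkflowThreadCount;

    private WorkerSizing(int workflowCacheSize, int maxWorkflowThreadCount) {
        this.workflowCacheSize = workflowCacheSize;
        this.maxWorkflowThreadCount = maxWorkflowThreadCount;
    }

    /** True when the cache size is positive and does not exceed the thread count. */
    static boolean isConsistent(int workflowCacheSize, int maxWorkflowThreadCount) {
        return workflowCacheSize > 0 && workflowCacheSize <= maxWorkflowThreadCount;
    }

    /** Raises both settings together by giving them the same value. */
    static WorkerSizing matched(int target) {
        if (target <= 0) {
            throw new IllegalArgumentException("target must be positive");
        }
        return new WorkerSizing(target, target);
    }

    public static void main(String[] args) {
        WorkerSizing sizing = WorkerSizing.matched(1200); // example value only
        System.out.println("cache=" + sizing.workflowCacheSize
                + " threads=" + sizing.maxWorkflowThreadCount
                + " consistent=" + isConsistent(sizing.workflowCacheSize,
                                                sizing.maxWorkflowThreadCount));
        // With the SDK on the classpath, the values would be applied roughly like this:
        // WorkerFactoryOptions.newBuilder()
        //     .setWorkflowCacheSize(sizing.workflowCacheSize)
        //     .setMaxWorkflowThreadCount(sizing.maxWorkflowThreadCount)
        //     .build();
    }
}
```

A configuration like the one described earlier in the thread (cache at 3x the thread count) would fail the `isConsistent` check, for example `isConsistent(1800, 600)`.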

JamesTran · March 17, 2025, 9:26am

Hi @tihomir, when you have time to spare, can you describe in more detail what would happen behind the scenes if cache size > max thread count? What problems does this bring specifically?

During the same periods, above are the operations that spiked above 3000.

We hit some RPS limit for polling and had some trouble with AddWorkflowTask.

As for temporal_request_failure and temporal_long_request_failure, there’s no spike at all during this period.
