Is it a problem to set workflowCacheSize > maxWorkflowThreadCount to deal with high latency for GetWorkflowExecution?

In PROD, during peak periods when there are a lot of tasks to process, we noticed that the latency for “GetWorkflowExecution” jumped far above the 20-50ms we see outside peak periods.

We recently increased shard count from 512 to 8196 and increased our MySQL specs from 16C to 32C but the latency spikes didn’t improve much.

On the worker side, we’ve already reached the upper limit for active thread count and we cannot increase max concurrent settings for activity and workflow tasks any further without triggering alarms at Infra layer.

However, we noticed that memory usage is still very low. Hence, we tried to increase workflowCacheSize to 3x maxWorkflowThreadCount and we saw StateTransition/second improved.

We are planning to increase workflowCacheSize further to help reduce the load on the DB for GetWorkflowExecution. I just want to double-check whether there are any serious downsides to this approach, since the official documentation says:

workflowCacheSize should be ≤ maxWorkflowThreadCount. Each Workflow has at least one Workflow thread.

In addition, if you have some ideas on why GetWorkflowExecution latency could spike so high despite having more shards + better specs OR how to reduce GetWorkflowExecution latency, I’d really love to hear them too! :slight_smile:

“why GetWorkflowExecution latency could spike so high”

From server metrics, can you show your service_requests metric by operation and see which operations spike during this time?
Also check service_errors_resource_exhausted by operation and resource_exhausted_cause at those times.

From worker metrics, look at the temporal_request_failure and temporal_long_request_failure metrics by operation and status_code during the spikes too.

“since official documentation said that”

Yeah, keep cache size <= max thread count, and if you need to increase it, increase both (keep them at the same value).
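
To make that rule concrete, here is a minimal sketch in plain Java. It assumes the Java SDK, where these two settings live on the worker factory options; `WorkerSizing` and its methods are hypothetical helpers written for illustration and are not part of the Temporal SDK. It only checks the "cache size <= max thread count" relation and shows the "increase both together" adjustment.

```java
// Hypothetical helper, not part of the Temporal SDK: it only encodes the
// sizing rule discussed above so it can be checked before building options.
final class WorkerSizing {
    private WorkerSizing() {}

    /** True when the cache size does not exceed the workflow thread limit. */
    static boolean followsGuidance(int workflowCacheSize, int maxWorkflowThreadCount) {
        return workflowCacheSize <= maxWorkflowThreadCount;
    }

    /**
     * "Increase both, keep them the same": returns {cacheSize, maxThreads},
     * both set to the larger of the two inputs.
     */
    static int[] raiseTogether(int desiredCacheSize, int currentMaxThreads) {
        if (desiredCacheSize <= 0 || currentMaxThreads <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        int value = Math.max(desiredCacheSize, currentMaxThreads);
        return new int[] {value, value};
    }

    public static void main(String[] args) {
        // Example numbers are made up: a cache 3x the current thread limit.
        int[] sizes = raiseTogether(1800, 600);
        System.out.println("workflowCacheSize=" + sizes[0]
                + ", maxWorkflowThreadCount=" + sizes[1]);
        // In the Java SDK these values would then be passed to
        // WorkerFactoryOptions.newBuilder()
        //     .setWorkflowCacheSize(sizes[0])
        //     .setMaxWorkflowThreadCount(sizes[1])
    }
}
```

Whether the raised thread limit is acceptable still depends on the infra-side thread alarms mentioned earlier in the thread, so treat the numbers as placeholders.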

Hi @tihomir, when you have time to spare, can you describe in more detail what would happen behind the scenes if cache size > max thread count? What problems does this cause specifically?

During the same periods, above are the operations that spiked above 3000.

We hit some RPS limit for polling and had some trouble with AddWorkflowTask.

As for temporal_request_failure and temporal_long_request_failure, there was no spike at all during this period.