In PROD, during peak periods when there are a lot of tasks to process, we noticed that the latency for “GetWorkflowExecution” spikes far above the 20-50ms we see outside peak periods.
We recently increased the shard count from 512 to 8196 and upgraded our MySQL specs from 16C to 32C, but the latency spikes didn’t improve much.
On the worker side, we’ve already reached the upper limit for active thread count, and we cannot increase the max concurrent settings for activity and workflow tasks any further without triggering alarms at the infra layer.
However, we noticed that memory usage is still very low. Hence, we tried increasing workflowCacheSize to 3x maxWorkflowThreadCount, and we saw StateTransition/second improve.
We are planning to increase workflowCacheSize further to help reduce the load on the DB from GetWorkflowExecution. I just want to double-check whether there are any serious downsides to this approach, since the official documentation says:
“workflowCacheSize should be ≤ maxWorkflowThreadCount. Each Workflow has at least one Workflow thread.”
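For context, here is a minimal sketch of where these two settings live, assuming the Temporal Java SDK (the post does not name the SDK, but the option names match its `WorkerFactoryOptions`). The thread count of 600 is a placeholder rather than our real value, and the class name and local-server connection are only there to make the fragment complete.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;

public class WorkerCacheConfig {
    public static void main(String[] args) {
        // Placeholder value (assumption): the actual thread limit is not given in the post.
        int maxWorkflowThreadCount = 600;
        // The setup described above: cache size set to 3x the workflow thread limit.
        int workflowCacheSize = 3 * maxWorkflowThreadCount;

        // Connects to a locally running Temporal server; a real deployment
        // would point at its own frontend address instead.
        WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
        WorkflowClient client = WorkflowClient.newInstance(service);

        // Both settings are factory-wide, i.e. shared by all workers created from this factory.
        WorkerFactoryOptions options = WorkerFactoryOptions.newBuilder()
                .setMaxWorkflowThreadCount(maxWorkflowThreadCount)
                .setWorkflowCacheSize(workflowCacheSize)
                .build();

        WorkerFactory factory = WorkerFactory.newInstance(client, options);
        // Workers would be created and registered here before calling factory.start().
    }
}
```

With the cache larger than the thread limit, not every cached workflow can hold a thread at the same time, which is the tension with the documented recommendation quoted above.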
In addition, if you have any ideas on why GetWorkflowExecution latency could spike so high despite having more shards and better specs, or on how to reduce GetWorkflowExecution latency, I’d really love to hear them too!