We are currently doing stress testing on our application, which uses Temporal for microservice orchestration.
Our current setup is a GKE cluster as defined below -
2 T2D nodes (32 cores, 64 GB memory each) hosting
8 application pods - each pod with 50 workflow pollers, 50 activity pollers, maxConcurrentWorkflowTaskExecutionSize=200, maxConcurrentActivityTaskExecutionSize=200, maxWorkflowThreadCount=600, workflowCacheSize=600
4 frontend pods, 8 history pods, 4 matching pods, 1 worker pod, 8 matching queue partitions, 4K shards
All application and Temporal pods run on the same 2 nodes mentioned above.
5 T2D nodes (16 cores, 64 GB memory each) hosting 5 Cassandra pods, with replication factor 3 and LOCAL_QUORUM consistency. All application and database nodes are in the same zone.
We are executing a workflow with 1 local activity & 3 activities. Our microservices contribute up to 150ms of the overall end-to-end latency. The observed end-to-end workflow latencies are as follows -
10 TPS - avg=247.5ms min=232.38ms med=244.97ms max=466.14ms p(90)=255.34ms p(95)=257.52ms
50 TPS - avg=237.03ms min=220ms med=232.71ms max=645.33ms p(90)=244ms p(95)=252.37ms
100 TPS - avg=235.01ms min=214.71ms med=231.74ms max=633.84ms p(90)=245.9ms p(95)=254.31ms
150 TPS - avg=243.06ms min=216.82ms med=233.8ms max=731.17ms p(90)=260.86ms p(95)=275.8ms
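As a sanity check on the numbers above, the gap between the 10 TPS average and our ~150ms of service time gives the orchestration overhead per scheduling hop. The hop count below is our assumption for this workflow shape (1 local activity + 3 activities), not a measured value:

```python
# Rough estimate of Temporal overhead per scheduling hop at 10 TPS.
# Assumption: 3 remote activities produce roughly 7 hops per execution
# (4 workflow tasks + 3 activity dispatches; the local activity runs
# inside an existing workflow task). Adjust to your actual history.

avg_e2e_ms = 247.5   # measured average at 10 TPS
service_ms = 150.0   # our microservices' own contribution (upper bound)
hops = 7             # assumed scheduling hops per execution

overhead_ms = avg_e2e_ms - service_ms   # ~97.5 ms of orchestration overhead
per_hop_ms = overhead_ms / hops         # ~14 ms per hop
print(f"total overhead ~= {overhead_ms:.1f} ms, per hop ~= {per_hop_ms:.1f} ms")
```

At low load the per-hop overhead is in the low tens of milliseconds, so the seconds-long end-to-end numbers at 250 TPS cannot be explained by normal per-hop cost.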
We noticed latency degradation beyond this point -
200 TPS - avg=262.14ms min=220.5ms med=246.04ms max=891.28ms p(90)=298.95ms p(95)=352.7ms
It starts to bottleneck hard at 250 TPS - avg=1.33s min=222.38ms med=261.42ms max=11.25s p(90)=5.26s p(95)=5.69s
Workflow schedule-to-start latency, workflow task execution latency, activity schedule-to-start latency, and activity execution latency all look healthy in the metrics below. But as you can see, the end-to-end latency is in seconds.
From the traces, the gap between the end of one runActivity and the start of the next startActivity grows wildly as we increase throughput. That gap should correspond to workflow task schedule-to-start plus workflow task execution latency, yet both of those metrics stay within a few milliseconds.
Service latencies and persistence latencies also look fine -
As the latencies grow, we see the sync match rate also drop sharply.
We are also seeing some sticky cache evictions, but we are unsure whether they are causing the bottleneck or the bottleneck is causing them. Still, at 250 TPS spread across 8 application pods, each with a cache size of 600, I don't see the cache as the cause.
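To put numbers on that: by Little's law, the number of concurrently open workflow executions is roughly arrival rate x execution duration, which is far below the per-pod cache size. A back-of-envelope sketch with the figures from above (the ~300ms duration is our pre-degradation estimate, not a measurement):

```python
# Concurrent workflow executions in flight vs. sticky cache capacity.
tps = 250            # workflow starts per second at the bottleneck
duration_s = 0.30    # assumption: ~300 ms per execution before degradation
pods = 8
cache_per_pod = 600

in_flight = tps * duration_s   # ~75 executions open at any instant
per_pod = in_flight / pods     # ~9-10 per pod, vs a cache of 600
```

Even if durations blow out to ~5s under degradation, that is ~1250 in flight, ~156 per pod, still under the 600-entry cache, so evictions look like a symptom rather than the cause.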
We have tried tweaking all the params suggested by the team, but we still hit this bottleneck at 250 TPS with all pods underutilized and nodes at only ~50% CPU utilization. We need your help debugging this: we want to scale throughput without latency deterioration up to 60% node utilization, and we plan to add more nodes to scale the setup horizontally. We also need to understand why there is degradation between 150 and 200 TPS and tune that as well.
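For reference, here is the per-partition task rate on matching and the poller occupancy we estimate at the 250 TPS knee. The tasks-per-execution counts and the poll round-trip time are assumptions on our part, not measurements:

```python
# Estimated task load on matching at the 250 TPS knee.
tps = 250
wf_tasks_per_exec = 4    # assumption: start + 3 activity completions
act_tasks_per_exec = 3   # one per remote activity

wf_task_rate = tps * wf_tasks_per_exec    # ~1000 workflow tasks/s
act_task_rate = tps * act_tasks_per_exec  # ~750 activity tasks/s

partitions = 8
per_partition = (wf_task_rate + act_task_rate) / partitions  # ~219 tasks/s/partition

# Little's law again: pollers kept busy = task rate x poll round-trip time.
pollers = 8 * 50       # 400 workflow pollers across the application pods
poll_rtt_s = 0.010     # assumption: 10 ms matching round trip
busy_pollers = wf_task_rate * poll_rtt_s   # ~10 of 400 busy on average
```

If these assumptions are roughly right, pollers are nowhere near saturated, which matches the underutilized pods we observe; the ~219 tasks/s per matching partition is where we would appreciate guidance on whether 8 partitions is enough.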