I did load testing of a self-hosted Temporal deployment.
Temporal version: 1.25
Postgres 17 on an m7g.8xlarge, single AZ
Deployed on AWS managed Kubernetes with 12 pods each for the frontend, matching, and history services.
HPA is configured as well to scale horizontally, and I made sure it doesn't hit max replicas.
20 worker pods, with 20k max concurrent activities/workflows and 200 max pollers for activities/workflows.
Load test: a single workflow with 2 empty activities, at 50 rps, 100 rps, 200 rps, and 300 rps.
On average I got 250 ms latency for workflow completion and 80 ms for workflow execute-call-to-schedule. Both of these metrics are custom, not from the metrics emitted by Temporal.
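For context, here is a minimal sketch of the kind of workflow and worker tuning under test, assuming the Go SDK; the task queue name and activity names are illustrative, not the exact code used:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// Two empty activities: no work, just task-queue round trips.
func EmptyActivity1(ctx context.Context) error { return nil }
func EmptyActivity2(ctx context.Context) error { return nil }

// BenchmarkWorkflow executes the two empty activities sequentially.
func BenchmarkWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
	})
	if err := workflow.ExecuteActivity(ctx, EmptyActivity1).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, EmptyActivity2).Get(ctx, nil)
}

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	// Worker tuned roughly like the deployment described above:
	// 20k max concurrent executions, 200 pollers per task type.
	w := worker.New(c, "load-test", worker.Options{
		MaxConcurrentActivityExecutionSize:     20000,
		MaxConcurrentWorkflowTaskExecutionSize: 20000,
		MaxConcurrentActivityTaskPollers:       200,
		MaxConcurrentWorkflowTaskPollers:       200,
	})
	w.RegisterWorkflow(BenchmarkWorkflow)
	w.RegisterActivity(EmptyActivity1)
	w.RegisterActivity(EmptyActivity2)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited", err)
	}
}
```

Since the activities return immediately, the measured latency is dominated by task scheduling, matching, and persistence rather than business logic.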
- Are these good latencies for a workflow with 2 empty activities?
- Could they be improved by using Temporal Cloud? If yes, then what would the latency be with Temporal Cloud?
Thanks 
Can you show:
shard lock latency
histogram_quantile(0.99, sum by (le) (rate(semaphore_latency_bucket{operation="ShardInfo",service_name="history"}[1m])))
service latency
histogram_quantile(0.95, sum(rate(service_latency_bucket{service="frontend"}[1m])) by (operation, le))
db latency
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
resource exhausted errors
sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
Are these good latencies for a workflow with 2 empty activities?
I think it's hard to tell; understanding your DB latency in particular is important, IMO.
Could they be improved by using Temporal Cloud? If yes, then what would the latency be with Temporal Cloud?
The docs here can help, but Cloud would give you higher throughput and lower latencies in general; see "Benchmarking Latency: Temporal Cloud vs. Self-Hosted Temporal" for example.
Graphs attached: db latency, service latency, shard lock latency, resource exhausted errors.
Exclude the PollWorkflowTaskQueue/PollActivityTaskQueue operations from your service latency graph (they are long-poll operations, so they can take up to 70s).
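One possible way to filter them out of the query above (same frontend service label assumed):
histogram_quantile(0.95, sum(rate(service_latency_bucket{service="frontend", operation!~"PollWorkflowTaskQueue|PollActivityTaskQueue"}[1m])) by (operation, le))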
Updated service latency graph attached.