Why is there a high latency to start a workflow

andrey · September 15, 2022, 1:02pm

Hi everyone!

I am currently doing stress testing on our application which uses Temporal for micro-service orchestration.

The application has one workflow that contains one activity and one async child workflow (which itself contains one activity).
My worker is configured with these options:

// Process-wide sticky workflow cache size; must be set before any worker starts.
worker.SetStickyWorkflowCacheSize(50000 * 1000 * 10)
workerOptions := worker.Options{
	MaxConcurrentWorkflowTaskPollers:       100,
	MaxConcurrentActivityExecutionSize:     1000,
	MaxConcurrentActivityTaskPollers:       100,
	MaxConcurrentWorkflowTaskExecutionSize: 1000,
}

The Temporal server runs with PostgreSQL for persistence.
I start 50 threads that each send workflow-creation requests for 10 minutes, and I see high StartWorkflowExecution latency. As a result, request throughput is low.
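
For reference, the load pattern described above can be sketched roughly as follows. This is a minimal sketch, not the actual test harness: `startFunc`, `runLoad`, `percentile` and `demo` are hypothetical names, and the fake start function only sleeps. In a real run it would wrap the Temporal client call that starts the workflow.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
	"time"
)

// startFunc is a hypothetical stand-in for one start-workflow request.
type startFunc func(ctx context.Context) error

// runLoad runs `workers` goroutines that call start in a loop until the
// duration elapses. It returns the latencies of successful calls, sorted.
func runLoad(workers int, duration time.Duration, start startFunc) []time.Duration {
	ctx, cancel := context.WithTimeout(context.Background(), duration)
	defer cancel()

	var (
		mu        sync.Mutex
		latencies []time.Duration
		wg        sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				begin := time.Now()
				err := start(ctx)
				elapsed := time.Since(begin)
				if err != nil {
					continue // failed calls are not counted as latency samples
				}
				mu.Lock()
				latencies = append(latencies, elapsed)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	return latencies
}

// percentile returns the p-th percentile (1..100) of sorted latencies,
// using the nearest-rank method.
func percentile(sorted []time.Duration, p int) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	rank := (len(sorted)*p + 99) / 100 // ceil(len * p / 100)
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// demo runs a short load test against a fake start call that takes about 2ms.
func demo() (calls int, p50, p99 time.Duration) {
	fakeStart := func(ctx context.Context) error {
		time.Sleep(2 * time.Millisecond)
		return nil
	}
	lat := runLoad(50, 200*time.Millisecond, fakeStart)
	return len(lat), percentile(lat, 50), percentile(lat, 99)
}

func main() {
	calls, p50, p99 := demo()
	fmt.Printf("calls=%d p50=%v p99=%v\n", calls, p50, p99)
}
```

Recording per-call latency on the client side like this makes it possible to compare what the load generator observes with the server-side RPC latency metrics.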

I read the best practices, but they didn’t help.
Why is the RPC latency high?

tihomir · September 15, 2022, 10:05pm

Can you check your server sync/match rate?

sum(rate(poll_success_sync{namespace="namespace_name"}[1m])) / sum(rate(poll_success{namespace="namespace_name"}[1m]))

and see if it's lower than 95%. A lower rate could indicate that you need more workers for the stress load.

You can also check the number of persistence requests:
sum(rate(persistence_requests{operation="CreateTask",namespace="namespace_name"}[1m]))

andrey · September 18, 2022, 2:17pm

@tihomir thanks for your reply!
The metrics look good except for RPC latency.
[Screenshot 2022-09-18 at 18.15.26]

andrey · September 19, 2022, 7:26am

@tihomir
What can I do to reduce latency and find the bottleneck?

tihomir · September 19, 2022, 1:49pm

It seems the latencies revolve around starting your workflow executions and responding to task completions. In your load test, how many workflows/activities are you starting simultaneously? How many worker processes do you have?

A couple of things to check:

Check service_errors_resource_exhausted to see if you are hitting any RPS limits (also look at your frontend logs).

Check the task_latency_userlatency metric, which measures latency that can be introduced by your app code; if it is high, it can cause workflow lock contention (this can happen if you start a large number of activities or workflows / child workflows simultaneously).

Along with userlatency, check cache latency, which, if high, can also indicate that you are starting too many workflows/activities simultaneously:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

Finally, you could also check whether you need to increase the number of shards:
histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))
Note that you would need to rebuild your cluster (and wipe its data) each time you change the number of shards.

Dhiraj_Bhakta · January 22, 2023, 2:59pm

@tihomir how does one bring down task_latency_userlatency when starting a large number of activities/workflows simultaneously is indeed a requirement? You mentioned that this latency "can be introduced by your app code".

We are spawning 100s of workflows from inside a single activity of a parent workflow, and we have 10s of such activities, so 1000s of workflows are started concurrently. Is this a wrong pattern? Note that these aren’t child workflows.

What is the best way to spawn 1000s of sub-workflows from inside a workflow if not via activities?

maxim · January 22, 2023, 5:45pm

It is OK to spawn them from activities. But I would recommend adding some rate limiting to these start calls. Instead of starting all of them at once, start some fixed number of them per second.
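
A minimal sketch of that idea in Go, using only the standard library. The names `StartFunc`, `StartAtFixedRate` and `runDemo` are hypothetical, and the fake start function does nothing. Inside a real activity, the start function would wrap a call such as `client.ExecuteWorkflow` from the Go SDK.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// StartFunc starts one workflow. It is a hypothetical stand-in: inside a real
// activity it would wrap a Temporal client start call for the given workflow ID.
type StartFunc func(ctx context.Context, workflowID string) error

// StartAtFixedRate calls start once per ID, spaced so that at most perSecond
// calls begin per second. It stops at the first error or when ctx is done,
// and returns how many starts succeeded.
func StartAtFixedRate(ctx context.Context, ids []string, perSecond int, start StartFunc) (int, error) {
	if perSecond <= 0 {
		return 0, fmt.Errorf("perSecond must be positive, got %d", perSecond)
	}
	interval := time.Second / time.Duration(perSecond)
	if interval <= 0 {
		return 0, fmt.Errorf("perSecond %d is too large", perSecond)
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	started := 0
	for _, id := range ids {
		// Wait for the next tick before every start, including the first.
		select {
		case <-ctx.Done():
			return started, ctx.Err()
		case <-ticker.C:
		}
		if err := start(ctx, id); err != nil {
			return started, fmt.Errorf("start %s: %w", id, err)
		}
		started++
	}
	return started, nil
}

// runDemo starts five fake workflows at 50 per second (one every 20ms).
func runDemo() (int, time.Duration, error) {
	ids := []string{"wf-1", "wf-2", "wf-3", "wf-4", "wf-5"}
	fakeStart := func(ctx context.Context, workflowID string) error { return nil }

	begin := time.Now()
	started, err := StartAtFixedRate(context.Background(), ids, 50, fakeStart)
	return started, time.Since(begin), err
}

func main() {
	started, elapsed, err := runDemo()
	fmt.Printf("started=%d elapsed=%v err=%v\n", started, elapsed, err)
}
```

Because such a loop can run for a long time when there are 1000s of IDs, the activity that hosts it would likely also need to heartbeat (for example with `activity.RecordHeartbeat`) and use a suitably long timeout.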
