Why is there high latency when starting a workflow?

Hi everyone!

I am currently doing stress testing on our application which uses Temporal for micro-service orchestration.

My application has one workflow, which contains one activity and one async child workflow (which in turn contains one activity).
My worker is configured with these options:

worker.SetStickyWorkflowCacheSize(50000 * 1000 * 10)
workerOptions := worker.Options{
	MaxConcurrentWorkflowTaskPollers:       100,
	MaxConcurrentActivityExecutionSize:     1000,
	MaxConcurrentActivityTaskPollers:       100,
	MaxConcurrentWorkflowTaskExecutionSize: 1000,
}

The Temporal server is started like this, with PostgreSQL.
I start 50 threads that each send requests to create workflows for 10 minutes, and I see high StartWorkflowExecution latency. As a result, request throughput is low.


I read the best practices, but they didn’t help.
Why is RPC latency high?


Can you check your server's sync match rate?

sum(rate(poll_success_sync{namespace="namespace_name"}[1m])) / sum(rate(poll_success{namespace="namespace_name"}[1m]))

and see if it's lower than 95%. This could indicate that you need more workers for the stress load.
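
As a rough illustration of the check above, the ratio and the 95% threshold can be sketched in Go. The function names and the sample per-second rates are hypothetical; in practice the two inputs would be the values of the `poll_success_sync` and `poll_success` rate queries.

```go
package main

import "fmt"

// syncMatchRate returns poll_success_sync / poll_success as a fraction in [0, 1].
// The second return value is false when there were no successful polls,
// in which case the ratio is undefined.
func syncMatchRate(syncPerSec, totalPerSec float64) (float64, bool) {
	if totalPerSec <= 0 {
		return 0, false
	}
	return syncPerSec / totalPerSec, true
}

// needsMoreWorkers reports whether the sync match rate is below 95%,
// the threshold mentioned above. The name is illustrative only.
func needsMoreWorkers(syncPerSec, totalPerSec float64) bool {
	rate, ok := syncMatchRate(syncPerSec, totalPerSec)
	return ok && rate < 0.95
}

func main() {
	// Sample values, not real measurements.
	fmt.Println(needsMoreWorkers(180, 200)) // 0.90 sync match rate: prints true
	fmt.Println(needsMoreWorkers(198, 200)) // 0.99 sync match rate: prints false
}
```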

You can also check the number of persistence requests:
sum(rate(persistence_requests{operation="CreateTask",namespace="namespace_name"}[1m]))

@tihomir thanks for your reply!
Metrics look good except RPC latency.
[Screenshot 2022-09-18 at 18.15.26]


@tihomir
What can I do to reduce latency and find the bottleneck?

It seems the latencies revolve around starting your workflow executions and responding to task completions. In your load test, how many workflows/activities are you starting simultaneously? How many worker processes do you have?

A couple of things to check:

Check service_errors_resource_exhausted to see if you are hitting any RPS limits (also look at your frontend logs).

Check the task_latency_userlatency metric. It measures latencies that can be introduced by your app code; if it is high, it could cause workflow lock contention (this can happen if you start a large number of activities or workflows / child workflows simultaneously).

Along with userlatency, check cache latency, which if high can also indicate that you are starting too many workflows/activities simultaneously:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

Finally, you could also check whether you need to increase the number of shards:
histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))
You would need to rebuild your cluster (and wipe data) each time you change number of shards.


@tihomir how does one bring down the task_latency_userlatency in cases where starting large number of activities/workflows simultaneously is indeed a requirement?

"can be introduced by your app code"

We are spawning 100s of workflows from inside a single activity of a parent workflow, and we have 10s of such activities to concurrently start 1000s of workflows at once. Is this a wrong pattern? Note that these aren’t child workflows

What is the best way to spawn 1000s of sub-workflows from inside a workflow if not via activities?

It is OK to spawn them from activities. But I would recommend adding some rate limiting to these start calls. Instead of starting all of them at once, start some fixed number of them per second.
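
A minimal sketch of that kind of rate limiting, in Go and using only the standard library. The `startFunc` type and `startRateLimited` function are hypothetical names, not part of the Temporal SDK; the callback stands in for whatever call actually starts one workflow (for example, a wrapper around the client's start call inside your activity).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// startFunc is a hypothetical stand-in for the call that starts one workflow.
type startFunc func(workflowID string) error

// startRateLimited issues one start call per ticker tick, so that at most
// perSecond start calls are made each second. It stops at the first error
// and returns how many starts succeeded.
func startRateLimited(ids []string, perSecond int, start startFunc) (int, error) {
	if perSecond <= 0 {
		return 0, errors.New("perSecond must be positive")
	}
	interval := time.Second / time.Duration(perSecond)
	if interval <= 0 {
		return 0, errors.New("perSecond is too large")
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	started := 0
	for _, id := range ids {
		<-ticker.C // wait for the next slot before issuing a start call
		if err := start(id); err != nil {
			return started, fmt.Errorf("start %s: %w", id, err)
		}
		started++
	}
	return started, nil
}

func main() {
	// Illustrative IDs; a real activity would derive these from its input.
	ids := []string{"sub-1", "sub-2", "sub-3", "sub-4", "sub-5"}
	n, err := startRateLimited(ids, 100, func(id string) error {
		fmt.Println("starting", id) // replace with the real start call
		return nil
	})
	fmt.Println(n, err)
}
```

With several such activities running concurrently, the overall start rate is roughly the per-activity rate multiplied by the number of activities, so the per-activity limit should be chosen with that in mind.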