Why is there high latency when starting a workflow?

Hi everyone!

I am currently doing stress testing on our application, which uses Temporal for micro-service orchestration.

I have an application with one workflow that contains one activity and one async child workflow (which itself contains one activity).
My worker is configured with these options:

worker.SetStickyWorkflowCacheSize(50000 * 1000 * 10)
workerOptions := worker.Options{
		MaxConcurrentWorkflowTaskPollers:       100,
		MaxConcurrentActivityExecutionSize:     1000,
		MaxConcurrentActivityTaskPollers:       100,
		MaxConcurrentWorkflowTaskExecutionSize: 1000,
}

The Temporal server is started as described there, with PostgreSQL.
I start 50 threads that send StartWorkflowExecution requests for 10 minutes, and I see high StartWorkflowExecution latency. As a result, request throughput is low.

I read the best practices and they didn't help.
Why is RPC latency high?

Can you check your server sync/match rate?

sum(rate(poll_success_sync{namespace="namespace_name"}[1m])) / sum(rate(poll_success{namespace="namespace_name"}[1m]))

and see if it's lower than 95%. If so, this could indicate you need more workers for the stress load.

Also check the number of persistence requests:
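For example, a query along these lines should work (persistence_requests is the server's persistence counter metric; verify the exact metric name and labels against your server version):

```promql
# Persistence request rate, broken down per operation
sum(rate(persistence_requests[1m])) by (operation)
```

A sustained spike here under load can point to the database being the bottleneck rather than the workers.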

@tihomir thanks for your reply!
Metrics look good except RPC latency.
(Screenshot 2022-09-18 at 18.15.26: metrics dashboard)

What can I do to reduce latency and find the bottleneck?

Seems the latencies revolve around starting your workflow executions and responding to task completions. In your load test how many workflows/activities are you starting simultaneously? How many worker processes do you have?

Couple of things to check:

Check service_errors_resource_exhausted to see if you are hitting any rps limits (also look at your frontend logs).
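A query like the following is one way to do that (the label set varies by server version — newer versions also expose a resource_exhausted_cause label, so adjust as needed):

```promql
# Rate of resource-exhausted errors, broken down per operation
sum(rate(service_errors_resource_exhausted[1m])) by (operation)
```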

Check the task_latency_userlatency metric, which captures latencies introduced by your app code; if high, it can cause workflow lock contention (this can happen if you start a large number of activities or workflows / child workflows simultaneously).
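Following the same pattern as the other histogram queries in this thread, something like this should give you a percentile view (the namespace label value is a placeholder):

```promql
# p95 of user-introduced task latency, per operation
histogram_quantile(0.95, sum(rate(task_latency_userlatency_bucket{namespace="namespace_name"}[1m])) by (le, operation))
```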

Along with userlatency, check cache latency, which if high can also indicate you might be starting too many workflows/activities simultaneously:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

Finally could also check if you may need to increase the number of shards:
histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))
You would need to rebuild your cluster (and wipe its data) each time you change the number of shards.
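For reference, the shard count is set via numHistoryShards in the static server configuration, under the persistence section (the value below is purely illustrative — size it for your expected load, since it cannot be changed later without rebuilding the cluster):

```yaml
# Static Temporal server config (fragment)
persistence:
  numHistoryShards: 512
```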