We are using the TypeScript SDK and are able to scrape Prometheus metrics.
I am running one workflow with 5 activities. A single run completes in 3.335 seconds.
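For context, the metrics come from the SDK's built-in Prometheus endpoint; our setup looks roughly like this (a minimal sketch — the bind address, port, and task queue name are placeholders, not our actual values):

```typescript
import { Runtime, Worker } from '@temporalio/worker';
import * as activities from './activities';

async function main() {
  // Install the Runtime once per process, before creating any Worker,
  // so SDK Core exposes Prometheus metrics on the given address.
  Runtime.install({
    telemetryOptions: {
      metrics: { prometheus: { bindAddress: '0.0.0.0:9464' } },
    },
  });

  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'load-test-queue', // placeholder name
  });
  await worker.run();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```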
Now I am generating load with Grafana k6 and load-testing this workflow.
In total, 2105 executions were spawned. Of those, 1744 succeeded and the rest failed.
Time taken:
- 820 executions completed in under 1 minute
- 333 executions took 1 to 2 minutes
- 304 executions took 2 to 3 minutes
- 249 executions took 3 to 4 minutes
- 38 executions took more than 4 minutes
Reasons for failure (I have set workflowTaskTimeout and the activity startToCloseTimeout to 1 minute; a sketch of where these are configured follows this list):
WorkflowTaskTimedOut
ActivityTaskTimedOut
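For context, these timeouts are configured roughly as follows (a minimal sketch; names like `step1`, `myWorkflow`, and `load-test-queue` are placeholders for our actual code). The activity timeout is set in the workflow:

```typescript
// workflows.ts: the activity startToCloseTimeout is set via proxyActivities
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { step1 } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

export async function myWorkflow(): Promise<void> {
  await step1();
}
```

The workflow task timeout is a start option on the client side (in our case, the service that k6 drives):

```typescript
// starter.ts: workflowTaskTimeout is a WorkflowOptions field at start time
import { Client } from '@temporalio/client';

export async function startOne(client: Client): Promise<void> {
  await client.workflow.start('myWorkflow', {
    taskQueue: 'load-test-queue', // placeholder name
    workflowId: `wf-${Date.now()}`,
    workflowTaskTimeout: '1 minute',
  });
}
```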
We have only one task queue, from which the Worker polls.
I am using the default Worker options; I have not changed anything.
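If it helps, these are the WorkerOptions knobs I understand are relevant here. I have not set any of them yet, so the defaults apply; the option names are real TypeScript SDK fields, but the values below are only illustrative:

```typescript
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

export async function runWorker() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'load-test-queue', // placeholder name
    // Omitted today, so SDK defaults apply. Raising these lets one worker
    // execute more tasks in parallel, at the cost of CPU and memory.
    maxConcurrentActivityTaskExecutions: 200,
    maxConcurrentWorkflowTaskExecutions: 80,
  });
  await worker.run();
}
```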
Observations from the Temporal UI suggest that in the workflows that take a long time, there are large delays between transitions such as ActivityTaskScheduled → ActivityTaskStarted and WorkflowTaskScheduled → WorkflowTaskStarted.
My Questions:
Why is it taking so long? All executions should have completed successfully in under 1 minute.
How do I fix the workflow task and activity task timeouts?
What changes should I make to the Worker options?
Should I make any changes to the task queue? Could the task queue getting flooded be the reason for the slow executions?
Can you describe your use case for the load test? Basically, what do the workflows you are running do? Is it indicative of, or close to, the real use case you intend to run on your cluster?
Persistence latency seems pretty high; it should typically be in the 100s of ms. You might need to look at the size of your DB.
Sync match rate: it looks like we will need to look at your workers next, after we are done with the server side.
Workflow lock contention seems pretty high, as in your single executions seem to have a lot of updates (signals? async activities/child workflows completing at the same time? timers firing at the same time? a combination of all of these?).
Shard lock contention: what numHistoryShards have you set in the static config? It seems we need to try to increase that. How many history hosts are you running? I would first look into this and the persistence latencies and try to reduce those, then move on to the next steps.
(Note: if you change numHistoryShards you need to stand up a new cluster, including the persistence store, since the data needs to be re-indexed.)
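For reference, assuming a Helm-based deployment, the setting lives in the server's static config, roughly like this (the value shown is only an example):

```yaml
# values.yaml (Helm chart) — numHistoryShards is fixed at cluster creation time
server:
  config:
    numHistoryShards: 512
```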
Resource exhausted on BusyWorkflow: this is related to your high workflow lock contention, meaning you are probably starting too many activities/child workflows from a single workflow execution, or your activities might all be heartbeating at a very high rate. The typical recommendation is to start fewer activities/child workflows concurrently, or, if the issue is heartbeats, to raise the heartbeat timeout by some small value and test again.
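To illustrate, here is a minimal sketch of the heartbeat knobs in the TypeScript SDK. `heartbeatTimeout` and `heartbeat()` are real SDK APIs; `longStep`, `processItem`, and the 15-second value are only placeholders:

```typescript
// workflows.ts: heartbeatTimeout is set alongside startToCloseTimeout
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { longStep } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
  heartbeatTimeout: '15 seconds',
});
```

```typescript
// activities.ts: heartbeat from inside the activity implementation
import { heartbeat } from '@temporalio/activity';

export async function longStep(items: string[]): Promise<void> {
  for (const item of items) {
    await processItem(item); // hypothetical per-item work
    heartbeat(); // the SDK throttles these before they reach the server
  }
}

// hypothetical stand-in for the real per-item work
async function processItem(_item: string): Promise<void> {}
```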
Our use case for the load test is to see whether similar dynamic workflows (with different and multiple activities) created by end users will execute efficiently at that load. We won't have dedicated workers for a single workflow; I expect multiple workflows to be executed by a single worker. We will have M workflows, each of which can have N executions. I want to generate maximum traffic with the load test so I can check how Temporal holds up when multiple users are executing multiple workflows.
Here's an observation that might help:
When I run the load test, the initial executions complete within a few seconds (3-4 seconds), but subsequent executions take much longer (maybe the workers are busy).
If I am not wrong, are you talking about the persistence Postgres DB instance where the temporal and temporal_visibility databases are created? If so, please check the following screenshot:
We don't have child workflows or signals as of now, but we do have a number of async activities inside a workflow.
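To make that concrete, our "number of async activities" is essentially an unbounded Promise.all over activity calls. Here is a sketch of the bounded-batch variant suggested above (all names and the batch size are placeholders):

```typescript
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { processItem } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

// Instead of `await Promise.all(items.map(processItem))`, which schedules
// every activity at once, run them in fixed-size batches so a single
// execution never has too many activities in flight.
export async function batchedWorkflow(items: string[]): Promise<void> {
  const batchSize = 10; // illustrative cap
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    await Promise.all(batch.map((item) => processItem(item)));
  }
}
```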
We have numHistoryShards set to 512.
Here is our services configuration:
- History: 5 replicas, CPU: 2000m, Memory: 8Gi
- Worker: 3 replicas, CPU: 500m, Memory: 1Gi
- Frontend: 3 replicas, CPU: 500m, Memory: 1Gi
- Matching: 3 replicas, CPU: 500m, Memory: 1Gi
We are not using child workflows, and we are not yet using activity heartbeats, as I expect the activities to complete in 2-3 seconds.
I have checked the minimal-values.yaml file, in which I found the configurations below for my persistence Postgres databases, temporal and temporal_visibility. Could any of this be responsible for the slow performance? Should I increase anything?