How to get the best Temporal performance?

We have deployed Temporal using the Helm chart and we are using PostgreSQL as the persistence DB.
Here is our services configuration:

History: Replica: 5, CPU: 2000m, Memory: 8 Gi
Worker: Replica: 3, CPU: 500m, Memory: 1 Gi
Frontend: Replica: 3, CPU: 500m, Memory: 1 Gi
Matching: Replica: 3, CPU: 500m, Memory: 1 Gi

Worker configuration:
Worker: (Replicas: Min: 1, Max:5, Desired: 2), CPU: 750m, Memory: 1000 Mi

We are using the TypeScript SDK and we are able to scrape Prometheus metrics.
I am running one workflow with 5 activities. When I run it once, it completes in 3.335 seconds.
Now, I am generating load using Grafana k6 and load-testing this workflow.
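For reference, metrics are exposed from the worker roughly like this (a minimal sketch; the bind address is illustrative):

// Worker process setup: expose SDK metrics for Prometheus to scrape.
import { Runtime } from '@temporalio/worker';

Runtime.install({
  telemetryOptions: {
    // Illustrative bind address; Prometheus scrapes this endpoint.
    metrics: { prometheus: { bindAddress: '0.0.0.0:9464' } },
  },
});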

A total of 2105 executions were spawned. Of those, 1744 were successful and the rest failed.
Time taken:

  • 820 executions completed in less than a minute
  • 333 executions took 1 to 2 minutes
  • 304 executions took 2 to 3 minutes
  • 249 executions took 3 to 4 minutes
  • 38 executions took more than 4 minutes

Reasons for failure (I have set workflowTaskTimeout and the activity startToCloseTimeout to 1 minute; see the sketch after this list):

  1. WorkflowTaskTimedOut
  2. ActivityTaskTimedOut
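For context, these timeouts are configured roughly like this (a minimal sketch; the workflow, activity, and task queue names are illustrative):

// In the workflow code (e.g. workflows.ts): activity startToCloseTimeout of 1 minute.
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { someActivity } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

// In the starter code (e.g. client.ts): workflowTaskTimeout of 1 minute.
import { Client, Connection } from '@temporalio/client';

async function startOne() {
  const client = new Client({ connection: await Connection.connect() });
  await client.workflow.start('myWorkflow', {
    taskQueue: 'my-task-queue',
    workflowId: `load-test-${Date.now()}`,
    workflowTaskTimeout: '1 minute',
  });
}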

We have only one task queue, from which the Worker is polling.
I am using the default worker options; I have not changed anything.
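For reference, the worker is created roughly like this (a minimal sketch; the paths and task queue name are illustrative, and the commented-out options are the ones I could presumably tune):

// Worker process: everything left at defaults.
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

async function run() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'my-task-queue',
    // Currently defaults; candidates for tuning under load:
    // maxConcurrentActivityTaskExecutions, maxConcurrentWorkflowTaskExecutions,
    // maxCachedWorkflows
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});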

Observations from the Temporal UI suggest that, in the workflows that take too much time, there is a large delay between transitions such as ActivityTaskScheduled to ActivityTaskStarted and WorkflowTaskScheduled to WorkflowTaskStarted.

My Questions:

  1. Why is it taking so much time? All the executions should have completed successfully in less than 1 minute.
  2. How can I fix the activity task and workflow task timeouts?
  3. What changes should I be making to the Worker options?
  4. Should I make any change to the task queue? Could the task queue getting flooded be the reason for the slow executions?

Can anybody please help?


What's the TS SDK version you use? How many worker processes do you deploy?

WorkflowTaskTimeOut to 1 minute

This is not typically recommended. Is your CPU % on the workers very high? If possible, keep the default (10s; don't explicitly set it).

Let's start off with service metrics. Please share:

persistence latency:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

sync match:
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

workflow lock contention:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

shard lock contention:
histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))

resource exhausted:
sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

Hi Tihomir,
Thanks for your quick reply.

Please find details below.
I started the load test at 3:12:30 PM.
We are using TypeScript SDK version 1.9.0.

We have our own Kubernetes Pods, which work as Workers.
Configurations: (Replicas: Min: 1, Max:5, Desired: 2), CPU: 750m, Memory: 1000 Mi

This was done because we were getting WorkflowTaskTimedOut errors in the executions, hence we increased it. But it did not really help much.

Can you describe your use case for the load test? Basically, what do the workflows you are running do? Is it indicative of, or close to, the real use case you intend to run on your cluster?

  • Persistence latency seems pretty high. It should typically be in the 100s of ms; you might need to look at the size of your DB.

  • Sync match rate: looks like we need to look at your workers next, after we are done with the server side.

  • Workflow lock contention: seems pretty high, as in your single executions seem to have a lot of updates (signals? async activities/child workflows completing at the same time? timers firing at the same time? a combination of all of these?).

  • Shard lock contention: what numHistoryShards did you set in static config? It seems we need to try to increase that. How many history hosts are you running? I would first look into this and the persistence latencies, try to reduce those, and then move on to the next steps.
    (Note: if you change numHistoryShards you need to stand up a new cluster, including the persistence store, since it needs to be re-indexed.)

  • Resource exhausted on BusyWorkflow: this is related to your high workflow lock contention, meaning you are probably starting too many activities/child workflows from a single workflow execution, or your activities might all be heartbeating at a very high rate. The typical recommendation would be to start fewer activities/child workflows concurrently, or, if the issue is heartbeats, to increase the heartbeat timeout by some small value and test again. A rough sketch of batching activity starts follows this list.
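As a rough sketch (the activity and workflow names here are just illustrative), batching activity starts inside the workflow could look something like:

import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { processItem } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

// Run activities in small batches instead of all at once, so a single
// execution generates fewer concurrent updates (less workflow lock contention).
export async function batchedWorkflow(items: string[]): Promise<void> {
  const batchSize = 5; // tune based on observed contention
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    await Promise.all(batch.map((item) => processItem(item)));
  }
}

This keeps the number of in-flight activities per execution bounded by the batch size.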

Our use case for the load test is to see whether dynamic workflows (different and multiple activities) created by end users will execute efficiently at that load. We won't have dedicated workers for a single workflow; I expect multiple workflows to be executed by a single worker. We will have M workflows, and each can have N executions. I want to generate maximum traffic with the load test so I can check how Temporal holds up when multiple users execute multiple workflows.

Here's an observation that might help:
When I run the load test, the initial executions complete within a few seconds (3-4 seconds), but subsequent executions take much longer (maybe the workers are busy).

If I am not wrong, are you talking about the persistence Postgres DB instance where the temporal and temporal_visibility databases are created? If so, please check the following screenshot:

We don’t have child workflows or signals as of now, but we do have a number of async activities inside a workflow.

We have numHistoryShards set to 512.
Here is our services configuration:

History: Replica: 5, CPU: 2000m, Memory: 8 Gi
Worker: Replica: 3, CPU: 500m, Memory: 1 Gi
Frontend: Replica: 3, CPU: 500m, Memory: 1 Gi
Matching: Replica: 3, CPU: 500m, Memory: 1 Gi

We are not using child workflows. We are not yet using the activity heartbeat, as I expect the activities to complete in 2-3 seconds.

I have checked the minimal-values.yaml file, in which I found the configuration below for my persistence Postgres DBs temporal and temporal_visibility. Could this contribute to the slow performance? Should I increase it?

maxConns: 20
maxConnLifetime: "1h"