Temporal throughput

We are doing some testing with Temporal to understand how to tune all the available options to improve the throughput of the system.

We are using three worker nodes to execute tasks, each with 4 cores and 8 GB of memory. The workflow has 5 local activities, each of which takes 5-10 ms.

We found that the CPU usage on the workers is low, around 20%. We are unable to increase the throughput (approximately 210 workflows/sec combined) or the CPU usage, even after trying different configuration options.

We have modified the following values (a sketch of how these options are wired up in the Java SDK is included below):
WorkerFactoryOptions.setMaxWorkflowThreadCount - up to 2400
WorkerFactoryOptions.setWorkflowCacheSize - up to 2400
WorkerOptions.setMaxConcurrentActivityExecutionSize - up to 800
WorkerOptions.setMaxConcurrentWorkflowTaskExecutionSize - up to 800
WorkerOptions.setMaxConcurrentLocalActivityExecutionSize - up to 800

WorkerFactoryOptions.setWorkflowHostLocalPollThreadCount - up to 240
WorkerOptions.setWorkflowPollThreadCount - up to 240
WorkerOptions.setActivityPollThreadCount - up to 240

numHistoryShards - 512 (the default in the Helm chart)
matching.numTaskqueueReadPartitions - unchanged; the default value is 4
matching.numTaskqueueWritePartitions - unchanged; not sure about the default value
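
For reference, here is a minimal sketch of how these knobs fit together in the Java SDK. The values and the task queue name are illustrative (taken from the ranges above, not recommendations), and some of these setters have been renamed or deprecated in newer SDK versions.

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class TunedWorkerSetup {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Factory-level knobs: workflow thread pool, sticky cache size, host-local pollers.
    WorkerFactoryOptions factoryOptions =
        WorkerFactoryOptions.newBuilder()
            .setMaxWorkflowThreadCount(2400)
            .setWorkflowCacheSize(2400)
            .setWorkflowHostLocalPollThreadCount(240)
            .build();
    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

    // Worker-level knobs: concurrent execution slots and poller thread counts.
    WorkerOptions workerOptions =
        WorkerOptions.newBuilder()
            .setMaxConcurrentActivityExecutionSize(800)
            .setMaxConcurrentWorkflowTaskExecutionSize(800)
            .setMaxConcurrentLocalActivityExecutionSize(800)
            .setWorkflowPollThreadCount(240)
            .setActivityPollThreadCount(240)
            .build();
    // "SAMPLE_TASK_QUEUE" is a hypothetical task queue name.
    Worker worker = factory.newWorker("SAMPLE_TASK_QUEUE", workerOptions);

    // Workflow and activity implementations are registered on the worker here, then:
    factory.start();
  }
}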

  • Is there any other configuration that can be adjusted to improve the throughput?
  • Are there any recommended settings for the above configurations?
  • What can be a possible bottleneck here?
  • What is the difference between WorkerFactoryOptions.setWorkflowHostLocalPollThreadCount and WorkerOptions.setWorkflowPollThreadCount?

What is the DB CPU utilization?

DB CPU utilization was 80% during the test.

I would compare DB update latency to the service request latency. If they correlate, then the DB is the bottleneck.

Hi @maxim, a few observations from the Cassandra cluster.

  • When there is no Workflow execution happening:
    Reads/sec - 10K (tasks table)
    Writes/sec - Negligible

  • When we are executing Workflows (each with 5 Local Activities) at 200 TPS:
    Reads/sec - 17K (tasks, executions, history_node tables)
    Writes/sec - 1.2K LOCAL_QUORUM, 1.6K LWT (executions, history_node, history_tree tables)

We wanted to know if there is any configuration that can bring down the number of reads/writes happening on Cassandra, for example the task polling frequency. Once we add a few normal (non-local) Activities, this load is only going to increase.

Also, is there any recommended setting for the Cassandra cluster if we want to execute 2000 workflows/sec?

Hey @amlanroy1980,

You described the hardware for your Workflow workers and the CPU utilization on Cassandra, but what about the hardware for the actual Temporal Cluster itself? Or are your Workflow workers and the Temporal Cluster co-located on the same hardware?

As far as hardware provisioning for a specific workload goes, Temporal typically measures throughput in terms of “state transitions” rather than workflows per second. The reason is that workflows per second can vary drastically based on what the Workflows are doing, so it is hard to define what X workflows per second means.

You can measure state transitions for your throughput using the following Server metric (example PromQL here):

 sum (rate(persistence_requests{operation=~"CreateWorkflowExecution|UpdateWorkflowExecution"}[1m]))

Once you measure this for your workload, just multiply by the factor you want to scale your workload up to.

As a general rule of thumb, each vCPU core (as defined by AWS) of Cassandra will give you ~60 state transitions per second, and each vCPU core allocated to the Temporal Cluster will give you ~150 state transitions per second. Our recommended Cassandra setup uses a replication factor of 3.

The second thing to consider is increasing the number of shards. Based on your estimated state transition load, plan for around 3 state transitions per second per shard. So if your state transition load is estimated to be 10,000, then you want roughly 4,096 shards (rounding up to the nearest power of 2); see the quick sizing sketch below.
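
To make that arithmetic concrete, here is a small back-of-the-envelope sketch (purely illustrative) that applies the rules of thumb above to a target state-transition rate:

public class CapacityEstimate {
  public static void main(String[] args) {
    int targetStateTransitionsPerSec = 10_000; // example target from the text above

    // Rules of thumb: ~60 state transitions/sec per Cassandra vCPU,
    // ~150 per vCPU allocated to the Temporal Cluster.
    long cassandraVcpus = (long) Math.ceil(targetStateTransitionsPerSec / 60.0);
    long temporalVcpus = (long) Math.ceil(targetStateTransitionsPerSec / 150.0);

    // ~3 state transitions/sec per shard, rounded up to the nearest power of two.
    int shards = nextPowerOfTwo((int) Math.ceil(targetStateTransitionsPerSec / 3.0));

    System.out.printf("Cassandra vCPUs: %d, Temporal vCPUs: %d, shards: %d%n",
        cassandraVcpus, temporalVcpus, shards);
    // Prints: Cassandra vCPUs: 167, Temporal vCPUs: 67, shards: 4096
  }

  private static int nextPowerOfTwo(int n) {
    int p = 1;
    while (p < n) p <<= 1;
    return p;
  }
}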

Finally, can you also tell us what your sync-match rate is? It’s basically the ratio of these two metrics:

sum(rate(poll_success_sync{cluster="$cluster"}[1m]))
sum(rate(poll_success{cluster="$cluster"}[1m]))

This will tell us if there is some artificial bottleneck in terms of workers.

Also curious what this metric looks like for you during the run:

sum (rate(persistence_requests{cluster="$cluster",temporal_service_type="matching", operation="CreateTask"}[1m]))

Thanks,
Manu

Thanks @manu for the detailed answer; it is very informative. I am trying to collect the metrics you asked for, but I am facing some problems running Temporal with an external Prometheus.

We are installing Temporal using the Helm chart.

  • What changes will be required in the values.yaml file (prometheus.enabled, etc.) to use an existing Prometheus?
  • What changes will be required in prometheus.yml to scrape Temporal metrics?

Please let me know if there is any documentation I can refer to for this.

If you are using an external Prometheus, you can set prometheus.enabled to false.

If you are deploying the Prometheus Operator in the Kubernetes cluster that your Temporal cluster is running in, you can set up ServiceMonitors (including relabeling, etc.) via the Helm chart: helm-charts/values.yaml at master · temporalio/helm-charts · GitHub

If you are using a Prometheus instance that is not managed by the operator, you can configure it to scrape the Temporal services on port 9090: helm-charts/server-service.yaml at master · temporalio/helm-charts · GitHub

Prometheus will need access to that port, so if your Prometheus is external to your Kubernetes cluster you may need to do some extra work to expose that port for each pod outside the cluster (this is tricky, but if you’re doing this already you’ll know what I’m talking about).

As for configuration, we don’t require anything specific; it will depend more on whether there are any metric/label collisions with other things you’re scraping.

Hi @manu, I am sorry for the delay in replying to your post. It took us some time to collect the metrics you wanted to see. We also did some more testing during this period.

Just a quick summary of what we are doing:

  • We are running performance tests for a sample workflow that consists of 5 local activities, each of which has a delay of 200ms (a rough sketch of this workflow shape is included after this list).
  • The number of history shards is 512, and numTaskqueueReadPartitions is set to the default value of 4.
  • We have seen that we get the best throughput with setWorkflowHostLocalPollThreadCount(160), setWorkflowPollThreadCount(160), and setActivityPollThreadCount(160).
  • During the tests, we have seen that the CPU usage on the workflow/activity workers is low (<10%). Cassandra is the bottleneck, as its CPU usage goes above 80% and LWT write latency goes above 100ms.
  • Hardware used
    Temporal History - 2 Core, 2 GiB, 8 instances
    Temporal Matching - 2 Core, 2 GiB, 3 instances
    Temporal Frontend - 2 Core, 2 GiB, 3 instances
    Activity and Workflow Worker - 4 Core, 8 GiB, 3 instances
    Cassandra - 8 Core, 28 GiB, 3 instances
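
For context, the benchmark workflow mentioned in the first bullet is shaped roughly like the sketch below. The class and method names are hypothetical and the timeout is arbitrary; it only illustrates five sequential local activities, each simulating ~200ms of work.

import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityMethod;
import io.temporal.activity.LocalActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

public class SampleBenchmark {

  @ActivityInterface
  public interface Steps {
    @ActivityMethod
    void doStep(int step);
  }

  public static class StepsImpl implements Steps {
    @Override
    public void doStep(int step) {
      try {
        Thread.sleep(200); // simulate ~200ms of work per local activity
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  @WorkflowInterface
  public interface SampleWorkflow {
    @WorkflowMethod
    void run();
  }

  public static class SampleWorkflowImpl implements SampleWorkflow {
    private final Steps steps =
        Workflow.newLocalActivityStub(
            Steps.class,
            LocalActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofSeconds(5)) // arbitrary timeout for the sketch
                .build());

    @Override
    public void run() {
      // Five sequential local activities, mirroring the benchmark described above.
      for (int i = 1; i <= 5; i++) {
        steps.doStep(i);
      }
    }
  }
}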

We had a couple of questions:

  • Since we are already using local activities, what other config changes can we make to reduce the load on Cassandra?
  • Is there any optimization you can suggest based on the metrics below?




Hi @manu, not sure if you had an opportunity to look into my previous message. Please let me know if you think we can optimize our settings for better throughput.

Also, is there any option to reduce the load on Cassandra?