Temporal throughput

We are doing some testing with Temporal to understand how to tune all the available options to improve the throughput of the system.

We are using three worker nodes to execute tasks, each with 4 cores and 8GB of memory. The workflow has 5 local activities, each taking 5-10ms.

We found that CPU usage is low, around 20%. Even after trying different configuration options, we are unable to increase either the throughput (approximately 210 combined) or the CPU usage.

We have modified the following values:
WorkerFactoryOptions.setMaxWorkflowThreadCount - up to 2400
WorkerFactoryOptions.setWorkflowCacheSize - up to 2400
WorkerOptions.setMaxConcurrentActivityExecutionSize - up to 800
WorkerOptions.setMaxConcurrentWorkflowTaskExecutionSize - up to 800
WorkerOptions.setMaxConcurrentLocalActivityExecutionSize - up to 800

WorkerFactoryOptions.setWorkflowHostLocalPollThreadCount - up to 240
WorkerOptions.setWorkflowPollThreadCount - up to 240
WorkerOptions.setActivityPollThreadCount - up to 240

numHistoryShards - 512, default in Helm Chart
matching.numTaskqueueReadPartitions - Unchanged, the default value is 4
matching.numTaskqueueWritePartitions - Unchanged, not sure about the default value

  • Is there any other configuration that can be adjusted to improve the throughput?
  • Are there any recommended settings for the above configurations?
  • What can be a possible bottleneck here?
  • What is the difference between WorkerFactoryOptions.setWorkflowHostLocalPollThreadCount and WorkerOptions.setWorkflowPollThreadCount?

What is the DB CPU utilization?

DB CPU utilization was 80% during the test.

I would compare DB update latency to the service request latency. If they correlate then DB is the bottleneck.
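One rough way to check whether the two latencies move together is to export both series at the same scrape interval and compute their correlation. The sketch below is only an illustration: the class name, the method, and the sample values are made up, and a value close to 1 is a hint that the DB is the bottleneck rather than proof.

```java
// Hypothetical helper, not part of Temporal or Cassandra tooling.
class LatencyCorrelation {

    // Pearson correlation coefficient of two equally sized series.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0 || varY == 0) {
            throw new IllegalArgumentException("a constant series has no defined correlation");
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Made-up samples in milliseconds, one per scrape interval.
        double[] dbUpdateLatencyMs = {20, 35, 60, 110, 95};
        double[] serviceLatencyMs = {45, 70, 115, 210, 180};
        System.out.printf("correlation = %.3f%n", pearson(dbUpdateLatencyMs, serviceLatencyMs));
    }
}
```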

Hi @maxim, a few observations from the Cassandra cluster.

  • When there is no Workflow execution happening:
    Reads/sec - 10K (tasks table)
    Writes/sec - Negligible

  • When we are executing Workflows (Workflow has 5 Local Activities) at 200 TPS,
    Reads/sec - 17K (tasks, executions, history_node table)
    Writes/sec - 1.2K LOCAL_QUORUM, 1.6K LWT (executions, history_node, history_tree table)

We wanted to know if there is any configuration that can bring down the number of reads/writes happening on Cassandra, for example, task polling frequency. When we add a few Normal Activities, this is only going to increase.

Also, is there any recommended setting for the Cassandra cluster if we want to execute 2000 workflows/sec?

Hey @amlanroy1980,

You described the hardware for your Workflow workers and the CPU utilization on Cassandra, but what about the hardware for the actual Temporal Cluster itself? Or are your Workflow workers and the Temporal Cluster co-located on the same hardware?

As for hardware provisioning for a specific workload, Temporal typically measures throughput in terms of “state transitions” rather than workflows per second. The reason is that workflows per second can vary drastically depending on what the Workflows are doing, so it is hard to define what X workflows per second means.

You can measure state transitions for your throughput using the following Server metric (example PromQL here):

 sum (rate(persistence_requests{operation=~"CreateWorkflowExecution|UpdateWorkflowExecution"}[1m]))

Once you have measured this for your workload, multiply it by the factor by which you want to scale the workload up.

As a general rule of thumb, each vCPU core (as defined by AWS) of Cassandra will give you ~60 state transitions per second. Each vCPU core that is allocated to the Temporal Cluster will give you ~150 state transitions per second. Our recommended Cassandra setup uses a replication factor of 3.

The second thing to consider is increasing the number of shards. Based on your estimated state transition load, plan for around 3 state transitions per second per shard. So if your load is estimated to be 10,000 state transitions per second, that works out to roughly 3,333 shards, which you would round up to 4,096 (the next power of 2).
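The rules of thumb above can be turned into a small back-of-the-envelope calculation. This is a minimal sketch, not an official sizing tool: the class and method names are made up, the measured rate of 1,000 state transitions per second is a hypothetical input, and it assumes the per-vCPU figures apply to the total core count of each cluster.

```java
// Hypothetical sizing helper based on the rule-of-thumb numbers quoted above.
class TemporalSizing {
    static final double TRANSITIONS_PER_CASSANDRA_VCPU = 60.0;
    static final double TRANSITIONS_PER_TEMPORAL_VCPU = 150.0;
    static final double TRANSITIONS_PER_SHARD = 3.0;

    // Scale a measured state transition rate by the desired growth factor.
    static double targetTransitions(double measuredPerSec, double scaleFactor) {
        return measuredPerSec * scaleFactor;
    }

    static int cassandraVcpus(double transitionsPerSec) {
        return (int) Math.ceil(transitionsPerSec / TRANSITIONS_PER_CASSANDRA_VCPU);
    }

    static int temporalVcpus(double transitionsPerSec) {
        return (int) Math.ceil(transitionsPerSec / TRANSITIONS_PER_TEMPORAL_VCPU);
    }

    // Shard count rounded up to the next power of two.
    static int historyShards(double transitionsPerSec) {
        int needed = (int) Math.ceil(transitionsPerSec / TRANSITIONS_PER_SHARD);
        return nextPowerOfTwo(needed);
    }

    static int nextPowerOfTwo(int n) {
        if (n <= 1) {
            return 1;
        }
        int highest = Integer.highestOneBit(n);
        return highest == n ? n : highest << 1;
    }

    public static void main(String[] args) {
        // Hypothetical: 1,000 transitions/s measured, workload to grow 10x.
        double target = targetTransitions(1_000, 10);
        System.out.println("target transitions/s: " + target);
        System.out.println("Cassandra vCPUs:      " + cassandraVcpus(target));
        System.out.println("Temporal vCPUs:       " + temporalVcpus(target));
        System.out.println("history shards:       " + historyShards(target));
    }
}
```

For a target of 10,000 state transitions per second this gives 4,096 shards, matching the example in the text. Treat the vCPU counts as a starting point to validate with a load test.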

Finally, can you also tell us what your sync-match rate is? It’s basically the ratio of these two metrics:

sum(rate(poll_success_sync{cluster="$cluster"}[1m]))
sum(rate(poll_success{cluster="$cluster"}[1m]))

This will tell us if there is some artificial bottleneck in terms of workers.

Also curious what these metrics look like for you during the run:

sum (rate(persistence_requests{cluster="$cluster",temporal_service_type="matching", operation="CreateTask"}[1m]))

Thanks,
Manu

Thanks @manu, for the detailed answer, it is very informative. I am trying to collect the metrics that you have asked for. I am facing some problems with running Temporal with external Prometheus.

We are installing Temporal using Helm Chart.

  • What changes will be required in the values.yaml files (prometheus.enabled, etc) to use an existing Prometheus?
  • What changes will be required in prometheus.yml to scrape Temporal metrics?

Please let me know if there is any documentation I can refer to for this.

If you are using an external Prometheus, you can set prometheus.enabled to false.

If you are deploying the Prometheus Operator in the same Kubernetes cluster that your Temporal cluster runs in, you can set up ServiceMonitors with the Helm chart, including relabeling etc.: helm-charts/values.yaml at master · temporalio/helm-charts · GitHub

If you are using a Prometheus that is not managed by the operator, you can configure it to scrape the Temporal services on port 9090: helm-charts/server-service.yaml at master · temporalio/helm-charts · GitHub

Prometheus will need access to that port, so if your Prometheus is external to your Kubernetes cluster you may need to do some extra work to expose that port for each pod outside the cluster (this is tricky, but if you’re doing this already you’ll know what I’m talking about).

As for configuration, we don’t require anything specific. It will depend more on whether there are any metric or label collisions with other things you’re scraping.

Hi @manu , I am sorry for the delay in replying to your post. It took us some time to collect the metrics you wanted to see. We also did some more testing during this period.

Just a quick summary of what we are doing:

  • We are running performance tests for a sample workflow that consists of 5 local activities, each of which has a delay of 200ms.
  • The number of history shards is 512, numTaskqueueReadPartitions is set to default value 4.
  • We have seen that we are getting the best throughput with setWorkflowHostLocalPollThreadCount(160), setWorkflowPollThreadCount(160), setActivityPollThreadCount(160).
  • During the tests, we have seen that the CPU usage on the workflow/activity workers is low (<10%). Cassandra is the bottleneck: its CPU usage goes above 80% and LWT write latency goes above 100ms.
  • Hardware used
    Temporal History - 2 Core, 2 GiB, 8 instances
    Temporal Matching - 2 Core, 2 GiB, 3 instances
    Temporal Frontend - 2 Core, 2 GiB, 3 instances
    Activity and Workflow Worker - 4 Core, 8 GiB, 3 instances
    Cassandra - 8 Core, 28 GiB, 3 instances

We had a couple of questions:

  • Since we are already using local activities, what other config changes can we make to reduce the load on Cassandra?
  • Is there any optimization you can suggest based on the metrics below?

Hi @manu, not sure if you had an opportunity to look into my previous message. Please let me know if you think we can optimize our settings for better throughput.

Also, is there any option to reduce the load on Cassandra?

Hi,

Sorry for the late response.

Based on your graphs, you have a high sync-match rate, which means you have sufficient workers to process the load. The relatively low CreateTask rate also backs this up.

It is not clear to me that you will be able to further reduce the load on Cassandra for the given Workload.

From a hardware point of view though, I would have expected you to be able to get about 2x the throughput for 80% utilization.

Are you able to share the test Workflow code that you are running? Would be more than happy to try and execute it ourselves and see what kind of throughput we are getting.

Can you also report the P95 latency you are seeing for the UpdateWorkflowExecution persistence call?

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{operation="UpdateWorkflowExecution"}[1m])) by (operation, le))

Thanks,
Manu

Also, is there anything you can tell us about Disk Queue Length / Provisioned IOPS for the storage backing the Cassandra nodes?

Hi @manu,
Thank you for the reply. We repeated the performance test using the same Workflow, this time in an environment that is very similar to our production setup.

  • Hardware used
    Temporal History - 4 Core, 12 GiB, 8 instances
    Temporal Matching - 4 Core, 12 GiB, 4 instances
    Temporal Frontend - 4 Core, 4 GiB, 4 instances
    Activity and Workflow Worker - 4 Core, 8 GiB, 6 instances
    Cassandra - 16 Core, 112 GiB, 6 instances
  • matching.numTaskqueueReadPartitions - 300
  • numHistoryShards - 8192
  • setWorkflowHostLocalPollThreadCount(160), setWorkflowPollThreadCount(160), setActivityPollThreadCount(160)

Now we can execute 1000 workflows per second.

  • Also, for this, I can see 16k LOCAL_QUORUM reads, 3.5k LOCAL_QUORUM writes, 4.8k LOCAL_SERIAL writes. Do these numbers look fine? Our Workflow consists of 5 Local Activities, each with an artificial delay of 200ms.
  • I am sending the metrics for this new test. Please let me know if we can change any configuration based on these numbers. In the meantime, I will clean up the code and share it with you.
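To compare runs at different throughput levels, it can help to normalize the Cassandra numbers to operations per workflow. The sketch below does that with the figures quoted in this thread; the class and method names are made up, and it assumes the LWT and LOCAL_SERIAL figures describe the same kind of write. Reads from the first run are left out because they include the idle polling baseline.

```java
// Hypothetical helper for normalizing DB operation rates per workflow.
class DbOpsPerWorkflow {

    static double perWorkflow(double opsPerSec, double workflowsPerSec) {
        if (workflowsPerSec <= 0) {
            throw new IllegalArgumentException("workflowsPerSec must be positive");
        }
        return opsPerSec / workflowsPerSec;
    }

    public static void main(String[] args) {
        // First run: 200 workflows/s, 1.2K LOCAL_QUORUM + 1.6K LWT writes/s.
        double firstRunWrites = perWorkflow(1_200 + 1_600, 200);
        // Second run: 1000 workflows/s, 3.5k LOCAL_QUORUM + 4.8k LOCAL_SERIAL writes/s.
        double secondRunWrites = perWorkflow(3_500 + 4_800, 1_000);
        // Second run: 16k LOCAL_QUORUM reads/s.
        double secondRunReads = perWorkflow(16_000, 1_000);

        System.out.printf("writes per workflow, first run:  %.1f%n", firstRunWrites);
        System.out.printf("writes per workflow, second run: %.1f%n", secondRunWrites);
        System.out.printf("reads per workflow, second run:  %.1f%n", secondRunReads);
    }
}
```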

You do not need 300 task queue partitions. The default of 4 should do the work; if that is not enough, try 8.

If you reduce the task queue partition count as described above, the rate of CreateTask persistence API calls should drop, which means less DB capacity is used.

poll_success_sync / poll_success should be as close to 1 as possible. In practice, this number should be > 0.9 or 0.95.
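As a small illustration of that check, the sketch below computes the sync-match rate from the two per-second rates and compares it against the lower 0.9 bound. The class name, method names, and sample inputs are made up; the two inputs would come from the poll_success_sync and poll_success queries shown earlier in the thread.

```java
// Hypothetical helper for evaluating the sync-match rate.
class SyncMatchCheck {
    // Lower of the two thresholds mentioned above.
    static final double MIN_HEALTHY_RATIO = 0.9;

    // Ratio of synchronously matched polls to all successful polls.
    static double syncMatchRate(double pollSuccessSyncPerSec, double pollSuccessPerSec) {
        if (pollSuccessPerSec <= 0) {
            throw new IllegalArgumentException("pollSuccessPerSec must be positive");
        }
        return pollSuccessSyncPerSec / pollSuccessPerSec;
    }

    static boolean isHealthy(double ratio) {
        return ratio > MIN_HEALTHY_RATIO;
    }

    public static void main(String[] args) {
        // Made-up sample rates.
        double ratio = syncMatchRate(950, 1_000);
        System.out.printf("sync-match rate = %.2f, healthy = %b%n", ratio, isHealthy(ratio));
    }
}
```

A low ratio suggests tasks are being written to the DB and read back instead of being handed straight to a waiting poller, which usually points at too few pollers or too many partitions.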