Running Temporal + Postgres - Benchmark

Hello,

We have begun to test Temporal in our staging environment, running within Kubernetes with a Postgres 12 database. We simply want to better understand how to run Temporal, the workers and find the bottleneck of the whole system. We have briefly outlined our results in the benchmark below and we would like to discuss its outcome.

Benchmark

For the first benchmark, we want to see how Temporal performs with limited resources. Our aim is to see how close we can get to a throughput of 100 workflows/second with a reasonable average schedule to finish time for the workflows of less than 30s (for simple workflows that do not wait for user input).

For this experiment, we are using a basic implementation of a DSL interpreter workflow, which takes as input a json with a list of activities that are to be called. We have 5 actions implemented within the same activity, that only have one logging instruction inside. In the workflow implementation we are logging the workflow start and end.

Environment

We are running Temporal with Kubernetes, we don’t use Helm, but we have defined our own k8s deployment files.

We are not aiming for resilience, but instead we run just one service per type (frontend, matching, history, worker). Each has the memory allocation limited to 2Gb. We have enough CPU cores available and we do not consider any limits here. If during the experiment the CPU usage reaches the maximum available within our cluster we will reconsider this assumption.

For our workers, we have run multiple scenarios, where a worker is handling both the Workflows and Activities on the same task list, and when we used separate workers for Workflows and Activities. In both scenarios, the workers had the memory allocation limited to 2Gb.

As we will be running Temporal in AWS, we first researched if we can make use of AWS Keyspaces instead of managing our own Cassandra instance. Because this option is not available at this time, we have decided to run Temporal with Postgres 12 for as long as we can scale it at a reasonable cost and wait for AWS Keyspaces compatibility.

Implementation

We have a simple Java DSL Interpreter Workflow implementation that at this point in time it parses a list of activities and calls them by name:


…

    @Override

    public void run(String dslInput) {

            Dsl dsl = mapper.readValue(dslInput, Dsl.class);

…

        for (DslActivity dslActivity : dsl.getActivities()) {

            Workflow.getLogger(this.getClass().getSimpleName())

                .info(

                    "Calling activity {} from workflow {} {}",

                    dslActivity.getName(),

                    Workflow.getInfo().getWorkflowId(),

                    Workflow.getInfo().getRunId()

                );

            activity.execute(dslActivity.getName(), String.class, dslActivity.getInput());

        }

…

}

The activities don’t do anything, just log their execution. We realise that this is a bit of a stretch that will put more load on Temporal. We could put some delays within the activity execution to simulate some business logic execution, but we are not doing that at this point.

Configuration

We are running Temporal with numHistoryShards: 512

For the workers we doubled the parameters in the default configuration for the factory and 4 times for the worker:


            - name: TEMPORAL_FACTORY_WORKFLOW_CACHE_SIZE

              value: "1200"

            - name: TEMPORAL_FACTORY_MAX_WORKFLOW_THREAD_COUNT

              value: "1200"

            - name: TEMPORAL_FACTORY_WORKFLOW_HOST_LOCAL_POLL_THREAD_COUNT

              value: "10"

            - name: TEMPORAL_WORKER_ACTIVITY_POLL_THREAD_COUNT

              value: "20"

            - name: TEMPORAL_WORKER_WORKFLOW_THREAD_COUNT

              value: "8"

            - name: TEMPORAL_WORKER_MAX_CONCURRENT_ACTIVITY_EXECUTION_SIZE

              value: "800"

            - name: TEMPORAL_WORKER_MAX_CONCURRENT_LOCAL_ACTIVITY_EXECUTION_SIZE

              value: "800"

            - name: TEMPORAL_WORKER_MAX_CONCURRENT_WORKFLOW_TASK_EXECUTION_SIZE

              value: "800"

Results

We are running the experiment as following:

  1. Initiate 2000 workflow executions without any worker alive
  2. Start the Workflow worker and wait until all workflows are processes

We are looking for the following metrics to evaluate the results

  1. WPS: average workflow terminations / second
  2. APS: average activity completions / second

Additionally, we also look at the standard deviation for the above metrics, the delay in activity and workflows execution, but for the sake of simplicity, we are omitting them here.

On the resources utilisation side, we are monitoring the DB, Temporal and Workers CPU and Memory utilisation.

DB - 2vCores, 8Gb Memory

In this case, we are basically reaching 100% CPU utilisation in any configuration: single worker, separate workflow/activity workers or even multiple workflow workers. We would have expected that using multiple workers would allow for a better cache utilisation, and therefore, most of the workflow objects would sit in cache, but instead we did not see any performance increase (maybe even a slight decrease).

We have measures WPS in the interval [8, 10] and APS in the interval [40-50] during multiple runs.

DB - 4vCores, 16Gb Memory

We doubled the size of our database, case in which we have tried the following scenarios:

  1. 1x workflow, 1x activity workers:
    WPS - 10.9
    DB CPU utilisation - 40%

  2. 2x workflow, 2x activity workers:
    WPS - 16
    DB CPU utilisation - 80%

  3. 2x workflow, 4x activity workers:
    WPS - 14.5
    DB CPU utilisation - 80%

The results lead us to believe that the DB can handle more concurrent workflow executions, therefore, running more workflow workers increases the throughput. As expected, more activity workers do not improve the results, as they don’t do any complex work.

Discussion

We would have expected a better throughput for this database size both in terms of WPS and APS. Our investigation into the database performance has highlighted a single query makes most of the load onto the DB:


INSERT INTO history_node (shard_id, tree_id, branch_id, node_id, txn_id, data, data_encoding) VALUES ($1, $2, $3, $4, $5, $6, $7)

In order to mitigate its impact and to reduce its time to wait for locks, we have tried partitioning the history_node table by shard_id (creating 512 partitions). Unfortunately, instead of seeing a performance increase, we have noticed a 20-30% decrease in WPS. This is very curious and we did not yet understand why this happens.

To follow-up on the results above, a few questions arise:

  1. Do you have any insights on why partitioning the history_node did not improve performance? This may be just Postgres not handling the partitioning correctly. Did you also try this approach?

  2. Do you have some numbers(wf/s, activities/s) available for running different Cassandra cluster sizes? For us, it would be interesting to compare a small Cassandra cluster with the Postgres in terms of performance. Although, the decision to manage Cassandra ourselves would be a hard sell at the moment.

  3. Do you have some guidelines onto when is it better to start scaling workers horizontally rather than vertically? From out tests we notice that it is worth increasing the configuration for the Workflow workers to get better performance, rather than running 2-4 workers with the default configuration.

Hi,

Thanks for trying Temporal.

Few things to consider:

  1. Call activity asynchronously: https://docs.temporal.io/docs/java-implementing-workflows#calling-activities-asynchronously
  2. CPU utilization of DB / services should not reach 70%, as general best practice
  3. Could you also provide the CPU / mem utilization of service hosts? (frontend / matching / history)
  4. Services, especially history service should have more memory
    NOTE: history service have in-memory cache, which benefits from more memory, smaller memory will force the history server to reload from DB more frequently
  5. Adding number of history shards will increase the upper limit of the Temporal, but the corresponding DB / server hosts also needs to be scaled accordingly.
  6. INSERT INTO history_node (shard_id, tree_id, branch_id, node_id, txn_id, data, data_encoding) VALUES ($1, $2, $3, $4, $5, $6, $7) this query is persisting the progress of the workflows.

Hey Andrei,

Can you share your Postgres configuration scripts? We are trying to get out of Cassandra, because it’s too resource intensive for medium size systems.