Hello,
We have begun to test Temporal in our staging environment, running within Kubernetes with a Postgres 12 database. We simply want to better understand how to run Temporal, the workers and find the bottleneck of the whole system. We have briefly outlined our results in the benchmark below and we would like to discuss its outcome.
Benchmark
For the first benchmark, we want to see how Temporal performs with limited resources. Our aim is to see how close we can get to a throughput of 100 workflows/second with a reasonable average schedule to finish time for the workflows of less than 30s (for simple workflows that do not wait for user input).
For this experiment, we are using a basic implementation of a DSL interpreter workflow, which takes as input a json with a list of activities that are to be called. We have 5 actions implemented within the same activity, that only have one logging instruction inside. In the workflow implementation we are logging the workflow start and end.
Environment
We are running Temporal with Kubernetes, we don’t use Helm, but we have defined our own k8s deployment files.
We are not aiming for resilience, but instead we run just one service per type (frontend, matching, history, worker). Each has the memory allocation limited to 2Gb. We have enough CPU cores available and we do not consider any limits here. If during the experiment the CPU usage reaches the maximum available within our cluster we will reconsider this assumption.
For our workers, we have run multiple scenarios, where a worker is handling both the Workflows and Activities on the same task list, and when we used separate workers for Workflows and Activities. In both scenarios, the workers had the memory allocation limited to 2Gb.
As we will be running Temporal in AWS, we first researched if we can make use of AWS Keyspaces instead of managing our own Cassandra instance. Because this option is not available at this time, we have decided to run Temporal with Postgres 12 for as long as we can scale it at a reasonable cost and wait for AWS Keyspaces compatibility.
Implementation
We have a simple Java DSL Interpreter Workflow implementation that at this point in time it parses a list of activities and calls them by name:
…
@Override
public void run(String dslInput) {
Dsl dsl = mapper.readValue(dslInput, Dsl.class);
…
for (DslActivity dslActivity : dsl.getActivities()) {
Workflow.getLogger(this.getClass().getSimpleName())
.info(
"Calling activity {} from workflow {} {}",
dslActivity.getName(),
Workflow.getInfo().getWorkflowId(),
Workflow.getInfo().getRunId()
);
activity.execute(dslActivity.getName(), String.class, dslActivity.getInput());
}
…
}
The activities don’t do anything, just log their execution. We realise that this is a bit of a stretch that will put more load on Temporal. We could put some delays within the activity execution to simulate some business logic execution, but we are not doing that at this point.
Configuration
We are running Temporal with numHistoryShards: 512
For the workers we doubled the parameters in the default configuration for the factory and 4 times for the worker:
- name: TEMPORAL_FACTORY_WORKFLOW_CACHE_SIZE
value: "1200"
- name: TEMPORAL_FACTORY_MAX_WORKFLOW_THREAD_COUNT
value: "1200"
- name: TEMPORAL_FACTORY_WORKFLOW_HOST_LOCAL_POLL_THREAD_COUNT
value: "10"
- name: TEMPORAL_WORKER_ACTIVITY_POLL_THREAD_COUNT
value: "20"
- name: TEMPORAL_WORKER_WORKFLOW_THREAD_COUNT
value: "8"
- name: TEMPORAL_WORKER_MAX_CONCURRENT_ACTIVITY_EXECUTION_SIZE
value: "800"
- name: TEMPORAL_WORKER_MAX_CONCURRENT_LOCAL_ACTIVITY_EXECUTION_SIZE
value: "800"
- name: TEMPORAL_WORKER_MAX_CONCURRENT_WORKFLOW_TASK_EXECUTION_SIZE
value: "800"
Results
We are running the experiment as following:
- Initiate 2000 workflow executions without any worker alive
- Start the Workflow worker and wait until all workflows are processes
We are looking for the following metrics to evaluate the results
- WPS: average workflow terminations / second
- APS: average activity completions / second
Additionally, we also look at the standard deviation for the above metrics, the delay in activity and workflows execution, but for the sake of simplicity, we are omitting them here.
On the resources utilisation side, we are monitoring the DB, Temporal and Workers CPU and Memory utilisation.
DB - 2vCores, 8Gb Memory
In this case, we are basically reaching 100% CPU utilisation in any configuration: single worker, separate workflow/activity workers or even multiple workflow workers. We would have expected that using multiple workers would allow for a better cache utilisation, and therefore, most of the workflow objects would sit in cache, but instead we did not see any performance increase (maybe even a slight decrease).
We have measures WPS in the interval [8, 10] and APS in the interval [40-50] during multiple runs.
DB - 4vCores, 16Gb Memory
We doubled the size of our database, case in which we have tried the following scenarios:
-
1x workflow, 1x activity workers:
WPS - 10.9
DB CPU utilisation - 40% -
2x workflow, 2x activity workers:
WPS - 16
DB CPU utilisation - 80% -
2x workflow, 4x activity workers:
WPS - 14.5
DB CPU utilisation - 80%
The results lead us to believe that the DB can handle more concurrent workflow executions, therefore, running more workflow workers increases the throughput. As expected, more activity workers do not improve the results, as they don’t do any complex work.
Discussion
We would have expected a better throughput for this database size both in terms of WPS and APS. Our investigation into the database performance has highlighted a single query makes most of the load onto the DB:
INSERT INTO history_node (shard_id, tree_id, branch_id, node_id, txn_id, data, data_encoding) VALUES ($1, $2, $3, $4, $5, $6, $7)
In order to mitigate its impact and to reduce its time to wait for locks, we have tried partitioning the history_node
table by shard_id (creating 512 partitions). Unfortunately, instead of seeing a performance increase, we have noticed a 20-30% decrease in WPS. This is very curious and we did not yet understand why this happens.