We are using RDS Postgres as the DB for our OSS Temporal cluster, and we saw that the DB was the bottleneck. We have upsized it twice and it is better now, but in my opinion we aren't really doing that much. We are now on a db.r6g.4xlarge instance with 512 shards, running 50 workflows per second. Each workflow has 11 activities, 47 state transitions, and about a 30 second runtime. DB CPU hits about 60%. When we were on a db.r6g.2xlarge we were pegging the CPU and getting lots of problems.
Are there any rules of thumb for how much you can run per DB size? And what are the major contributors to DB CPU in general? The connection count looks pretty stable, so I don't think it is opening lots of connections and eating up CPU. Performance Insights says COMMIT is the big item, but there isn't a ton of detail there. The next items on the list are INSERT INTO cluster_membership, then INSERT INTO history_node.
My guess is that something about how we built our workflows is causing undue load on the DB, but I don't understand enough about what is happening under the hood to know what that might be. So I wanted to get a general idea of what size DB we should expect to need, and what kinds of things add CPU load to the DB.
One thing to look into is that the server runs background jobs per shard, so even when the service is idle these can generate DB reads at some frequency. We have seen users increase the default intervals in cases where this background work was adding unwanted load (such as bumping up DB CPU use). Useful dynamic configs you can look into increasing for this are (see the sketch after this list):
- history.transferProcessorMaxPollInterval (default 1m)
- history.timerProcessorMaxPollInterval (default 5m)
- history.visibilityProcessorMaxPollInterval (default 1m)
- history.shardUpdateMinInterval (default 5m)
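For reference, a minimal sketch of what raising these could look like in the file-based dynamic config (the file your server's dynamic config client points at). The values are illustrative only, not recommendations; duration strings like "5m" are accepted by recent server versions:

```yaml
# illustrative values only: longer intervals mean less idle per-shard polling,
# at the cost of the queue processors reacting a bit more slowly in some cases
history.transferProcessorMaxPollInterval:
  - value: "5m"
history.timerProcessorMaxPollInterval:
  - value: "10m"
history.visibilityProcessorMaxPollInterval:
  - value: "5m"
history.shardUpdateMinInterval:
  - value: "10m"
```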
Another thing to look into is how much extra pressure under-provisioned SDK workers might be adding on the DB.
For this you can look at your service metric:
sum(rate(persistence_requests{operation="CreateTask"}[1m]))
When a workflow/activity task is created and routed to a matching service task queue partition, if there is no SDK worker poller available to pick it up, matching has to persist the task to the DB and then read it back out when a poller becomes available. So provisioning your SDK workers so that the service rarely needs to write/read tasks to/from the DB can help lessen the DB load.
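To illustrate the worker-side knobs involved, here is a Go SDK sketch; the task queue name and the numbers are placeholders, not recommendations, and other SDKs have equivalent options under different names:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Assumes a Temporal frontend reachable at the default address.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	// Placeholder values: more pollers / larger execution slots increase the
	// chance a task is handed directly to a waiting poller instead of being
	// persisted to the DB by matching and read back later.
	w := worker.New(c, "your-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskPollers:       8,
		MaxConcurrentActivityTaskPollers:       8,
		MaxConcurrentWorkflowTaskExecutionSize: 200,
		MaxConcurrentActivityExecutionSize:     200,
	})

	// Register workflows/activities as usual, e.g. w.RegisterWorkflow(...).

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited", err)
	}
}
```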
You can also look at your workflow execution event histories, specifically how large they are, using the visibility search attributes HistorySizeBytes and HistoryLength.
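For example, to find executions with oversized histories you could query on those attributes. A Go SDK sketch; the 1 MiB threshold is arbitrary, and your server/visibility store needs to be on a version that exposes these attributes:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	// Arbitrary example threshold: executions whose history exceeds 1 MiB.
	resp, err := c.ListWorkflow(context.Background(), &workflowservice.ListWorkflowExecutionsRequest{
		Query: "HistorySizeBytes > 1048576",
	})
	if err != nil {
		log.Fatalln("list failed", err)
	}
	for _, e := range resp.Executions {
		fmt.Println(e.Execution.WorkflowId, e.Execution.RunId)
	}
}
```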
If your event histories are very large in terms of bytes, that means larger reads/writes; if they are large in terms of event count, your workers may have to call the GetWorkflowExecutionHistory API more often since that API is paginated.
Tuning your SDK worker cache size against the memory (heap size) you give your worker pods can lower the need to re-fetch event histories, which also helps on the DB end.
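In the Go SDK, for example, the workflow cache is process-wide and is set before starting any workers. A minimal sketch; the value is just an example, size it against the heap you actually give the pod:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Must be called before any worker starts; applies to the whole process.
	// Example value only: each cached workflow holds its replayed state in memory.
	worker.SetStickyWorkflowCacheSize(4096)

	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	w := worker.New(c, "your-task-queue", worker.Options{})
	// Register workflows/activities and call w.Run(...) as usual.
	_ = w
}
```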
Protect your DB: regardless of DB size, Temporal has configuration settings that can help you avoid overloading it, which can cause bigger issues.
Useful dynamic configs for this (see the sketch after this list):
- frontend.persistenceMaxQPS (frontend persistence max QPS)
- history.persistenceMaxQPS (history persistence max QPS)
- matching.persistenceMaxQPS (matching persistence max QPS)
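A sketch of what capping these could look like in the same dynamic config file; the numbers are placeholders, and the right ceiling depends on what your RDS instance can actually sustain:

```yaml
# illustrative caps only: these bound the rate of persistence calls from each
# service role so the DB degrades gracefully instead of getting pegged
frontend.persistenceMaxQPS:
  - value: 2000
history.persistenceMaxQPS:
  - value: 6000
matching.persistenceMaxQPS:
  - value: 3000
```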
Watch your resource exhausted server metrics:
sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
The resource exhausted cause SystemOverloaded is typically related to DB overload.
Would also look at your persistence latencies:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
If you can share the graph, we can take a look at possible culprits and maybe give more recommendations.
Overall, your workload size heavily drives the DB size you need now and the size you can predict you will need in the near/far future as your workloads grow.
Note that the Temporal multi-cluster replication feature would allow you to set up a bigger cluster with a bigger DB and then migrate existing workloads to it; the new cluster can have a shard count that is a multiple of the one you are currently running, so upgrading to a bigger environment is possible.