Database CPU spike when multiple workflows are triggered simultaneously

We’re using AWS Aurora PostgreSQL (db.r6g.2xlarge) as the underlying database for our temporal-server. When around 6,000 workflows were triggered within 4 minutes, DB CPU utilization spiked. We’re seeking best practices to prevent such spikes.

Can we configure the Temporal service to throttle requests (e.g., to X requests/second)? If so, how?

Additionally, are there other recommended approaches (apart from upgrading the DB instance) to handle this more efficiently?

I would first check how much extra pressure underprovisioned SDK workers are putting on the DB:

```
sum(rate(persistence_requests{operation="CreateTask"}[1m]))
```
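To see which server service is generating the load, the same metric can be broken down by its service label (a sketch — the exact label name, here assumed to be `service_name`, depends on how your metrics pipeline is configured):

```
sum by (service_name) (rate(persistence_requests{operation="CreateTask"}[1m]))
```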

You can also look at adjusting the persistence QPS limits to protect your DB. The relevant dynamic config keys are:

  • Per host dynamic configs:
    • frontend.persistenceMaxQPS - default 2400
    • matching.persistenceMaxQPS - default 2400
    • history.persistenceMaxQPS - default 6000
  • Per service type dynamic configs:
    • history.persistenceGlobalMaxQPS - default 36000
  • Per shard dynamic configs:
    • history.persistencePerShardNamespaceMaxQPS - default 500
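
For example, you could lower the history limits in your dynamic config file (the file referenced by `dynamicConfigFilePath` in your server config). This is only a sketch — the values below are illustrative, not recommendations; tune them against your DB's observed capacity:

```yaml
history.persistenceMaxQPS:
  - value: 3000
history.persistenceGlobalMaxQPS:
  - value: 18000
history.persistencePerShardNamespaceMaxQPS:
  - value: 250
```

Note that lowering these limits trades DB protection for added latency: requests over the limit are throttled inside the Temporal server rather than hitting the database.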