Slow workflow completion rates in a Temporal cluster running on RDS

Issue

After stopping our Temporal cluster to let an RDS autovacuum complete and then restarting it, we're now experiencing severe performance degradation. The database is overwhelmed by DELETE operations against the history_node table, which is blocking normal workflow execution.

Sequence of Events

  1. RDS autovacuum triggered, throttling workflow executions
  2. Stopped the Temporal cluster to let the autovacuum complete
  3. Restarted the Temporal cluster after ~4-6 hours
  4. Now seeing massive numbers of DELETE operations against history_node consuming all RDS resources
  5. Workflow execution throughput has dropped dramatically

Current State

  • Database metrics show high Average Active Sessions (in the 40-50 range)
  • Active sessions are predominantly DELETE operations against the history_node table (see the diagnostic sketch after this list)
  • Normal workflow scheduling and execution appear to be blocked by the cleanup operations
  • We’re deleting existing non-critical workflows to reduce load
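
For anyone who wants to reproduce the observation, here is a minimal diagnostic sketch (the connection parameters are placeholders) that reads PostgreSQL's pg_stat_activity and pg_stat_user_tables views to surface the history_node DELETE sessions and the remaining dead-tuple backlog:

```python
# Diagnostic sketch: confirm that the active sessions are history_node DELETEs
# and check how much dead-tuple cleanup is still pending on that table.
# Connection parameters below are placeholders, not our real endpoint.
import psycopg2

conn = psycopg2.connect(
    host="your-rds-endpoint.rds.amazonaws.com",  # placeholder
    dbname="temporal",
    user="temporal",
    password="********",
)

with conn, conn.cursor() as cur:
    # Sessions currently running DELETEs against history_node
    # (excluding this diagnostic session itself).
    cur.execute(
        """
        SELECT pid, state, wait_event_type, wait_event,
               now() - query_start AS runtime,
               left(query, 80) AS query_snippet
        FROM pg_stat_activity
        WHERE query ILIKE '%DELETE%history_node%'
          AND pid <> pg_backend_pid()
        ORDER BY query_start
        """
    )
    for row in cur.fetchall():
        print(row)

    # Dead tuples, last autovacuum run, and total size for history_node.
    cur.execute(
        """
        SELECT relname, n_live_tup, n_dead_tup,
               last_autovacuum, last_autoanalyze,
               pg_size_pretty(pg_total_relation_size(relid)) AS total_size
        FROM pg_stat_user_tables
        WHERE relname = 'history_node'
        """
    )
    print(cur.fetchone())

conn.close()
```

Our working assumption is that these DELETE sessions are history retention cleanup that backed up while the cluster was stopped, which is why the questions below focus on pausing or throttling that cleanup.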

Critical Questions

  1. Is there a way to temporarily disable history cleanup until peak load subsides?
  2. What emergency measures can we take to prioritize workflow execution over history cleanup?
  3. Are there specific config parameters we should modify to recover from this state?

Environment

  • Database: Amazon RDS (PostgreSQL 13)
  • Temporal server version: 1.18