Issue
After stopping our Temporal cluster to allow an RDS autovacuum to complete, we’re now experiencing severe performance degradation. The cluster is overwhelmed with history_node DELETE operations, preventing normal workflow execution.
Sequence of Events
- RDS autovacuum triggered, throttling workflow executions
- Stopped the Temporal cluster to allow RDS to complete autovacuum
- Restarted Temporal cluster after ~4-6 hours
- Now seeing massive history_node DELETE operations consuming all RDS resources
- Workflow execution throughput has dropped dramatically
Current State
- Database metrics show high Average Active Sessions (40-50 range)
- Predominantly DELETE operations on the history_node table
- Normal workflow scheduling and execution appear blocked by the cleanup operations
- We’re deleting existing non-critical workflows to minimize load
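For reference, this is how we’re measuring the DELETE pressure, using a query against the standard pg_stat_activity view (table name as it appears in our Temporal persistence schema):

```sql
-- Count active sessions currently running deletes against history_node
SELECT count(*) AS active_history_deletes
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE 'DELETE FROM history_node%';
```

This consistently returns numbers in line with the Average Active Sessions figures above.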
Critical Questions
- Is there a way to temporarily disable history cleanup until peak load subsides?
- What emergency measures can we take to prioritize workflow execution over history cleanup?
- Are there specific config parameters we should modify to recover from this state?
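For the last question, the direction we’re considering is lowering the persistence QPS limits via dynamic config so cleanup can’t saturate RDS. Key names are from the Temporal dynamic config reference; the values below are guesses for our load, not recommendations, and we’d appreciate confirmation that these are the right knobs:

```yaml
# Cap persistence calls from the history service (where the
# history_node deletes originate); default is much higher.
history.persistenceMaxQPS:
  - value: 3000
    constraints: {}

# Keep frontend persistence traffic bounded as well so normal
# workflow starts still get through.
frontend.persistenceMaxQPS:
  - value: 2000
    constraints: {}
```

Is throttling at this layer safe while the DELETE backlog drains, or will it just stretch the recovery out without helping workflow throughput?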
Environment
- Database: Amazon RDS (PostgreSQL 13)
- Temporal server version: 1.18