Slow workflow completion rates in a Temporal cluster running on RDS

Issue

After stopping our Temporal cluster to let an RDS autovacuum complete and then restarting it, we're now experiencing severe performance degradation. The database is overwhelmed by DELETE operations against the history_node table, which is blocking normal workflow execution.

Sequence of Events

  1. RDS autovacuum triggered, throttling workflow executions
  2. Stopped the Temporal cluster to let the autovacuum complete
  3. Restarted the Temporal cluster after ~4-6 hours
  4. Now seeing massive numbers of DELETE operations against history_node consuming all RDS resources
  5. Workflow execution throughput has dropped dramatically

Current State

  • Database metrics show high Average Active Sessions (in the 40-50 range)
  • Active sessions are predominantly DELETE operations against the history_node table (see the diagnostic sketch after this list)
  • Normal workflow scheduling and execution appear to be blocked by the cleanup operations
  • We’re deleting existing non-critical workflows to reduce load
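
For anyone who wants to reproduce the observation, here is a minimal diagnostic sketch (the connection parameters are placeholders) that reads PostgreSQL's pg_stat_activity and pg_stat_user_tables views to surface the history_node DELETE sessions and the remaining dead-tuple backlog:

```python
# Diagnostic sketch: confirm that the active sessions are history_node DELETEs
# and check how much dead-tuple cleanup is still pending on that table.
# Connection parameters below are placeholders, not our real endpoint.
import psycopg2

conn = psycopg2.connect(
    host="your-rds-endpoint.rds.amazonaws.com",  # placeholder
    dbname="temporal",
    user="temporal",
    password="********",
)

with conn, conn.cursor() as cur:
    # Sessions currently running DELETEs against history_node
    # (excluding this diagnostic session itself).
    cur.execute(
        """
        SELECT pid, state, wait_event_type, wait_event,
               now() - query_start AS runtime,
               left(query, 80) AS query_snippet
        FROM pg_stat_activity
        WHERE query ILIKE '%DELETE%history_node%'
          AND pid <> pg_backend_pid()
        ORDER BY query_start
        """
    )
    for row in cur.fetchall():
        print(row)

    # Dead tuples, last autovacuum run, and total size for history_node.
    cur.execute(
        """
        SELECT relname, n_live_tup, n_dead_tup,
               last_autovacuum, last_autoanalyze,
               pg_size_pretty(pg_total_relation_size(relid)) AS total_size
        FROM pg_stat_user_tables
        WHERE relname = 'history_node'
        """
    )
    print(cur.fetchone())

conn.close()
```

Our working assumption is that these DELETE sessions are history retention cleanup that backed up while the cluster was stopped, which is why the questions below focus on pausing or throttling that cleanup.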

Critical Questions

  1. Is there a way to temporarily disable history cleanup until peak load subsides?
  2. What emergency measures can we take to prioritize workflow execution over history cleanup?
  3. Are there specific config parameters we should modify to recover from this state?

Environment

  • Database: Amazon RDS (PostgreSQL 13)
  • Temporal server version: 1.18