Hello team,
We run Temporal (Server version 1.18.5, UI version 2.10.3) as a self-hosted cluster in production.
Our persistence layer is split as follows:
- Executions DB – MongoDB 7 (stays online)
- Visibility DB – separate AWS Aurora PostgreSQL 12 cluster with 1 writer and 1 reader. We do not use advanced visibility (available in v1.20+), and since our server version is 1.18.5, we can't (and don't intend to) use dual visibility (available in v1.21+).
Because Aurora PostgreSQL 12 reached the end of standard support in early 2025, we're planning to upgrade the visibility DB cluster to PostgreSQL 16 using a blue/green deployment.
We expect ~5 minutes of downtime during the writer switchover, but want to be prepared for an extended outage (an hour or so) if something goes wrong.
My understanding is:
- Workflow and Activity execution is unaffected. All execution history still goes to the executions DB, so workers keep polling and making progress. These workflows and activities are highly critical, and we can't afford to lose any data here.
- While the visibility DB is unavailable:
  - The Temporal UI may return errors or empty results. We do not use `tctl` or query the visibility API directly from other sources, so we are not concerned about that.
  - The VisibilityManager logs many "failed to persist visibility updates" errors; the tasks queue up in the executions DB and are replayed once the visibility DB is back.
  - No user-data loss should occur; once visibility is back, the backlog drains and the UI/queries work again.
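To watch the error-then-replay behavior described above during the cutover, we plan to count visibility-related errors in the server logs. A minimal sketch, assuming JSON-formatted server logs; the exact message text varies by server version, so matching on "visibility" is an assumption:

```python
import json

def count_visibility_errors(log_lines):
    """Count ERROR-level log lines that mention visibility persistence."""
    count = 0
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. stack traces)
        if entry.get("level") == "error" and "visibility" in entry.get("msg", "").lower():
            count += 1
    return count

sample = [
    '{"level":"error","msg":"Failed to persist visibility updates"}',
    '{"level":"info","msg":"history scavenger pass complete"}',
    "not json",
]
print(count_visibility_errors(sample))  # 1
```

Once this counter returns to zero after the switchover, we would treat the backlog as drained.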
Questions
- Is the above understanding correct?
- If the outage stretches toward the hour mark, the `visibility_tasks` table (and the underlying queue shards) could balloon. How large can that backlog get before it affects the History service or overall cluster health? If disk usage becomes critical, is it safe to truncate or delete old rows, accepting that some executions would be permanently absent from search/UI? For context: ~10-20M workflow starts/day, retention = 30 days for most namespaces. Archival and Elasticsearch are disabled.
- Are there any hidden pitfalls (e.g., the namespace scanner, the DescribeWorkflowExecution fallback) if the visibility outage stretches past 60 minutes?
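For scale, here is the back-of-the-envelope arithmetic behind the backlog question. It assumes roughly two visibility tasks per workflow (one on start, one on close) — that multiplier is our assumption, and search-attribute/memo upserts would add more:

```python
# Rough estimate of visibility tasks queued in the executions DB
# while the visibility DB is down.
# Assumption (not from Temporal docs): ~2 visibility tasks per
# workflow (start + close); start rates are from our own traffic.

STARTS_PER_DAY_LOW = 10_000_000
STARTS_PER_DAY_HIGH = 20_000_000
TASKS_PER_WORKFLOW = 2  # assumed multiplier

def backlog_after(outage_hours: float, starts_per_day: int) -> int:
    """Approximate visibility tasks accumulated over the outage window."""
    starts_per_hour = starts_per_day / 24
    return int(starts_per_hour * TASKS_PER_WORKFLOW * outage_hours)

# Planned 5-minute switchover vs. a worst-case 1-hour outage.
for label, hours in [("5 min", 5 / 60), ("1 hour", 1.0)]:
    low = backlog_after(hours, STARTS_PER_DAY_LOW)
    high = backlog_after(hours, STARTS_PER_DAY_HIGH)
    print(f"{label}: ~{low:,} to ~{high:,} queued visibility tasks")
```

At our rates, a one-hour outage would queue on the order of one to two million tasks — roughly 5-10% of a normal day's writes — which is the scale we'd like the truncation/health question answered for.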
Thanks in advance for confirming (or correcting!) my assumptions.