Hello team,
We run Temporal (Server version 1.18.5, UI version 2.10.3) as a self-hosted cluster in production.
Our persistence layer is split as follows:
- Executions DB – MongoDB 7 (stays online)
- Visibility DB – separate AWS Aurora PostgreSQL 12 cluster with 1 writer and 1 reader. We do not use advanced visibility (available in v1.20+), and since our server version is 1.18.5, we can't (and don't intend to) use dual visibility (available in v1.21+).
Because Aurora PostgreSQL 12 reached the end of standard support in early 2025, we're planning to upgrade the visibility DB cluster to PostgreSQL 16 using a blue/green deployment.
We expect ~5 minutes of downtime during the writer switchover, but want to be prepared for an extended outage (an hour or so) if something goes wrong.
My understanding is:
- Workflow and Activity execution is unaffected. All execution history still goes to the executions DB, so workers keep polling and making progress. These workflows and activities are highly critical, and we can't afford to lose any data here.
- While the visibility DB is unavailable:
  - The Temporal UI may return errors or empty results. We do not use `tctl` or query the visibility API directly from other sources, so we are not concerned about that.
  - The VisibilityManager logs many "failed to persist visibility updates" errors; the tasks queue up in the executions DB and are replayed once the visibility DB is back.
  - No user-data loss should occur; once visibility is back, the backlog drains and the UI/queries work again.
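To watch the error-then-replay behavior described above during the cutover, we plan to count visibility-related errors in the server logs. A minimal sketch, assuming JSON-formatted server logs; the exact message text varies by server version, so matching on "visibility" is an assumption:

```python
import json

def count_visibility_errors(log_lines):
    """Count ERROR-level log lines that mention visibility persistence."""
    count = 0
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. stack traces)
        if entry.get("level") == "error" and "visibility" in entry.get("msg", "").lower():
            count += 1
    return count

sample = [
    '{"level":"error","msg":"Failed to persist visibility updates"}',
    '{"level":"info","msg":"history scavenger pass complete"}',
    "not json",
]
print(count_visibility_errors(sample))  # 1
```

Once this counter returns to zero after the switchover, we would treat the backlog as drained.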
Questions
- Is the above understanding correct?
- If the outage stretches toward the hour mark, the `visibility_tasks` table (and the underlying queue shards) could balloon. How large can that backlog get before it affects the History service or overall cluster health? If disk usage becomes critical, is it safe to truncate or delete old rows, accepting that some executions would be permanently absent from search/UI? For context: ~10-20M workflow starts/day, retention = 30 days for most namespaces. Archival and Elasticsearch are disabled.
- Are there any hidden pitfalls (e.g., the namespace scanner, the DescribeWorkflowExecution fallback) if the visibility outage stretches past 60 minutes?
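For scale, here is the back-of-the-envelope arithmetic behind the backlog question. It assumes roughly two visibility tasks per workflow (one on start, one on close) — that multiplier is our assumption, and search-attribute/memo upserts would add more:

```python
# Rough estimate of visibility tasks queued in the executions DB
# while the visibility DB is down.
# Assumption (not from Temporal docs): ~2 visibility tasks per
# workflow (start + close); start rates are from our own traffic.

STARTS_PER_DAY_LOW = 10_000_000
STARTS_PER_DAY_HIGH = 20_000_000
TASKS_PER_WORKFLOW = 2  # assumed multiplier

def backlog_after(outage_hours: float, starts_per_day: int) -> int:
    """Approximate visibility tasks accumulated over the outage window."""
    starts_per_hour = starts_per_day / 24
    return int(starts_per_hour * TASKS_PER_WORKFLOW * outage_hours)

# Planned 5-minute switchover vs. a worst-case 1-hour outage.
for label, hours in [("5 min", 5 / 60), ("1 hour", 1.0)]:
    low = backlog_after(hours, STARTS_PER_DAY_LOW)
    high = backlog_after(hours, STARTS_PER_DAY_HIGH)
    print(f"{label}: ~{low:,} to ~{high:,} queued visibility tasks")
```

At our rates, a one-hour outage would queue on the order of one to two million tasks — roughly 5-10% of a normal day's writes — which is the scale we'd like the truncation/health question answered for.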
Thanks in advance for confirming (or correcting!) my assumptions.