"History scavenger does nothing on XDC standby cluster — how is secondary DB size supposed to stay bounded?"

jmo · May 22, 2026, 7:56pm

We’re running Temporal 1.31 with XDC replication across two Aurora PostgreSQL clusters. Our secondary is 3.6TB vs our primary at 419GB after months of healthy replication.

We can see temporal-sys-history-scanner-workflow starting on the secondary, but the task queue scavenger immediately cleans up temporal-sys-history-scanner-taskqueue because no workers are polling it — our application workers only connect to the primary. The scanner starts, schedules activities, they never get picked up.

Is the history scavenger expected to be a no-op on standby clusters? If so, what’s the intended mechanism for keeping secondary DB size bounded over time?

Yimin_Chen · May 27, 2026, 4:17pm

Could you help list the largest tables from active cluster and compare them with standby cluster?

jmo · May 27, 2026, 6:39pm

table	primary	secondary
`history_node`	344 GB	3,474 GB
`history_tree`	12 GB	82 GB
`executions`	38 GB	48 GB
`current_executions`	12 GB	16 GB
everything else	~30 GB	~30 GB

jiechenz · May 27, 2026, 11:45pm

Hi, thanks for sharing the table sizes.

Answers to your questions:

The history scavenger is NOT expected to be a no-op on standby clusters. It runs on every cluster independently.
The scavenger does not use your application workers; it is executed by Temporal’s system worker.

The cleanup mechanism on each cluster should work as:

Retention-triggered deletion deletes the mutable state.
The scavenger cleans up orphans left behind in history_tree.

From the table sizes, it’s likely that orphan branches are accumulating and the scavenger isn’t reclaiming them.

Could you help verify the following so we can confirm before changing anything in prod:

  SELECT
      COUNT(*)                                                   AS total_branches,
      COUNT(*) FILTER (WHERE e.run_id IS NULL)                   AS orphan_branches,
      ROUND(100.0 * COUNT(*) FILTER (WHERE e.run_id IS NULL)
                    / NULLIF(COUNT(*), 0), 2)                    AS pct_orphan
  FROM history_tree ht
  LEFT JOIN executions e ON e.run_id = ht.tree_id;

Please run the SQL on both the active and passive sides; we should expect a high percentage of orphan branches on passive.
Check the scavenger task queue and workflow on secondary.

temporal --address <passive-cluster> task-queue describe \
    --namespace temporal-system \
    --task-queue temporal-sys-history-scanner-taskqueue-0

temporal --address <passive-cluster> workflow describe \
    --namespace temporal-system \
    --workflow-id temporal-sys-history-scanner

Note the -0 suffix in the task queue name.

Check scavenger metrics on the secondary. Prometheus queries you could use:

sum by (cluster) (rate(scavenger_success_total{operation="HistoryScavenger"}[30m]))

sum by (cluster) (rate(scavenger_errors_total{operation="HistoryScavenger"}[30m]))

Thank you!

jmo · June 1, 2026, 11:06pm

Thanks — the scavenger is running fine on the secondary. Here’s what we found (current as of today):

	total_branches	orphan_branches	pct_orphan
primary	7,425,622	165,385	2.23%
secondary	170,095,104	214,827	0.13%

The secondary has a lower orphan rate than primary — the scavenger is working. The size difference is not orphans.

The ~170M non-orphan branches on secondary all have matching rows in executions on secondary, so the scavenger correctly leaves them alone. But primary only has 7.4M total branches — meaning primary has already scavenged history for ~162M workflows that secondary still holds in full.

So the question becomes: why doesn’t secondary’s retention timer clean up those ~162M non-orphan branches after the retention period expires? Our understanding is that the standby timer queue executor discards retention timer tasks rather than executing them (timer_queue_standby_task_executor.go). Is that correct, and if so, is Archival the intended solution for bounding secondary DB size?

jiechenz · June 2, 2026, 6:32am

Thanks for running the queries. The data is actually really informative, and it confirms the scavenger is running fine on secondary.
Comparing with your previous size table, it’s likely that you have roughly the same workflow count on both clusters, but many more branches per workflow on the secondary. To confirm, could you run this on both clusters:

  SELECT
    (SELECT COUNT(*) FROM executions)                  AS workflow_rows,
    (SELECT COUNT(DISTINCT tree_id) FROM history_tree) AS unique_trees,
    (SELECT COUNT(*) FROM history_tree)                AS branch_rows,
    ROUND(1.0 * (SELECT COUNT(*) FROM history_tree)
              / NULLIF((SELECT COUNT(DISTINCT tree_id) FROM history_tree), 0), 2)
      AS avg_branches_per_workflow;

On the other two points you raised:

I think retention is running on secondary. If it weren’t, executions would be far larger than 48 GB.
Archival isn’t the size-bounding mechanism here. It offloads closed-workflow history to long-term storage; it doesn’t touch in-flight branches, that is, branches that can still be linked to an execution.

If the SQL confirms the per-workflow branch skew, I’d want to narrow down what’s driving it. Some quick questions on your operational profile over the last few months:

Any temporal workflow reset calls (in tooling or ad-hoc) — roughly what frequency?
Any namespace failovers or active-cluster changes — frequency, and were workflows progressing during the cutover?
Typical workflow lifetime: short (minutes/hours) or long-lived (days/weeks)?

Thanks!

jmo · June 2, 2026, 11:53pm

Results:

	workflow_rows	unique_trees	branch_rows	avg_branches/workflow
primary	7,315,379	7,486,761	7,487,013	1.00
secondary	22,028,854	22,031,597	169,843,823	7.71

Primary has 1.00 branches per workflow — secondary has 7.71. Confirmed skew.

On your three questions:

Resets: two mechanisms. Manual resets happen roughly once/week, and an automated detector resets scheduler workflows when they get genuinely wedged (pending task stuck in SCHEDULED state for 5+ minutes, with a 1-hour cooldown per workflow). The auto-reset fires on a rare failure condition, not routinely. Not sure it ever fires often enough to explain the 7.71x branch accumulation?
Failovers: none — active cluster has been constant on primary the entire time.
Workflow lifetime: minutes to hours, short-lived.

Given resets are rare, we’re not sure what’s driving the branch accumulation on secondary. Happy to run any additional queries to help narrow it down.

jiechenz · June 3, 2026, 6:04pm

Thanks, it confirms that there are two things we need to chase down:

Secondary has ~7× more branches per workflow.
Secondary also has ~3× more workflow rows (22 M vs 7.3 M).

Could you help check the following:

What is the retention time on both clusters? Could you run namespace describe on your top namespaces? Please redact any PII as needed.

  temporal --address <primary>   operator namespace describe <top-ns> --output json
  temporal --address <secondary> operator namespace describe <top-ns> --output json

Can you check the count of workflows past retention?

  -- run on BOTH clusters
  SELECT
    state,
    COUNT(*) AS rows,
    COUNT(*) FILTER (WHERE close_time < NOW() - INTERVAL '30 days') AS past_retention
  FROM executions
  GROUP BY state;

Q: Did you run any manual workflow deletions?

jmo · June 4, 2026, 5:34pm

Retention is 1080h (45 days) on both clusters, configured identically. One thing we noticed is that secondary shows ReplicationConfig.State: Unspecified while primary shows Normal. Not sure if that’s significant.

Unfortunately we can’t run the past-retention query as written against our schema:

close_time doesn’t exist as a column in executions
state is a protobuf blob, so GROUP BY state returns millions of unique rows rather than a useful breakdown by status

Is there an alternative query we can run to get the same information?

On manual deletions: we have an automated janitor that runs batch deletes via the Temporal API every 15 minutes on completed workflows. No ad-hoc manual deletions beyond that.

jiechenz · June 4, 2026, 8:59pm

Ah — I think we’ve found the major driver of the 3× workflow-row gap on secondary.

DeleteWorkflowExecution API calls aren’t replicated to passive clusters by default. We only recently addressed this in 1.31 (Replicate workflow deleteion by jiechenz · Pull Request #9717 · temporalio/temporal · GitHub), but it’s still behind a feature flag, history.enableDeleteWorkflowExecutionReplication, which defaults to false. So your janitor has been cleaning up workflows on primary only, while secondary holds onto them until the 45-day retention timer fires.

If you’d like to verify quickly: you could grab any workflow ID from your janitor’s audit log that should have been deleted recently, and check whether it still exists on the passive cluster:

tdbg --address <passive-endpoint> workflow describe --workflow-id <wf-id> --run-id <run-id>, or a direct query on the passive DB.

If it’s still there on passive but gone from primary → confirmed.

Suggested next steps:

Enable the flag on your clusters: set the feature flag to true

history.enableDeleteWorkflowExecutionReplication: true

That’ll make future janitor runs propagate to passive correctly, stopping the gap from growing any further.

Clean up the historical backlog on passive.
I’m guessing you don’t want to wait 45 days for it to drain naturally. The cleanest option is a one-shot script that mirrors your janitor’s delete pattern against secondary’s admin endpoint.

One important note: make sure your script targets secondary directly and bypasses DCRedirection, otherwise the calls will get forwarded back to primary and won’t do anything useful on secondary. The header to disable redirection is:
--grpc-meta xdc-redirection=false
Once the above is done, let’s revisit DB sizes and tackle the 7× branches-per-workflow issue separately.
On the namespace describe output: could you double-check that active_cluster_name and the clusters list match across both sides? If they’re out of sync, that’d point to something like an active-active situation, which would be a real concern. If those two match, the ReplicationState: UNSPECIFIED flag is most likely benign (probably a metadata-format artifact) and we can come back to it later.

Let us know how the clean up goes and happy to help you with any questions!

jmo · June 4, 2026, 9:51pm

Verified. Here’s what we found:

1. Janitor gap confirmed

We sampled 1,000 executions rows from secondary shard 5 and cross-checked against primary’s persistence layer. 327/1,000 (32.7%) exist on secondary but are absent from primary - the janitor deleted them from primary, secondary still holds them.

2. Namespace config matches across both sides

ActiveClusterName: hub-alpha        (both)
Clusters: [hub-alpha, hub-beta]     (both)
ReplicationConfig.State: Normal     (primary)
ReplicationConfig.State: Unspecified (secondary)

Not active-active. The Unspecified state on secondary is the only difference — flagging it in case it’s relevant.

Next steps from our side:

We’ve already enabled history.enableDeleteWorkflowExecutionReplication: true on dev primary. We’ll roll it to staging and production once we confirm it’s working.

For the historical backlog, you mentioned a one-shot cleanup script targeting secondary directly with --grpc-meta xdc-redirection=false. Could you share more detail on what that looks like? Specifically: does it use tdbg or the standard temporal CLI, and is there a way to scope it to workflows older than a given date to avoid touching anything active?

jiechenz · June 5, 2026, 12:40am

Thanks for checking!

To clean up on passive side, two options I can think of:

Create a shell script that runs temporal workflow list (with xdc-redirection=false) to get the list of workflow IDs and run IDs. You could also run a DB query on passive to get the same info. Then use tdbg to delete the workflows on passive only.
Please see the following examples:

# list workflows from passive
temporal --address <passive-endpoint> workflow list \
    --namespace <your-namespace> \
    --query 'ExecutionStatus != "Running" AND CloseTime < "2026-06-05T00:00:00Z"' \
    --grpc-meta xdc-redirection=false \
    --output json \
  | jq -r '.[] | "\(.execution.workflowId)\t\(.execution.runId)"'

# delete on passive
./tdbg --address <passive-endpoint> --namespace <your-namespace> workflow delete --workflow-id <wf-id> --run-id <run-id>

Note: validate on a small slice first, then ramp up gradually while watching DB load on passive.

Create a cleanup workflow. Given the size of the backlog, I think it’s a more production-grade approach. Please see the following sketch as an example.

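CleanupWorkflow

  import (
      "context"
      "errors"

      commonpb "go.temporal.io/api/common/v1"
      "go.temporal.io/api/serviceerror"
      "go.temporal.io/api/workflowservice/v1"
      "go.temporal.io/server/api/adminservice/v1"
      "google.golang.org/grpc/metadata"
  )

  // noRedirect adds the header that asks the passive frontend to handle the
  // call locally instead of forwarding it to the active cluster.
  func noRedirect(ctx context.Context) context.Context {
      return metadata.AppendToOutgoingContext(ctx, "xdc-redirection", "false")
  }

  type WorkflowKey struct{ Namespace, WorkflowID, RunID string }

  // ListInput and ListOutput were not defined in the sketch; these shapes are
  // inferred from how the fields are used below.
  type ListInput struct {
      Namespace string
      Query     string
      PageSize  int
      Cursor    []byte
  }

  type ListOutput struct {
      Items      []WorkflowKey
      NextCursor []byte
  }

  type Activities struct {
      PassiveWF    workflowservice.WorkflowServiceClient // passive frontend to list workflows
      PassiveAdmin adminservice.AdminServiceClient       // passive admin to delete workflows
  }

  // ListPassive returns one page of closed workflows from the passive cluster.
  func (a *Activities) ListPassive(ctx context.Context, in ListInput) (*ListOutput, error) {
      resp, err := a.PassiveWF.ListWorkflowExecutions(noRedirect(ctx),
          &workflowservice.ListWorkflowExecutionsRequest{
              Namespace:     in.Namespace,
              Query:         in.Query, // "ExecutionStatus!='Running' AND CloseTime<'2026-05-01T00:00:00Z'"
              PageSize:      int32(in.PageSize),
              NextPageToken: in.Cursor,
          })
      if err != nil {
          return nil, err
      }
      out := &ListOutput{NextCursor: resp.NextPageToken}
      for _, e := range resp.Executions {
          out.Items = append(out.Items, WorkflowKey{
              Namespace: in.Namespace, WorkflowID: e.Execution.WorkflowId, RunID: e.Execution.RunId,
          })
      }
      return out, nil
  }

  // DeletePassive deletes one execution on the passive cluster only.
  func (a *Activities) DeletePassive(ctx context.Context, key WorkflowKey) error {
      // Send the no-redirect header here too; it is harmless if the admin service ignores it.
      _, err := a.PassiveAdmin.DeleteWorkflowExecution(noRedirect(ctx),
          &adminservice.DeleteWorkflowExecutionRequest{
              Namespace: key.Namespace,
              Execution: &commonpb.WorkflowExecution{WorkflowId: key.WorkflowID, RunId: key.RunID},
          })
      // Treat NotFound as success so retries stay idempotent. This assumes the
      // client connection converts gRPC statuses into serviceerror types.
      var notFound *serviceerror.NotFound
      if err == nil || errors.As(err, &notFound) {
          return nil
      }
      return err
  }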

And a cleanup workflow to run the activities.

jmo · June 6, 2026, 12:06am

Thanks for the tips, made some real progress! Where we’re at:

Root cause confirmed (Issue 1 — janitor gap)

We confirmed history.enableDeleteWorkflowExecutionReplication: true is working correctly. We sampled a workflow deleted by the janitor after the flag was enabled and verified it is absent from both primary and secondary persistence. The 32.7% gap we measured earlier was entirely pre-flag historical backlog, not new accumulation.

Cleanup tool

We initially tried two shell-based approaches that both failed:

tdbg execution delete — only deletes mutable state and history, not visibility. We confirmed this empirically: after 871K deletions the visibility count barely moved, and workflows kept reappearing in list queries causing the script to loop past 150% of the expected count.
temporal workflow delete (CLI/frontend API) — the frontend checks cluster ownership before accepting the delete. On a passive standby it returns “workflow execution not found” and rejects the call even with xdc-redirection=false.

We ended up writing a Go tool using adminservice.DeleteWorkflowExecution with xdc-redirection=false as you suggested in post #12. We verified deletions propagate correctly to all three stores: secondary executions, secondary executions_visibility, and primary is untouched.

We’re currently running the tool across all ~50 namespaces on staging hub-beta (5 in parallel, ~550k workflows each, ~1hr per namespace). 11 namespaces complete so far, 0 errors.

Issue 2 — branch accumulation

We don’t have new findings on the 7.71x branches/workflow discrepancy yet. Still awaiting your guidance — is there a known mechanism by which hub-beta would retain non-orphaned history branches that hub-alpha’s scavenger has already cleaned?

jiechenz · June 9, 2026, 1:33am

Nice, good to hear!

Cleanup tool

tdbg workflow delete does delete visibility via the admin client (service/history/api/forcedeleteworkflowexecution/api.go:118-121). What you ran into is likely a known race with async visibility-task processing. Anyway it should be used only for small batches.
The workflow delete/describe shouldn’t have the active-cluster gate. Could you check which version of the temporal CLI you are using? I suspect some older versions of the CLI don’t propagate --grpc-meta properly.

Backlog Cleanup

Once you’ve finished draining the stale workflows on passive, please allow the history scavenger some time to do at least one full pass to clean up the branches. After that, could you rerun our queries to examine the table sizes? If the skew persists after the drain and scavenger pass, we’ll dig deeper into NDC fork production rate. Thanks!

jmo · June 9, 2026, 10:45pm

Update: Scavenger is running — but finding almost no orphans

We finished draining all pre-flag stale workflows from stg-hubb on June 6 (~28M deletions across 51 namespaces using adminservice.DeleteWorkflowExecution). The history scavenger started a new pass on June 6 and is actively running:

Progress: Shard 82/512 (16%), ~3 days elapsed
Pages scanned: 254,460
SkipCount: 25,439,242 (branches kept as non-orphans)
SuccessCount: 6,696 (orphans deleted)
Orphan rate: 0.026%
Est. completion: ~15 more days at current pace

The branches-per-workflow query now shows 15.40x on secondary (up from 7.71x before cleanup) — the ratio worsened because we deleted the executions rows on stg-hubb but the history_tree rows remain, so unique_trees dropped while branch_rows stayed at 156M.

The puzzling part: after deleting 28M executions rows from stg-hubb secondary, we expected the corresponding history_tree rows to appear as orphans to the scavenger. But SkipCount is 25M with only 6,696 deletions — it’s classifying almost everything as non-orphan.

Is the orphan check joining against the local executions table only, or does it factor in replication state or primary cluster data? We want to make sure the scavenger will actually reclaim the 156M branch rows once it completes its pass, rather than skipping them all.

jiechenz · June 10, 2026, 10:02pm

Thanks for the update!
It’s my bad; the scavenger has an age filter with a default minAge of 60 days.

github.com/temporalio/temporal · service/worker/scanner/history/scavenger.go · 5b49acfaf

      if !s.isInTest {
          activity.RecordHeartbeat(ctx, s.hbd)
      }
  }

  func (s *Scavenger) filterTask(
      branch persistence.HistoryBranchDetail,
  ) *taskDetail {

      if time.Now().UTC().Add(-s.historyDataMinAge()).Before(timestamp.TimeValue(branch.ForkTime)) {
          metrics.HistoryScavengerSkipCount.With(s.metricsHandler).Record(1)

          s.Lock()
          defer s.Unlock()
          s.hbd.SkipCount++
          return nil
      }

      namespaceID, workflowID, runID, err := persistence.SplitHistoryGarbageCleanupInfo(branch.Info)
      if err != nil {
That explains why it has skipped most of the records.

worker.historyScannerDataMinAge:
   - value: "1h"

Could you try updating the config to a shorter period and rerun the scavenger?
Thanks!

jiechenz · June 15, 2026, 4:01pm

Hi @jmo , just checking in on this. Were you able to try the suggested solution and did it help resolve the issue?
I am happy to help further if it’s still ongoing.

jmo · June 15, 2026, 5:22pm

Yes we made quite a bit of progress! I wanted to let the scavenger run over the weekend to get something closer to real numbers before posting.

We applied worker.historyScannerDataMinAge: 1h to all three clusters and the difference was immediate. Before the change, the scavenger was scanning 25M branches and deleting only ~6,700 (0.026% orphan rate). After the change, it’s deleting orphans rapidly.

Current state on cluster2 (most data, best signal):

Scavenger at shard 141/512 (28%), 32.8M orphans deleted
avg_branches_per_workflow: 15.40x → 13.41x and dropping
cluster1: 24.9M deletions across shard 38/512

One follow-up question: is there a reason the default minAge is 60 days? With a default 45-day retention window, the scavenger won’t clean orphans created within the retention period. They only become eligible 60 days after fork, by which point retention already fired weeks ago. Is this much shorter default generally safe to use for prod?

We’ll post final numbers once the scavenger completes its full pass. Ultimately is our victory condition that avg_branches_per_workflow settles at around 1.0?

jiechenz · June 15, 2026, 10:34pm

Glad the change took effect. Two answers:

On the 60-day default: that value is from an older setup that hasn’t been revisited in a while. For your case, you can use a much shorter period: 1h is fine for the drain, and a value of one week or less is reasonable as a comfortable ops margin.
On the end goal: with resets rare and no failovers, secondary should settle close to primary. The scavenger only reclaims branches whose workflow is deleted; non-current NDC fork branches attached to live workflows aren’t touchable until the workflow closes. That tail of in-flight branches keeps secondary slightly above primary even in healthy steady state.

That said, getting to that range might take multiple scavenger cycles.

Looking forward to the final numbers — thanks!

jiechenz · June 24, 2026, 8:49pm

Hi @jmo, checking in again. How do the numbers look after the full cleanup? Thanks
