We’re running Temporal 1.31 with XDC replication across two Aurora PostgreSQL clusters. Our secondary is 3.6TB vs our primary at 419GB after months of healthy replication.
We can see temporal-sys-history-scanner-workflow starting on the secondary, but the task queue scavenger immediately cleans up temporal-sys-history-scanner-taskqueue because no workers are polling it — our application workers only connect to the primary. The scanner starts, schedules activities, they never get picked up.
Is the history scavenger expected to be a no-op on standby clusters? If so, what’s the intended mechanism for keeping secondary DB size bounded over time?
The history scavenger is not expected to be a no-op on standby clusters. It runs on every cluster independently.
The scavenger does not use your application workers. It is executed by Temporal’s system worker.
The cleanup mechanism on each cluster should work as follows:
Retention-triggered deletion deletes the mutable state.
The scavenger cleans up orphans left behind in history_tree.
From the table sizes, it’s likely that orphan branches are accumulating and the scavenger isn’t reclaiming them.
Could you verify the following so we can confirm before changing anything in prod:
SELECT
  COUNT(*) AS total_branches,
  COUNT(*) FILTER (WHERE e.run_id IS NULL) AS orphan_branches,
  ROUND(100.0 * COUNT(*) FILTER (WHERE e.run_id IS NULL)
        / NULLIF(COUNT(*), 0), 2) AS pct_orphan
FROM history_tree ht
LEFT JOIN executions e ON e.run_id = ht.tree_id;
Can you run the SQL on both the active and passive sides? We should expect a high percentage of orphan branches on passive.
Check the scavenger task queue and workflow on secondary.
Check scavenger metrics on the secondary. Prometheus queries you could use:
sum by (cluster) (rate(scavenger_success_total{operation="HistoryScavenger"}[30m]))
sum by (cluster) (rate(scavenger_errors_total{operation="HistoryScavenger"}[30m]))
Thanks — the scavenger is running fine on the secondary. Here’s what we found (current as of today):
| cluster | total_branches | orphan_branches | pct_orphan |
|---|---|---|---|
| primary | 7,425,622 | 165,385 | 2.23% |
| secondary | 170,095,104 | 214,827 | 0.13% |
The secondary has a lower orphan rate than primary — the scavenger is working. The size difference is not orphans.
The ~170M non-orphan branches on secondary all have matching rows in executions on secondary, so the scavenger correctly leaves them alone. But primary only has 7.4M total branches — meaning primary has already scavenged history for ~162M workflows that secondary still holds in full.
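As a quick sanity check on the figures above, here is a minimal Go sketch that recomputes the orphan percentages and the branch gap from the reported counts. The type and function names (`branchStats`, `orphanPct`, `extraBranches`, `summary`) are made up for illustration and are not part of Temporal.

```go
package main

import "fmt"

// branchStats holds the two counts returned by the orphan-branch query for one cluster.
type branchStats struct {
	Total   int64 // rows in history_tree
	Orphans int64 // rows with no matching executions row
}

// orphanPct formats orphan branches as a percentage of all branches, to two decimals.
func orphanPct(s branchStats) string {
	if s.Total == 0 {
		return "n/a"
	}
	return fmt.Sprintf("%.2f%%", 100*float64(s.Orphans)/float64(s.Total))
}

// extraBranches is how many more branch rows a holds than b.
func extraBranches(a, b branchStats) int64 {
	return a.Total - b.Total
}

// summary renders the comparison for the counts quoted in this thread.
func summary() string {
	primary := branchStats{Total: 7425622, Orphans: 165385}
	secondary := branchStats{Total: 170095104, Orphans: 214827}
	return fmt.Sprintf("primary %s, secondary %s, extra on secondary %d",
		orphanPct(primary), orphanPct(secondary), extraBranches(secondary, primary))
}
```

The gap works out to 162,669,482 branch rows, which is where the ~162M figure comes from.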
So the question becomes: why doesn’t secondary’s retention timer clean up those ~162M non-orphan branches after the retention period expires? Our understanding is that the standby timer queue executor discards retention timer tasks rather than executing them (timer_queue_standby_task_executor.go). Is that correct, and if so, is Archival the intended solution for bounding secondary DB size?
Thanks for running the queries. The data is actually really informative, and it confirms the scavenger is running fine on secondary.
Comparing with your previous size table, it’s likely that you have roughly the same workflow count on both clusters, but many more branches per workflow on the secondary. To confirm, could you run this on both clusters:
SELECT
  (SELECT COUNT(*) FROM executions) AS workflow_rows,
  (SELECT COUNT(DISTINCT tree_id) FROM history_tree) AS unique_trees,
  (SELECT COUNT(*) FROM history_tree) AS branch_rows,
  ROUND(1.0 * (SELECT COUNT(*) FROM history_tree)
        / NULLIF((SELECT COUNT(DISTINCT tree_id) FROM history_tree), 0), 2)
    AS avg_branches_per_workflow;
On the other two points you raised:
I think retention is running on secondary. If it weren’t, executions would be far larger than 48 GB.
Archival isn’t the size-bounding mechanism here. It offloads closed-workflow history to long-term storage; it doesn’t touch in-flight branches, that is, branches that still link to an execution.
If the SQL confirms the per-workflow branch skew, I’d want to narrow down what’s driving it. Some quick questions on your operational profile over the last few months:
Any temporal workflow reset calls (in tooling or ad-hoc) — roughly what frequency?
Any namespace failovers or active-cluster changes — frequency, and were workflows progressing during the cutover?
Typical workflow lifetime: short (minutes/hours) or long-lived (days/weeks)?
Primary has 1.00 branches per workflow — secondary has 7.71. Confirmed skew.
On your three questions:
Resets: two mechanisms. Manual resets happen roughly once/week, and an automated detector resets scheduler workflows when they get genuinely wedged (pending task stuck in SCHEDULED state for 5+ minutes, with a 1-hour cooldown per workflow). The auto-reset fires on a rare failure condition, not routinely. Not sure it ever fires enough to explain the 7.71x branch accumulation?
Failovers: none — active cluster has been constant on primary the entire time.
Workflow lifetime: minutes to hours, short-lived.
Given resets are rare, we’re not sure what’s driving the branch accumulation on secondary. Happy to run any additional queries to help narrow it down.
Can you check the count of workflows past retention?
-- run on BOTH clusters
SELECT
  state,
  COUNT(*) AS rows,
  COUNT(*) FILTER (WHERE close_time < NOW() - INTERVAL '30 days') AS past_retention
FROM executions
GROUP BY state;
Retention is 1080h (45 days) on both clusters, configured identically. One thing we noticed is that secondary shows ReplicationConfig.State: Unspecified while primary shows Normal. Not sure if that’s significant.
Unfortunately we can’t run the past-retention query as written against our schema:
close_time doesn’t exist as a column in executions
state is a protobuf blob, so GROUP BY state returns millions of unique rows rather than a useful breakdown by status
Is there an alternative query we can run to get the same information?
On manual deletions: we have an automated janitor that runs batch deletes via the Temporal API every 15 minutes on completed workflows. No ad-hoc manual deletions beyond that.
Ah — I think we’ve found the major driver of the 3× workflow-row gap on secondary.
DeleteWorkflowExecution API calls aren’t replicated to passive clusters by default. We only recently addressed this in 1.31 (temporalio/temporal Pull Request #9717, “Replicate workflow deleteion” by jiechenz), but it’s still behind a feature flag, history.enableDeleteWorkflowExecutionReplication, which defaults to false. So your janitor has been cleaning up workflows on primary only, while secondary holds onto them until the 45-day retention timer fires.
If you’d like to verify quickly: grab any workflow ID from your janitor’s audit log that should have been deleted recently, and check whether it still exists on the passive cluster:
tdbg --address <passive-endpoint> workflow describe --workflow-id <wf-id> --run-id <run-id>
Alternatively, run a direct query on the passive DB.
If it’s still there on passive but gone from primary → confirmed.
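The same check can be extended from a single workflow to a sample. Below is a minimal Go sketch, assuming you already have a list of keys sampled from passive and some way to ask primary whether a key exists. The `exists` callback is a hypothetical stand-in for that lookup, and all names here are illustrative, not Temporal APIs.

```go
package main

import "fmt"

// workflowKey identifies one workflow run.
type workflowKey struct{ WorkflowID, RunID string }

// missingOn returns the keys from sample for which exists reports false.
// exists is a hypothetical lookup against the other cluster.
func missingOn(sample []workflowKey, exists func(workflowKey) bool) []workflowKey {
	var missing []workflowKey
	for _, k := range sample {
		if !exists(k) {
			missing = append(missing, k)
		}
	}
	return missing
}

// missingShare formats missing/sample as a count and a percentage.
func missingShare(missing, sample int) string {
	if sample == 0 {
		return "n/a"
	}
	return fmt.Sprintf("%d/%d (%.1f%%)", missing, sample,
		100*float64(missing)/float64(sample))
}
```

A large share of keys present on passive but missing on primary would point to deletes that never reached passive.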
Suggested next steps:
Enable the flag on your clusters: set history.enableDeleteWorkflowExecutionReplication to true.
That’ll make future janitor runs propagate to passive correctly, stopping the gap from growing any further.
Clean up the historical backlog on passive.
I’m guessing you don’t want to wait 45 days for it to drain naturally. The cleanest option is a one-shot script that mirrors your janitor’s delete pattern against secondary’s admin endpoint.
One important note: make sure your script targets secondary directly and bypasses DCRedirection, otherwise the calls will get forwarded back to primary and won’t do anything useful on secondary. The header to disable redirection is: --grpc-meta xdc-redirection=false
Once the above is done, let’s revisit DB sizes and tackle the 7× branches-per-workflow issue separately.
On the namespace describe output: could you double-check that active_cluster_name and the clusters list match across both sides? If they’re out of sync, that’d point to something like an active-active situation, which would be a real concern. If those two match, the ReplicationState: UNSPECIFIED flag is most likely benign (probably a metadata-format artifact) and we can come back to it later.
Let us know how the cleanup goes. Happy to help with any questions!
We sampled 1,000 executions rows from secondary shard 5 and cross-checked against primary’s persistence layer. 327/1,000 (32.7%) exist on secondary but are absent from primary - the janitor deleted them from primary, secondary still holds them.
Not active-active. The Unspecified state on secondary is the only difference — flagging it in case it’s relevant.
Next steps from our side:
We’ve already enabled history.enableDeleteWorkflowExecutionReplication: true on dev primary. We’ll roll it to staging and production once we confirm it’s working.
For the historical backlog, you mentioned a one-shot cleanup script targeting secondary directly with --grpc-meta xdc-redirection=false. Could you share more detail on what that looks like? Specifically: does it use tdbg or the standard temporal CLI, and is there a way to scope it to workflows older than a given date to avoid touching anything active?
To clean up on the passive side, there are two options I can think of:
Create a shell script that runs temporal workflow list (with xdc-redirection=false) to get the workflow IDs and run IDs. You can also get this information from a DB query on passive. Then use tdbg to delete each workflow on passive only.
Please see the following examples:
# list workflows from passive
temporal --address <passive-endpoint> workflow list \
--namespace <your-namespace> \
--query 'ExecutionStatus != "Running" AND CloseTime < "2026-06-05T00:00:00Z"' \
--grpc-meta xdc-redirection=false \
--output json \
| jq -r '.[] | "\(.execution.workflowId)\t\(.execution.runId)"'
# delete on passive
./tdbg --address <passive-endpoint> --namespace <your-namespace> workflow delete --workflow-id <wf-id> --run-id <run-id>
Note: validate on a small slice first, then ramp up gradually while watching DB load on passive.
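On scoping by date: the cutoff in the list query above is the only thing that changes between runs, so it can help to build the query from a validated timestamp rather than editing the string by hand. A minimal Go sketch follows. The helper name `closedBeforeQuery` is hypothetical; the query text is the one from the list example above.

```go
package main

import (
	"fmt"
	"time"
)

// closedBeforeQuery builds the visibility query used in the list example,
// after checking that cutoff is a valid RFC 3339 timestamp. The cutoff is
// normalized to UTC so the query always carries a "Z" suffix.
func closedBeforeQuery(cutoff string) (string, error) {
	t, err := time.Parse(time.RFC3339, cutoff)
	if err != nil {
		return "", fmt.Errorf("invalid cutoff %q: %w", cutoff, err)
	}
	return fmt.Sprintf(`ExecutionStatus != "Running" AND CloseTime < "%s"`,
		t.UTC().Format(time.RFC3339)), nil
}
```

Rejecting a malformed cutoff up front avoids sending a query whose date filter is not what you intended.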
Create a cleanup workflow. Given the size of the backlog, I think it’s the more production-grade approach. Please see the following sketch as an example.
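CleanupWorkflow
package cleanup

import (
	"context"
	"errors"

	commonpb "go.temporal.io/api/common/v1"
	"go.temporal.io/api/serviceerror"
	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/server/api/adminservice/v1"
	"google.golang.org/grpc/metadata"
)

// noRedirect adds the header that disables DC redirection, so the call is
// served by the cluster it was sent to.
func noRedirect(ctx context.Context) context.Context {
	return metadata.AppendToOutgoingContext(ctx, "xdc-redirection", "false")
}

type WorkflowKey struct{ Namespace, WorkflowID, RunID string }

type Activities struct {
	PassiveWF    workflowservice.WorkflowServiceClient // passive frontend to list workflows
	PassiveAdmin adminservice.AdminServiceClient       // passive admin to delete workflows
}

// ListInput and ListOutput: assumed shapes, inferred from how the activities below use them.
type ListInput struct {
	Namespace string
	Query     string
	PageSize  int
	Cursor    []byte
}

type ListOutput struct {
	Items      []WorkflowKey
	NextCursor []byte
}
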
// ListPassive returns one page of matching workflows from the passive cluster.
func (a *Activities) ListPassive(ctx context.Context, in ListInput) (*ListOutput, error) {
	resp, err := a.PassiveWF.ListWorkflowExecutions(noRedirect(ctx),
		&workflowservice.ListWorkflowExecutionsRequest{
			Namespace:     in.Namespace,
			Query:         in.Query, // "ExecutionStatus!='Running' AND CloseTime<'2026-05-01T00:00:00Z'"
			PageSize:      int32(in.PageSize),
			NextPageToken: in.Cursor,
		})
	if err != nil {
		return nil, err
	}
	out := &ListOutput{NextCursor: resp.NextPageToken}
	for _, e := range resp.Executions {
		out.Items = append(out.Items, WorkflowKey{
			Namespace: in.Namespace, WorkflowID: e.Execution.WorkflowId, RunID: e.Execution.RunId,
		})
	}
	return out, nil
}

// DeletePassive deletes one workflow on the passive cluster via the admin API.
func (a *Activities) DeletePassive(ctx context.Context, key WorkflowKey) error {
	_, err := a.PassiveAdmin.DeleteWorkflowExecution(ctx,
		&adminservice.DeleteWorkflowExecutionRequest{
			Namespace: key.Namespace,
			Execution: &commonpb.WorkflowExecution{WorkflowId: key.WorkflowID, RunId: key.RunID},
		})
	// NotFound means the workflow is already gone, so treat it as success.
	var notFound *serviceerror.NotFound
	if err == nil || errors.As(err, &notFound) {
		return nil
	}
	return err
}
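The two activities above still need a loop that pages through the listing and deletes each item at a controlled pace. Here is a minimal, standard-library-only Go sketch of that loop, written against plain function types so it can run without a Temporal cluster. All names (`lister`, `deleter`, `drain`, `errNotFound`, `demo`) are hypothetical. In a real cleanup workflow the two callbacks would be the ListPassive and DeletePassive activities, and the pacing would come from workflow timers rather than `time.After`.

```go
package main

import (
	"context"
	"errors"
	"time"
)

type workflowKey struct{ WorkflowID, RunID string }

// page is one batch of results plus the cursor for the next batch.
type page struct {
	Items []workflowKey
	Next  []byte // empty when there are no more pages
}

type lister func(ctx context.Context, cursor []byte) (page, error)
type deleter func(ctx context.Context, k workflowKey) error

// errNotFound is a stand-in for a "workflow not found" error from the server.
var errNotFound = errors.New("not found")

// drain pages through list and calls del for every key, pausing between
// deletes. It returns how many keys were processed.
func drain(ctx context.Context, list lister, del deleter, pause time.Duration) (int, error) {
	var cursor []byte
	processed := 0
	for {
		p, err := list(ctx, cursor)
		if err != nil {
			return processed, err
		}
		for _, k := range p.Items {
			// A missing workflow is already cleaned up, so it is not an error.
			if err := del(ctx, k); err != nil && !errors.Is(err, errNotFound) {
				return processed, err
			}
			processed++
			select { // throttle between deletes, but stop promptly on cancellation
			case <-ctx.Done():
				return processed, ctx.Err()
			case <-time.After(pause):
			}
		}
		if len(p.Next) == 0 {
			return processed, nil
		}
		cursor = p.Next
	}
}

// demo runs drain against an in-memory, two-page listing.
func demo() (int, error) {
	pages := map[string]page{
		"":   {Items: []workflowKey{{"wf-1", "run-1"}, {"wf-2", "run-2"}}, Next: []byte("p2")},
		"p2": {Items: []workflowKey{{"wf-3", "run-3"}}},
	}
	list := func(_ context.Context, cursor []byte) (page, error) {
		return pages[string(cursor)], nil
	}
	del := func(_ context.Context, k workflowKey) error {
		if k.WorkflowID == "wf-2" {
			return errNotFound // already gone: counted as processed
		}
		return nil
	}
	return drain(context.Background(), list, del, time.Millisecond)
}
```

The `pause` argument is the knob for ramping up gradually while watching DB load on passive. For a backlog of this size, a real workflow would probably also need to continue-as-new periodically to keep its own history bounded.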