Hi Temporal community,
We are running a self-hosted Temporal setup and have implemented a custom DLQ mechanism as a safety net for workflow start failures.
Current Design

- If a workflow fails to start / enter the Temporal server, we retry it up to 3 times.
- If it still fails after retries, we push the workflow payload into a DLQ.
- A scheduled process picks up all entries from the DLQ and re-publishes them to Temporal every 30 minutes (at :00 and :30).
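For context, the scheduled job is essentially the following (a simplified sketch; `start_workflow` stands in for our Temporal client start call, and the entry shape is illustrative):

```python
def republish_dlq(dlq_entries, start_workflow):
    """Re-publish every DLQ entry back to Temporal in one pass.

    `start_workflow` is a stub for the real client call
    (e.g. client.start_workflow(...)). This is the current
    behaviour: the whole backlog is pushed in a single burst,
    with no pacing between starts.
    """
    started = []
    for entry in dlq_entries:
        start_workflow(entry)          # one start per DLQ entry, back to back
        started.append(entry["workflow_id"])
    return started

# A backlog of 1000 entries all hits Temporal at once at :00 / :30.
backlog = [{"workflow_id": f"wf-{i}", "payload": {}} for i in range(1000)]
started = republish_dlq(backlog, start_workflow=lambda entry: None)
print(len(started))  # 1000 starts in a single burst
```

So the larger the backlog, the bigger the instantaneous spike of workflow starts.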
Problem We Are Seeing

- Sometimes, when there is a large backlog in the DLQ, a big burst of workflows gets re-pushed into Temporal at once.
- After this happens:
  - Workers become completely unresponsive
  - No workflows are picked up or progressed
- The system only recovers after:
  - Restarting worker pods, or
  - Scaling up the number of worker pods
- This feels like some form of worker saturation, task queue backlog, or resource exhaustion, but we are not fully sure what we are missing in our design.
What We Are Looking For

We’d really appreciate guidance on:

- What are the anti-patterns in the above DLQ + bulk re-push approach?
- Is burst re-publishing a known problem with Temporal task queues or workers?
- Should we be:
  - Rate-limiting workflow starts from the DLQ?
  - Using separate task queues for DLQ retries?
  - Leveraging Temporal retry policies / backoff instead of external DLQ retries?
  - Tuning worker concurrency, pollers, or task queue settings?
- Are there Temporal-native patterns for handling “workflow start failures” and delayed retries more safely?
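Of these options, rate-limiting is the easiest for us to prototype. A minimal sketch of what we have in mind (the Temporal client call is stubbed, and `starts_per_second` is an assumed knob we would tune against worker capacity, not a real SDK setting):

```python
import time

def republish_dlq_throttled(dlq_entries, start_workflow, starts_per_second=50):
    """Pace DLQ re-publishing instead of bursting.

    `start_workflow` is a placeholder for the real Temporal client
    start call. A fixed interval between starts spreads the backlog
    over time instead of hitting the server all at once.
    """
    interval = 1.0 / starts_per_second
    for entry in dlq_entries:
        start_workflow(entry)
        time.sleep(interval)   # spread starts out over time

backlog = [{"workflow_id": f"wf-{i}"} for i in range(10)]
t0 = time.monotonic()
republish_dlq_throttled(backlog, start_workflow=lambda entry: None,
                        starts_per_second=100)
elapsed = time.monotonic() - t0
print(f"{elapsed:.3f}s")  # 10 starts take roughly 0.1s at 100/s
```

Is something like this (or a token bucket) the right direction, or is there a more idiomatic Temporal-side control?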
Environment (if relevant)

- Self-hosted Temporal
- Kubernetes-based workers
- Workers recover only after pod restart or horizontal scaling
Any insights, best practices, or references would be very helpful.
Thanks in advance for your support.