Hi Temporal community,
We are running a self-hosted Temporal setup and have implemented a custom DLQ mechanism as a safety net for workflow start failures.
Current Design

- If a workflow fails to start / enter the Temporal server, we retry it up to 3 times.
- If it still fails after retries, we push the workflow payload into a DLQ.
- A scheduled process picks up all entries from the DLQ and re-publishes them to Temporal every 30 minutes (at :00 and :30).
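For context, the scheduled job is essentially the following (a simplified sketch; `start_workflow` stands in for our Temporal client start call, and the entry shape is illustrative):

```python
def republish_dlq(dlq_entries, start_workflow):
    """Re-publish every DLQ entry back to Temporal in one pass.

    `start_workflow` is a stub for the real client call
    (e.g. client.start_workflow(...)). This is the current
    behaviour: the whole backlog is pushed in a single burst,
    with no pacing between starts.
    """
    started = []
    for entry in dlq_entries:
        start_workflow(entry)          # one start per DLQ entry, back to back
        started.append(entry["workflow_id"])
    return started

# A backlog of 1000 entries all hits Temporal at once at :00 / :30.
backlog = [{"workflow_id": f"wf-{i}", "payload": {}} for i in range(1000)]
started = republish_dlq(backlog, start_workflow=lambda entry: None)
print(len(started))  # 1000 starts in a single burst
```

So the larger the backlog, the bigger the instantaneous spike of workflow starts.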
Problem We Are Seeing

- Sometimes, when there is a large backlog in the DLQ, a big burst of workflows gets re-pushed into Temporal at once.
- After this happens:
  - Workers become completely unresponsive
  - No workflows are picked up or progressed
- The system only recovers after:
  - Restarting worker pods, or
  - Scaling up the number of worker pods
- This feels like some form of worker saturation, task queue backlog, or resource exhaustion, but we are not fully sure what we are missing in our design.
What We Are Looking For

We’d really appreciate guidance on:

- What are the anti-patterns in the above DLQ + bulk re-push approach?
- Is burst re-publishing a known problem with Temporal task queues or workers?
- Should we be:
  - Rate-limiting workflow starts from the DLQ?
  - Using separate task queues for DLQ retries?
  - Leveraging Temporal retry policies / backoff instead of external DLQ retries?
  - Tuning worker concurrency, pollers, or task queue settings?
- Are there Temporal-native patterns for handling “workflow start failures” and delayed retries more safely?
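Of these options, rate-limiting is the easiest for us to prototype. A minimal sketch of what we have in mind (the Temporal client call is stubbed, and `starts_per_second` is an assumed knob we would tune against worker capacity, not a real SDK setting):

```python
import time

def republish_dlq_throttled(dlq_entries, start_workflow, starts_per_second=50):
    """Pace DLQ re-publishing instead of bursting.

    `start_workflow` is a placeholder for the real Temporal client
    start call. A fixed interval between starts spreads the backlog
    over time instead of hitting the server all at once.
    """
    interval = 1.0 / starts_per_second
    for entry in dlq_entries:
        start_workflow(entry)
        time.sleep(interval)   # spread starts out over time

backlog = [{"workflow_id": f"wf-{i}"} for i in range(10)]
t0 = time.monotonic()
republish_dlq_throttled(backlog, start_workflow=lambda entry: None,
                        starts_per_second=100)
elapsed = time.monotonic() - t0
print(f"{elapsed:.3f}s")  # 10 starts take roughly 0.1s at 100/s
```

Is something like this (or a token bucket) the right direction, or is there a more idiomatic Temporal-side control?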
Environment (if relevant)

- Self-hosted Temporal
- Kubernetes-based workers
- Workers recover only after pod restart or horizontal scaling
Any insights, best practices, or references would be very helpful.
Thanks in advance for your support.