Recommended retry/DLQ pattern in case of Temporal Cluster unavailability

Hi All,

We are in the process of evaluating Temporal as a workflow platform for one of our use cases. In our existing architecture Kafka is the source of events. We are currently looking at how we can deliver messages reliably from Kafka to the Temporal Cluster when the Temporal Cluster is down.

Our end-to-end delivery latency for this particular workflow is 2 hrs, i.e. from the time the event is consumed from the source topic we have 1 hr until the end of workflow execution and downstream delivery. We also have no ordering requirements for this use case, i.e. events can be processed out of order. Given the above, we want to know the best practice for retries when the Temporal cluster is down.

The following are my thoughts -

  1. Publisher publishes events to Kafka Topic - SOURCE
  2. Kafka Consumer
    • Consumes events from the Kafka topic SOURCE
    • Calls the Temporal Cluster to start a workflow for each event (see the first sketch after this list)
    • In the happy-path scenario all is well and there are no issues
  3. In case the Temporal cluster is down, we will face the following issues -
    • The Kafka SOURCE topic will build up lag if we keep retrying the event in the Kafka consumer
      (until it succeeds posting to the Temporal Cluster, for up to the next 1 hr) → hence this is not an option
    • In the failure case the Kafka consumer copies the event to another topic, e.g. a RETRY queue. The RETRY
      topic has its own consumer which immediately tries to publish the message to the Temporal Cluster.
    • However there are 2 scenarios here
      • If the Temporal cluster failure was intermittent, the RETRY consumer will succeed
      • If the failure was not intermittent (i.e. a longer outage), then we need to persist
        the event to a persistent store
      • The persistent store will hold metadata about when to retry the event, how frequently (e.g. every 1 min)
        and for how long (e.g. keep trying for 15 mins); see the second sketch after this list
      • The persistent store will be the source of truth for a scheduler which schedules the RETRY jobs until
        they succeed, or until the timeout is reached, at which point they are marked as failed
      • The system will raise an alert when a RETRY job times out after 15 mins, which gives the Ops team
        a 45-min window to debug and investigate the job.
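
To make step 2 and the RETRY handoff in step 3 concrete, here is a rough sketch of what I have in mind for the main consumer, assuming the Temporal Java SDK and the plain Kafka Java client. The workflow interface OrderWorkflow, the topic names, the task queue and the connection settings are made up for illustration; the only point is that the consumer tries to start the workflow first and falls back to the RETRY topic if the start call fails.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowExecutionAlreadyStarted;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SourceConsumer {

  // Hypothetical workflow; a separate worker process implements and registers it.
  @WorkflowInterface
  public interface OrderWorkflow {
    @WorkflowMethod
    void process(String payload);
  }

  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient temporal = WorkflowClient.newInstance(service);

    Properties cp = new Properties();
    cp.put("bootstrap.servers", "localhost:9092");
    cp.put("group.id", "source-consumer");
    cp.put("enable.auto.commit", "false");
    cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    Properties pp = new Properties();
    pp.put("bootstrap.servers", "localhost:9092");
    pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
         KafkaProducer<String, String> retryProducer = new KafkaProducer<>(pp)) {
      consumer.subscribe(List.of("SOURCE"));

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          try {
            // A deterministic workflow id makes the start call idempotent:
            // re-delivering the same event cannot create a duplicate workflow.
            OrderWorkflow wf = temporal.newWorkflowStub(
                OrderWorkflow.class,
                WorkflowOptions.newBuilder()
                    .setTaskQueue("order-task-queue")
                    .setWorkflowId("order-" + record.key())
                    .build());
            WorkflowClient.start(wf::process, record.value());
          } catch (WorkflowExecutionAlreadyStarted alreadyStarted) {
            // Delivered on a previous attempt; nothing to do.
          } catch (Exception temporalUnavailable) {
            // Temporal unreachable (or start failed for another reason):
            // hand the event off to the RETRY topic instead of blocking here.
            retryProducer.send(new ProducerRecord<>("RETRY", record.key(), record.value()));
          }
        }
        retryProducer.flush();   // make sure RETRY handoffs are durable...
        consumer.commitSync();   // ...before committing SOURCE offsets (at-least-once)
      }
    }
  }
}
```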
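
And for the persistent store plus scheduler part of step 3, this is roughly the shape of record I have in mind (the names and the 1 min / 15 min values are just the examples from above, not tied to any particular database): a scheduler would scan for due envelopes, re-attempt the start call, and either delete the row on success, reschedule it, or mark it failed and raise the alert once it expires.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical row stored in the retry store.
public record RetryEnvelope(
    String eventKey,
    String payload,
    Instant firstFailureAt,
    Instant nextAttemptAt,
    int attempts) {

  // Matches the example numbers above: retry every minute, give up after 15 minutes.
  private static final Duration RETRY_INTERVAL = Duration.ofMinutes(1);
  private static final Duration GIVE_UP_AFTER = Duration.ofMinutes(15);

  /** Is this envelope ready for another attempt? */
  public boolean isDue(Instant now) {
    return !now.isBefore(nextAttemptAt);
  }

  /** Has the 15 min retry window elapsed? If so, mark it failed and alert Ops. */
  public boolean isExpired(Instant now) {
    return now.isAfter(firstFailureAt.plus(GIVE_UP_AFTER));
  }

  /** Envelope to persist after a failed attempt, scheduled one interval later. */
  public RetryEnvelope rescheduled(Instant now) {
    return new RetryEnvelope(eventKey, payload, firstFailureAt,
        now.plus(RETRY_INTERVAL), attempts + 1);
  }
}
```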

I would like to know the community's thoughts on the above pattern/approach. I am thinking that maybe for the RETRY scheduling we could use Temporal itself (although it's not really a workflow, it serves the retry requirement). Please let me know your thoughts and whether there are any better patterns.

  • The Kafka SOURCE topic will build up lag if we keep retrying the event in the Kafka consumer
    (until it succeeds posting to the Temporal Cluster, for up to the next 1 hr) → hence this is not an option

I don’t understand why it is an issue to keep retrying messages until the Temporal service is back. Kafka would have a problem if only some of the messages were failing (poison pills), but if all of the messages are failing then it is OK to accumulate a backlog.
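
For illustration, "just keep retrying" can be as small as a blocking loop around the start call, reusing the types from the sketch in the question (the backoff values are arbitrary). Offsets are only committed after the start succeeds, so an outage just shows up as consumer lag on SOURCE:

```java
// Called from inside the poll loop for each record; blocks until the start succeeds.
void startWithRetry(WorkflowClient temporal, ConsumerRecord<String, String> record)
    throws InterruptedException {
  long backoffMs = 1_000;
  while (true) {
    try {
      SourceConsumer.OrderWorkflow wf = temporal.newWorkflowStub(
          SourceConsumer.OrderWorkflow.class,
          WorkflowOptions.newBuilder()
              .setTaskQueue("order-task-queue")
              .setWorkflowId("order-" + record.key())
              .build());
      WorkflowClient.start(wf::process, record.value());
      return;
    } catch (WorkflowExecutionAlreadyStarted e) {
      return;                                      // already delivered earlier
    } catch (Exception e) {
      Thread.sleep(backoffMs);                     // wait, then try again
      backoffMs = Math.min(backoffMs * 2, 60_000); // capped exponential backoff
    }
  }
}
```

One practical caveat: blocking inside the poll loop for longer than max.poll.interval.ms (5 minutes by default) will get the consumer kicked out of its group, so either raise that setting or pause the partitions while retrying.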