Recommended retry/DLQ pattern in case of Temporal Cluster unavailability

Hi All,

We are in the process of evaluating Temporal as a workflow platform for one of our use cases. In our existing architecture Kafka is the source of events. We are currently looking at how we can deliver messages reliably from Kafka to the Temporal Cluster when the Temporal Cluster is down.

Our end-to-end delivery latency for this particular workflow is 2 hrs, i.e. from the time the event is consumed from the source topic we have 1 hr until the end of workflow execution and downstream delivery. We also have no ordering requirements for this use case, i.e. events can be processed out of order. Given the above, we want to know the best practice for retries when the Temporal cluster is down.

The following are my thoughts -

  1. Publisher publishes events to Kafka Topic - SOURCE
  2. Kafka Consumer
    • Consumes events from the Kafka topic SOURCE
    • Calls the Temporal Cluster to start a workflow for each event (see the first sketch after this list)
    • In the happy-path scenario all is well and there are no issues
  3. In case the Temporal cluster is down, we will face the following issues -
    • The Kafka SOURCE topic will build up lag if we keep retrying the event in the Kafka consumer
      (until it succeeds posting to the Temporal Cluster, for up to the next 1 hr) → hence this is not an option
    • In the failure case the Kafka consumer copies the event to another topic, e.g. a RETRY queue. The RETRY
      topic has its own consumer which immediately tries to publish the message to the Temporal Cluster.
    • However there are 2 scenarios here
      • If the Temporal cluster failure was intermittent, the RETRY consumer will succeed
      • If the failure was not intermittent (i.e. a longer outage), then we need to persist
        the event to a persistent store
      • The persistent store will hold metadata about when to retry the event, how frequently (e.g. every 1 min)
        and for how long (e.g. keep trying for 15 mins); see the second sketch after this list
      • The persistent store will be the source of truth for a scheduler which schedules the RETRY jobs until
        they succeed, or until the timeout is reached, at which point they are marked as failed
      • The system will raise an alert when a RETRY job times out after 15 mins, which gives the Ops team
        a 45-min window to debug and investigate the job.
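
To make step 2 and the RETRY handoff in step 3 concrete, here is a rough sketch of what I have in mind for the main consumer, assuming the Temporal Java SDK and the plain Kafka Java client. The workflow interface OrderWorkflow, the topic names, the task queue and the connection settings are made up for illustration; the only point is that the consumer tries to start the workflow first and falls back to the RETRY topic if the start call fails.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowExecutionAlreadyStarted;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SourceConsumer {

  // Hypothetical workflow; a separate worker process implements and registers it.
  @WorkflowInterface
  public interface OrderWorkflow {
    @WorkflowMethod
    void process(String payload);
  }

  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient temporal = WorkflowClient.newInstance(service);

    Properties cp = new Properties();
    cp.put("bootstrap.servers", "localhost:9092");
    cp.put("group.id", "source-consumer");
    cp.put("enable.auto.commit", "false");
    cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    Properties pp = new Properties();
    pp.put("bootstrap.servers", "localhost:9092");
    pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
         KafkaProducer<String, String> retryProducer = new KafkaProducer<>(pp)) {
      consumer.subscribe(List.of("SOURCE"));

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          try {
            // A deterministic workflow id makes the start call idempotent:
            // re-delivering the same event cannot create a duplicate workflow.
            OrderWorkflow wf = temporal.newWorkflowStub(
                OrderWorkflow.class,
                WorkflowOptions.newBuilder()
                    .setTaskQueue("order-task-queue")
                    .setWorkflowId("order-" + record.key())
                    .build());
            WorkflowClient.start(wf::process, record.value());
          } catch (WorkflowExecutionAlreadyStarted alreadyStarted) {
            // Delivered on a previous attempt; nothing to do.
          } catch (Exception temporalUnavailable) {
            // Temporal unreachable (or start failed for another reason):
            // hand the event off to the RETRY topic instead of blocking here.
            retryProducer.send(new ProducerRecord<>("RETRY", record.key(), record.value()));
          }
        }
        retryProducer.flush();   // make sure RETRY handoffs are durable...
        consumer.commitSync();   // ...before committing SOURCE offsets (at-least-once)
      }
    }
  }
}
```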
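
And for the persistent store plus scheduler part of step 3, this is roughly the shape of record I have in mind (the names and the 1 min / 15 min values are just the examples from above, not tied to any particular database): a scheduler would scan for due envelopes, re-attempt the start call, and either delete the row on success, reschedule it, or mark it failed and raise the alert once it expires.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical row stored in the retry store.
public record RetryEnvelope(
    String eventKey,
    String payload,
    Instant firstFailureAt,
    Instant nextAttemptAt,
    int attempts) {

  // Matches the example numbers above: retry every minute, give up after 15 minutes.
  private static final Duration RETRY_INTERVAL = Duration.ofMinutes(1);
  private static final Duration GIVE_UP_AFTER = Duration.ofMinutes(15);

  /** Is this envelope ready for another attempt? */
  public boolean isDue(Instant now) {
    return !now.isBefore(nextAttemptAt);
  }

  /** Has the 15 min retry window elapsed? If so, mark it failed and alert Ops. */
  public boolean isExpired(Instant now) {
    return now.isAfter(firstFailureAt.plus(GIVE_UP_AFTER));
  }

  /** Envelope to persist after a failed attempt, scheduled one interval later. */
  public RetryEnvelope rescheduled(Instant now) {
    return new RetryEnvelope(eventKey, payload, firstFailureAt,
        now.plus(RETRY_INTERVAL), attempts + 1);
  }
}
```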

I would like to know the community's thoughts on the above pattern/approach. I am thinking that maybe for the RETRY scheduling we could use Temporal itself (although it's not really a workflow, it serves the retry requirement). Please let me know your thoughts and whether there are any better patterns.

  • The Kafka SOURCE topic will build up lag if we keep retrying the event in the Kafka consumer
    (until it succeeds posting to the Temporal Cluster, for up to the next 1 hr) → hence this is not an option

I don’t understand why it is an issue to keep retrying messages until the Temporal service is back. Kafka would have a problem if only some of the messages were failing (poison pills), but if all of the messages are failing then it is OK to accumulate a backlog.
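
For illustration, "just keep retrying" can be as small as a blocking loop around the start call, reusing the types from the sketch in the question (the backoff values are arbitrary). Offsets are only committed after the start succeeds, so an outage just shows up as consumer lag on SOURCE:

```java
// Called from inside the poll loop for each record; blocks until the start succeeds.
void startWithRetry(WorkflowClient temporal, ConsumerRecord<String, String> record)
    throws InterruptedException {
  long backoffMs = 1_000;
  while (true) {
    try {
      SourceConsumer.OrderWorkflow wf = temporal.newWorkflowStub(
          SourceConsumer.OrderWorkflow.class,
          WorkflowOptions.newBuilder()
              .setTaskQueue("order-task-queue")
              .setWorkflowId("order-" + record.key())
              .build());
      WorkflowClient.start(wf::process, record.value());
      return;
    } catch (WorkflowExecutionAlreadyStarted e) {
      return;                                      // already delivered earlier
    } catch (Exception e) {
      Thread.sleep(backoffMs);                     // wait, then try again
      backoffMs = Math.min(backoffMs * 2, 60_000); // capped exponential backoff
    }
  }
}
```

One practical caveat: blocking inside the poll loop for longer than max.poll.interval.ms (5 minutes by default) will get the consumer kicked out of its group, so either raise that setting or pause the partitions while retrying.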