Temporal Workflows and/or Kubernetes Operator reconciliation loops?

Hi. This isn’t necessarily an either/or comparison; I’m also interested in how to make a Temporal Workflow behave like a Kubernetes Operator reconciliation loop.

My current use case involves automating a complex installation with roughly these steps:

  1. Install a Kubernetes Deployment into a cluster. (Right now, we’re using Helm, but any means would be fine.)
  2. Call a 3rd-party API with information from Step 1 and get some information needed by Step 3.
  3. Install a different Kubernetes DaemonSet into the same cluster, using information from Step 2.

Any of these steps can fail for various reasons (networking issues, the cluster being down, the API going down, etc.). The fastest execution time for the whole workflow is about 10 minutes. (We’re currently implementing it in another workflow engine, but let’s ignore that for the moment.) At worst, the whole thing can get “hung up” and wait for days or weeks while we sort out the issues, which can be very complex in this environment.

If we implemented this as a Kubernetes Operator, its reconciliation loop would keep retrying forever (unless we give it timeouts). A reconciler would also be able to recover if the Deployment (Step 1) got removed AFTER the DaemonSet got installed. In that sense the Operator is self-healing: it continually watches for deviations from the declared state (as expressed in a Kubernetes Custom Resource) and corrects them.
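To make the behaviour I mean concrete, here is a minimal sketch of a reconcile function in plain Python. It is not Operator SDK code: `Cluster`, `fetch_token`, and the resource names are hypothetical stand-ins, and the "cluster" is just an in-memory dict.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """In-memory stand-in for a real cluster (hypothetical)."""
    resources: dict = field(default_factory=dict)


def fetch_token(deployment_id: str) -> str:
    """Stand-in for the third-party API call in Step 2 (hypothetical)."""
    return f"token-for-{deployment_id}"


def reconcile(cluster: Cluster) -> list:
    """Compare observed state with desired state and repair what is missing.

    Returns the list of actions taken; an empty list means no drift was found.
    """
    actions = []
    if "deployment" not in cluster.resources:
        cluster.resources["deployment"] = "deploy-1"
        # Anything derived from the old Deployment is now suspect, so drop it.
        cluster.resources.pop("daemonset", None)
        actions.append("install deployment")
    if "daemonset" not in cluster.resources:
        token = fetch_token(cluster.resources["deployment"])
        cluster.resources["daemonset"] = token
        actions.append("install daemonset")
    return actions


cluster = Cluster()
first = reconcile(cluster)             # initial install: both steps run
second = reconcile(cluster)            # nothing to do
del cluster.resources["deployment"]    # someone removes the Deployment
third = reconcile(cluster)             # drift is detected and repaired
print(first, second, third)
```

The point is that `reconcile` never trusts the result of an earlier run; it looks at the current state every time it is called.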

If we implemented this in Temporal, I can see how we could configure retries to continue for a long time. But I don’t know how we could make it self-healing. If Temporal completes an Activity for Step 1, then the return value from Step 1 will be cached in the Temporal Cluster and treated as equivalent to real-world state. The Temporal Workflow will continue with Step 2 and Step 3, but we could end up with part of the installation incomplete, and Temporal would not know. If I understand correctly, Temporal could complete a Workflow and report success even if things have changed in the real world.
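The concern can be illustrated with a toy model, assuming I have the behaviour right. This is not the Temporal API: `History` and `run_activity` are hypothetical names that only mimic the idea of a completed step's result being recorded and reused rather than re-checked.

```python
class History:
    """Toy model of recorded activity results (not the Temporal API)."""

    def __init__(self):
        self.results = {}

    def run_activity(self, name, fn):
        # A step that already completed is not executed again;
        # its recorded result is returned instead.
        if name not in self.results:
            self.results[name] = fn()
        return self.results[name]


# The "real world": what actually exists in the cluster right now.
world = {"deployment": None}


def install_deployment():
    world["deployment"] = "deploy-1"
    return "deploy-1"


history = History()
recorded = history.run_activity("step1", install_deployment)

world["deployment"] = None  # the Deployment is removed out-of-band

replayed = history.run_activity("step1", install_deployment)

# The recorded result still says "deploy-1", but the real world disagrees.
stale = replayed is not None and world["deployment"] is None
print(recorded, replayed, stale)
```

Nothing in this model ever looks at `world` again after Step 1 completes, which is exactly the gap a reconciler would close.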

tl;dr: Is there a way to make a Temporal Workflow function like a self-healing Kubernetes Operator reconciliation loop?

A Workflow is code that contains your business logic. So, if your business logic requires a self-healing reconciliation loop, you can code your workflow to implement such a loop.

For example, you can have an activity that keeps polling for changes to the state created in Step 1. When this activity returns (meaning a change was detected), the workflow can cancel its current operations and re-execute the other steps from the beginning.
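A rough sketch of that shape, in plain Python rather than Temporal SDK code. All names here (`run_steps`, `drifted`, `reconciling_workflow`, the `external_events` table) are hypothetical; in a real workflow each step and the drift check would be activities, the wait between polls would be a workflow timer, and a very long-running loop would probably need Continue-As-New to keep the history bounded.

```python
def install_deployment(world):
    """Step 1 stand-in: create the Deployment."""
    world["deployment"] = "deploy-1"
    return world["deployment"]


def call_third_party_api(deployment_id):
    """Step 2 stand-in: derive information from Step 1's output."""
    return f"token-for-{deployment_id}"


def install_daemonset(world, token):
    """Step 3 stand-in: create the DaemonSet using Step 2's output."""
    world["daemonset"] = token


def run_steps(world, log):
    """Run Steps 1 to 3 from the beginning."""
    deployment_id = install_deployment(world)
    log.append("step1")
    token = call_third_party_api(deployment_id)
    log.append("step2")
    install_daemonset(world, token)
    log.append("step3")


def drifted(world):
    """Polling check: is anything we installed missing from the real world?"""
    return world.get("deployment") is None or world.get("daemonset") is None


def reconciling_workflow(world, external_events, max_polls):
    """Install once, then keep polling and reinstall whenever drift is seen."""
    log = []
    run_steps(world, log)
    for poll in range(max_polls):
        # Simulate the outside world changing between polls.
        event = external_events.get(poll)
        if event is not None:
            event(world)
        if drifted(world):
            log.append("drift")
            run_steps(world, log)
    return log


world = {}
# On the second poll (index 1), something deletes the Deployment.
events = {1: lambda w: w.update(deployment=None)}
log = reconciling_workflow(world, events, max_polls=3)
print(log)
```

The sketch bounds the loop with `max_polls` only so that it terminates; a real reconciler would keep going until it is told to stop.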