Worker versioning: Best practices when reverting the default version to an old Build ID?

A common pattern in our dev environment is for someone to deploy a test version of a microservice to a k8s namespace, do some testing, and then restore the stable version of the service. I’m trying to figure out how Worker versioning fits with this pattern. We use simple rolling updates to deploy new microservices, and don’t currently support running multiple versions of a given microservice for arbitrary lengths of time. Because of that limitation, we’re trying to keep Build IDs as compatible versions within a single version set as much as possible, to avoid having to run two parallel versions of the Worker. With these constraints in mind, I imagined our flow for this use case would look like this:

  • We start with a “stable” Temporal Worker with Build ID “1”, registered on some task queue. “1” is set as the default version for the queue, with no other versions in the version set.
  • We launch a Workflow against that queue, allowing it to use the default version for the queue. The Workflow starts executing on the Worker.
  • We start a rolling update that deploys a “test” version of the Temporal Worker, registered on the same queue, using Build ID “2”. We add Build ID “2” to the same version set as “1” and make “2” the new default. We use the Workflow versioning (patching) API to make the Workflow code in Worker “2” backwards compatible with Workflows started on Worker “1”.
    • My observation in testing is that at this point, the Workflow we launched against “1” was still running and switched to executing on Worker “2”. This is what I expected to see. Also at this point, the original Worker started failing polls with the error OUT_OF_RANGE: Task queue has a newer compatible build: "2". This is also what I expected.
  • Worker “1” shuts down as the rolling update completes.
  • We launch another Workflow against the queue, allowing it to use the default version for the queue. The Workflow starts executing on the Worker with Build ID “2”.
  • Before that Workflow completes, we decide to revert to the stable version of the microservice. We redeploy Worker “1”, and promote Build ID “1” to again be the default in the version set.
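To make the sequence above concrete, here’s a toy Python model of the version-set bookkeeping as I understand it. This is not the Temporal API; `VersionSet` and its methods are invented names, and the “adding a compatible ID also promotes it” rule is my assumption based on the behavior I observed:

```python
# Toy model of Build ID version sets (invented names, not the real SDK/CLI).
class VersionSet:
    def __init__(self, build_id):
        self.build_ids = [build_id]   # insertion order
        self.default = build_id       # the set's current default member

    def add_compatible(self, new_id, existing_id):
        assert existing_id in self.build_ids
        self.build_ids.append(new_id)
        self.default = new_id         # assumption: adding also promotes

    def promote(self, build_id):
        assert build_id in self.build_ids
        self.default = build_id


# Step 1: stable Worker "1" is the queue default.
vs = VersionSet("1")
assert vs.default == "1"

# Rolling update: "2" added as compatible with "1"; "2" becomes the default.
vs.add_compatible("2", existing_id="1")
assert vs.default == "2"

# Revert: promote "1" back to default within the same set.  Insertion order
# is unchanged; only the default pointer moves.  Pollers whose Build ID is
# not the set default are rejected, matching the OUT_OF_RANGE errors.
vs.promote("1")
assert vs.build_ids == ["1", "2"]
assert vs.default == "1"
```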

At this point, what I observed in testing is:

  • Worker “2” started failing with the error OUT_OF_RANGE: Task queue has a newer compatible build: "1".
    • This was surprising to me, since “2” was actually added to the set after “1”. I had assumed “newer” meant “added to the set more recently”; it seems it actually means “promoted to set default more recently”?
  • Both Workflows that had been executing against “2” switched to executing against “1”.
    • This also surprised me, because it can lead to incorrect behavior; Worker “2” can have incompatible changes that won’t be handled correctly by Worker “1”, because it lacks the Workflow versioning API calls necessary to preserve determinism.

I was expecting the Workflows that ran on Worker “2” to get stuck due to there not being any compatible Worker running. I thought this would be the case because Worker “1”, while part of the same version set and now the default version in that set, was added to the set before version “2” and is missing the Workflow versioning logic required to safely execute code that ran on version “2”. But it seems that adding versions to a version set really asserts that all versions in the set are fully compatible with each other, both forwards and backwards, regardless of the order in which they were added. Is that correct?
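To restate the routing rule I think I’m observing as a minimal sketch (again invented names, not the SDK): an open Workflow is bound to a version *set*, not to the specific Build ID it started on, and each new task is dispatched to the set’s current default member. That would explain why both Workflows moved to “1” after the promotion:

```python
# Each open Workflow remembers the version set it is bound to, not a
# specific Build ID.  Dispatch routes its next task to the set's current
# default, regardless of which member produced earlier history events.
workflows = [
    {"id": "wf-started-on-1", "set": "setA"},
    {"id": "wf-started-on-2", "set": "setA"},
]
set_defaults = {"setA": "2"}

def dispatch(wf):
    return set_defaults[wf["set"]]

# While "2" is the set default, both run on "2".
assert [dispatch(w) for w in workflows] == ["2", "2"]

# After promoting "1" back to default, BOTH move to "1" -- including the
# Workflow whose history contains events produced only by "2".
set_defaults["setA"] = "1"
assert [dispatch(w) for w in workflows] == ["1", "1"]
```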

Is there a better way to handle this use case? In a dev environment, it might be OK to simply document that you should not promote an “old” version within a version set unless it’s forward compatible with the newer version that’s currently the default, at least while Workflows executing against the newer version are still running. If someone does it by mistake, you might get some failures, but that’s probably tolerable for dev.

I also wonder what the best practice would be in a similar use case in a prod environment, where you deploy a new version of a Worker that you have marked as compatible with an older version, but then realize the new version is buggy and you want to roll back. Can you do that in a way that causes Workflows that ran against the buggy version to stop, while still allowing new Workflows to be launched against the old version of the Worker?

I guess one option would be to assign a new Build ID (say “3”) to the old Worker code, put it in a new version set, and make that set the queue default. But to get the stuck Workflows going again, you’d have to fix the bug and then deploy the fix in both version sets. So you’d deploy a Worker with the fix using Build ID “3.1” and add it to the “3” set: ("3", "3.1"), but also deploy the fix in a second instance of the Worker with Build ID “2.1”, and make that the default in the old set: ("1", "2", "2.1"). You’d then have to let the “2.1” Worker run alongside the “3.1” Worker until all in-progress Workflows running on “2.1” complete, correct? Are there any other options here?
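Sketching that option with the same kind of toy model (invented names and structure; the real operations would be the SDK/CLI equivalents of “add new default set” and “add compatible version”), assuming the queue routes new Workflows to the default member of the most recently promoted set:

```python
# Toy model of the proposed rollback: per-queue version sets, where new
# Workflows start on the default member of the queue's default set.
queue = {
    "sets": {"old": {"members": ["1", "2"], "default": "2"}},
    "default_set": "old",
}

# Rollback: redeploy the stable code under NEW Build ID "3" in a NEW set,
# and make that set the queue default.  New Workflows now start on "3";
# Workflows bound to the old set are stuck (buggy "2" has no Worker).
queue["sets"]["new"] = {"members": ["3"], "default": "3"}
queue["default_set"] = "new"
assert queue["sets"][queue["default_set"]]["default"] == "3"

# Ship the fix twice: "3.1" as compatible default of the new set, and
# "2.1" as compatible default of the old set.
for set_name, fixed_id in [("new", "3.1"), ("old", "2.1")]:
    s = queue["sets"][set_name]
    s["members"].append(fixed_id)
    s["default"] = fixed_id

# Stuck old-set Workflows resume on "2.1"; new Workflows start on "3.1".
assert queue["sets"]["old"]["default"] == "2.1"
assert queue["sets"]["new"]["default"] == "3.1"
```

The "2.1" Worker in this sketch exists only to drain the old set, which matches the "run alongside until in-progress Workflows complete" step in the plan above.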