Workflow versioning corner cases

Hello Temporal team!

My team has recently run into some challenges with Temporal workflow versioning, and we wonder whether you have any suggestions on how to deal with them.

Recovering from an incompatible version deployment

In one case, we deployed a new version of a workflow that was supposed to be backwards-compatible and passed our replay tests, but it turned out to be incompatible in some rare scenarios, so some workflows started to fail with determinism exceptions (Event 27 of EVENT_TYPE_TIMER_STARTED does not match command COMMAND_TYPE_COMPLETE_WORKFLOW_EXECUTION).

We then deployed an update to the workflow that introduced a proper version via Workflow.getVersion(). We also downloaded some of the failed histories and made sure the updated workflow replays them successfully.
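To make the intent of that fix concrete, here is a toy model of how a version marker lets one code base serve both old and new histories. It is a simplified sketch and not the SDK implementation: the class VersionMarkerModel, the change ID "notification-rule", and the rule names are all made up for illustration, and only the DEFAULT_VERSION value of -1 is meant to mirror Workflow.DEFAULT_VERSION.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of version markers. This is NOT the Temporal SDK implementation. */
final class VersionMarkerModel {
    // Mirrors the value of Workflow.DEFAULT_VERSION in the Java SDK.
    static final int DEFAULT_VERSION = -1;

    // Stands in for the version markers recorded in a workflow history.
    private final Map<String, Integer> markers;
    // True when the code is re-executed against an existing history.
    private final boolean replaying;

    VersionMarkerModel(Map<String, Integer> recordedMarkers, boolean replaying) {
        this.markers = new HashMap<>(recordedMarkers);
        this.replaying = replaying;
    }

    int getVersion(String changeId, int minSupported, int maxSupported) {
        Integer recorded = markers.get(changeId);
        int version;
        if (recorded != null) {
            version = recorded;            // a marker exists: reuse it
        } else if (replaying) {
            version = DEFAULT_VERSION;     // history predates the change
        } else {
            version = maxSupported;        // first execution: record newest
            markers.put(changeId, version);
        }
        if (version < minSupported || version > maxSupported) {
            throw new IllegalStateException(
                "Version " + version + " of '" + changeId + "' is not supported");
        }
        return version;
    }

    // Hypothetical workflow step that branches on the version.
    static String notificationRule(VersionMarkerModel model) {
        int version = model.getVersion("notification-rule", DEFAULT_VERSION, 1);
        return version == DEFAULT_VERSION ? "old-rule" : "new-rule";
    }

    public static void main(String[] args) {
        // Replaying a history written before the change keeps the old branch.
        System.out.println(notificationRule(new VersionMarkerModel(new HashMap<>(), true)));
        // A fresh execution records the newest version and takes the new branch.
        System.out.println(notificationRule(new VersionMarkerModel(new HashMap<>(), false)));
    }
}
```

The model also hints at why a workflow that already ran on an unversioned breaking change is hard to rescue: its history contains commands from the new branch but no marker, so a replay falls back to the default version and takes the old branch.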

That deploy fixed the issue, but later we hit an even rarer one (COMMAND_TYPE_SCHEDULE_ACTIVITY_TASK doesn't match EVENT_TYPE_ACTIVITY_TASK_SCHEDULED with EventId=231). Downloading the history of the workflow that failed this time shows that neither the original workflow nor the new version can replay it. The history seems to come from a workflow that started on version 1, then ran for some time on the (breaking) version 2, and then started to fail on version 3, which introduced the version check.

We wonder:

  • Is there a way to identify potentially affected workflows before they fail? These secondary failures can apparently happen hours or days after version 3 is deployed.
  • What is the proper way to fix these issues? Should we reset the offending workflows to a point before the failure?
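One way we could imagine approaching the first question is to replay the histories of all open workflows against the new code right after a deploy and collect the ones that fail. The sketch below shows only the triage loop. ReplayTriage and HistoryReplayer are hypothetical names, and the callback is an assumption: in a real setup it would presumably wrap whatever the replay tests already use to replay a downloaded history.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a triage loop over downloaded histories; names are hypothetical. */
final class ReplayTriage {

    /** Hypothetical callback: replays one history and throws if replay fails. */
    interface HistoryReplayer {
        void replay(String historyId) throws Exception;
    }

    /** Returns the failing history IDs mapped to the failure message. */
    static Map<String, String> findFailing(List<String> historyIds, HistoryReplayer replayer) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String id : historyIds) {
            try {
                replayer.replay(id);
            } catch (Exception e) {
                // Keep going so one bad history does not hide the others.
                failures.put(id, String.valueOf(e.getMessage()));
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Stand-in replayer: pretends that history "wf-2" is incompatible.
        HistoryReplayer fake = id -> {
            if (id.equals("wf-2")) {
                throw new IllegalStateException("command does not match event");
            }
        };
        Map<String, String> failing = findFailing(List.of("wf-1", "wf-2", "wf-3"), fake);
        System.out.println(failing.keySet());
    }
}
```

Workflows flagged this way would be the candidates for a reset, before they reach the incompatible point on a live worker.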

Versions in timer-triggered actions

One of our workflows is used to start some timers based on incoming signals and later send notifications. The logic of whether to send the notification when the timer is triggered has changed, so we added a Workflow.getVersion() call to the code that is triggered by the timer (via Workflow.newTimer(duration).thenApply()). It seems to work as expected, but we see a pretty high volume of errors coming from that workflow:

io.temporal.internal.replay.InternalWorkflowTaskException: Failure handling event 15 of 'EVENT_TYPE_MARKER_RECORDED' type. IsReplaying=true, PreviousStartedEventId=13, workflowTaskStartedEventId=20, Currently Processing StartedEventId=13
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:193)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)
	at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:201)
	at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:319)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:279)
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalStateException: Version is already set to 1. The most probable cause is retroactive addition of a getVersion call with an existing 'changeId'
	at io.temporal.internal.statemachines.VersionStateMachine.updateVersionFromEvent(VersionStateMachine.java:273)
	at io.temporal.internal.statemachines.VersionStateMachine.handleNonMatchingEvent(VersionStateMachine.java:325)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleVersionMarker(WorkflowStateMachines.java:302)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleCommandEvent(WorkflowStateMachines.java:250)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventImpl(WorkflowStateMachines.java:199)
	at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:178)
	... 11 common frames omitted

Affected workflows seem to eventually proceed past this error, but we wonder whether we’re doing something wrong. Can we call getVersion from timer callback methods?

Use the auto-reset feature to roll back the state of the workflows that are stuck in limbo. The service will automatically detect all workflows in this state and roll them back to the point before the bad deployment.

Thank you, @maxim, that’s exactly what we were looking for. It seems we did not populate binaryChecksum for our workers, so the feature was not available to us. We’re now populating it with Git commit hashes.
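For anyone else setting this up, a minimal sketch of resolving the checksum at worker startup could look like the following. The GIT_COMMIT environment variable and the "unknown" fallback are assumptions about the build setup and not something Temporal requires. As far as I know, the resulting string is what gets passed to WorkflowClientOptions.newBuilder().setBinaryChecksum(...) in the Java SDK, but please check that against your SDK version.

```java
import java.util.Map;

/** Sketch: derive a binary checksum from the build's Git commit hash. */
final class BinaryChecksum {
    // Hypothetical variable name; assumed to be set by the CI pipeline.
    static final String ENV_VAR = "GIT_COMMIT";

    /** Returns the trimmed commit hash, or the fallback if it is missing or blank. */
    static String resolve(Map<String, String> env, String fallback) {
        String commit = env.get(ENV_VAR);
        if (commit == null || commit.trim().isEmpty()) {
            return fallback;
        }
        return commit.trim();
    }

    public static void main(String[] args) {
        String checksum = resolve(System.getenv(), "unknown");
        // The value would then be handed to the client options when the
        // worker is created, so every deploy is identifiable for a reset.
        System.out.println("binaryChecksum=" + checksum);
    }
}
```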

As for the workflow versions interacting with timers, I feel we might have hit some SDK corner cases. I’ll try to isolate them if I can and open the relevant issues. For now, I’ve opened pull request #447 in temporalio/sdk-java ("getVersion sometimes works incorrectly when used inside a Timer", by GreyTeardrop), which seems to reproduce one of the issues we’ve seen in our unit tests.
