Hello Temporal team!
My team has recently faced some challenges with Temporal workflow versioning, and we’re wondering if you have any suggestions on how to deal with them.
In one case we deployed a new version of a workflow that was supposed to be backwards-compatible and passed our replay tests, but it turned out to be incompatible in some rare scenarios, so some workflows started to fail with determinism exceptions (Event 27 of EVENT_TYPE_TIMER_STARTED does not match command COMMAND_TYPE_COMPLETE_WORKFLOW_EXECUTION).
We then deployed an update to the workflow that introduced a proper version via Workflow.getVersion() (we also downloaded some of the failed histories and made sure the updated workflow replays them successfully).
That deploy fixed the issue, but later we hit an even rarer one (COMMAND_TYPE_SCHEDULE_ACTIVITY_TASK doesn't match EVENT_TYPE_ACTIVITY_TASK_SCHEDULED with EventId=231). Downloading the history of the workflow that failed this time shows that neither the original workflow nor the new version can replay it. It seems to come from a workflow that started on version 1, then ran for some time on the (breaking) version 2, and then started to fail on version 3, which introduced the version check.
We wonder:
- Is there a way to identify potentially affected workflows before they fail? These secondary failures seem to happen hours or days after the deploy of version 3.
- What is the proper way to fix these issues? Should we reset the offending workflows to a point before the failure?
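To make the first question concrete: the secondary failures seem limited to workflows that made progress while version 2 was deployed, so one rough pre-filter would be to flag every workflow that completed a workflow task inside that window. Below is a minimal sketch of such a filter, assuming the task completion timestamps have already been extracted from downloaded histories. The class, the method, the workflow IDs, and the dates are all hypothetical, and none of this is a Temporal API.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the Temporal SDK. */
class BreakingWindowScan {

  /**
   * Returns the IDs of workflows that completed at least one workflow task
   * while the breaking version was deployed (windowStart <= t < windowEnd).
   */
  static List<String> flagAtRisk(
      Map<String, List<Instant>> taskCompletionTimes,
      Instant windowStart,
      Instant windowEnd) {
    List<String> flagged = new ArrayList<>();
    for (Map.Entry<String, List<Instant>> entry : taskCompletionTimes.entrySet()) {
      for (Instant t : entry.getValue()) {
        // One completion inside the window is enough to flag the workflow.
        if (!t.isBefore(windowStart) && t.isBefore(windowEnd)) {
          flagged.add(entry.getKey());
          break;
        }
      }
    }
    return flagged;
  }

  public static void main(String[] args) {
    // Made-up deploy times for version 2 and version 3.
    Instant v2Deployed = Instant.parse("2024-01-10T12:00:00Z");
    Instant v3Deployed = Instant.parse("2024-01-11T09:00:00Z");

    // Made-up completion times per workflow ID.
    Map<String, List<Instant>> times = new LinkedHashMap<>();
    times.put("wf-a", List.of(Instant.parse("2024-01-09T08:00:00Z")));
    times.put("wf-b", List.of(
        Instant.parse("2024-01-09T08:00:00Z"),
        Instant.parse("2024-01-10T15:30:00Z")));

    System.out.println(flagAtRisk(times, v2Deployed, v3Deployed)); // prints [wf-b]
  }
}
```

With a rolling deploy the window would probably need to be widened to cover the period when workers of both versions were polling. Flagged workflows would still have to be replayed against version 3 to confirm that they are actually broken.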
One of our workflows starts timers based on incoming signals and later sends notifications. The logic that decides whether to send the notification when a timer fires has changed, so we added a Workflow.getVersion() call to the code triggered by the timer (via Workflow.newTimer(duration).thenApply()). It seems to work as expected, but we see a pretty high volume of errors coming from that workflow:
io.temporal.internal.replay.InternalWorkflowTaskException: Failure handling event 15 of 'EVENT_TYPE_MARKER_RECORDED' type. IsReplaying=true, PreviousStartedEventId=13, workflowTaskStartedEventId=20, Currently Processing StartedEventId=13
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:193)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:201)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:114)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:319)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:279)
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IllegalStateException: Version is already set to 1. The most probable cause is retroactive addition of a getVersion call with an existing 'changeId'
    at io.temporal.internal.statemachines.VersionStateMachine.updateVersionFromEvent(VersionStateMachine.java:273)
    at io.temporal.internal.statemachines.VersionStateMachine.handleNonMatchingEvent(VersionStateMachine.java:325)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleVersionMarker(WorkflowStateMachines.java:302)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleCommandEvent(WorkflowStateMachines.java:250)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventImpl(WorkflowStateMachines.java:199)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:178)
    ... 11 common frames omitted
Affected workflows seem to eventually proceed past this error, but we wonder whether we’re doing something wrong. Can we call getVersion from timer callback methods?