Recover From Non-Deterministic Failure With Minimum Downtime


We’re trying to represent an object lifecycle with a long-running workflow that can stay running for months.

We want to handle the case where a non-deterministic error occurs because a change was added to the workflow without proper versioning. We accept that this will probably happen at some point no matter how careful we are, but we cannot accept losing days of progress by having to rerun the workflow.

We looked into the workflow reset option. Suppose we merged a breaking change and a non-deterministic error occurred. We could reset the workflow to a point before the change was introduced, but if the change sits at the beginning of the workflow, the reset won’t help much because we would still be rerunning days’ worth of progress.

Worker versioning doesn’t work for us either.

Another option we looked into: once the non-deterministic issue is identified, we could fix our code and add the versioning branches that should have been there from the start. This, however, could mean production is down for at least a day while multiple workflows hang, which is something we cannot afford.
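For readers unfamiliar with versioning branches: in the Java SDK they are written with `Workflow.getVersion(changeId, minSupported, maxSupported)`. The toy model below is plain Java with no Temporal dependency and only imitates that behavior in simplified form. The class name, the `add-audit-step` change id, and the step names are all hypothetical. It shows why an old history without a marker takes the old branch while a fresh execution takes the new one.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a versioning branch; NOT the Temporal API, just an imitation of its idea.
final class VersionBranchSketch {
    static final int DEFAULT_VERSION = -1; // same value as the SDK's Workflow.DEFAULT_VERSION

    private final Map<String, Integer> markers; // version markers found in the "history"
    private final boolean replaying;            // true when re-running code against old history

    VersionBranchSketch(Map<String, Integer> recordedMarkers, boolean replaying) {
        this.markers = new HashMap<>(recordedMarkers);
        this.replaying = replaying;
    }

    // Simplified: real histories are positional, this toy only keys markers by change id.
    int getVersion(String changeId, int minSupported, int maxSupported) {
        Integer recorded = markers.get(changeId);
        int version;
        if (recorded != null) {
            version = recorded;            // replay sees the version chosen originally
        } else if (replaying) {
            version = DEFAULT_VERSION;     // history predates the change
        } else {
            version = maxSupported;        // fresh execution records the newest version
            markers.put(changeId, version);
        }
        if (version < minSupported || version > maxSupported) {
            throw new IllegalStateException("unsupported version " + version);
        }
        return version;
    }

    // Hypothetical workflow start: the breaking change added an "audit" step up front.
    String firstStep() {
        int v = getVersion("add-audit-step", DEFAULT_VERSION, 1);
        return v == DEFAULT_VERSION ? "validate" : "audit";
    }

    public static void main(String[] args) {
        VersionBranchSketch oldRun = new VersionBranchSketch(new HashMap<>(), true);
        VersionBranchSketch newRun = new VersionBranchSketch(new HashMap<>(), false);
        System.out.println(oldRun.firstStep()); // validate
        System.out.println(newRun.firstStep()); // audit
    }
}
```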

Our proposed solution is to use Continue-As-New and rerun the whole workflow without looking at history. In addition, we’ll store extra information in the database about the workflow’s progress and add a new “Step 0” at the beginning of the workflow. This step will run before the business logic is rerun from scratch. It will fetch the data we stored, work out where the workflow stopped, and jump to that stage through extra logic. (This means we’re reinventing versioning ourselves, in addition to storing extra data in the database.)
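The “Step 0” idea can be sketched in plain Java, with no Temporal calls. Everything here is hypothetical: the `Stage` enum, the in-memory map standing in for the database, and the method names are for illustration only. In a real workflow, reading and writing the checkpoint would presumably go through activities, since workflow code should not perform I/O directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of checkpoint-based resume; not Temporal API.
final class StepZeroSketch {
    // Hypothetical lifecycle stages, in execution order.
    enum Stage { CREATED, VALIDATED, PROVISIONED, ACTIVE, DONE }

    // Stand-in for the database table holding the last completed stage per object.
    private final Map<String, Stage> checkpoints = new HashMap<>();

    void checkpoint(String objectId, Stage stage) {
        checkpoints.put(objectId, stage);
    }

    // "Step 0": everything after the last checkpoint still has to run.
    List<Stage> remainingStages(String objectId) {
        Stage last = checkpoints.get(objectId);
        int start = (last == null) ? 0 : last.ordinal() + 1;
        Stage[] all = Stage.values();
        return new ArrayList<>(Arrays.asList(all).subList(start, all.length));
    }

    // Runs the remaining stages, saving a checkpoint after each; returns what was executed.
    List<Stage> run(String objectId) {
        List<Stage> executed = new ArrayList<>();
        for (Stage stage : remainingStages(objectId)) {
            // Business logic for the stage would go here.
            checkpoint(objectId, stage);
            executed.add(stage);
        }
        return executed;
    }

    public static void main(String[] args) {
        StepZeroSketch sketch = new StepZeroSketch();
        sketch.checkpoint("object-1", Stage.VALIDATED); // progress saved before the restart
        System.out.println(sketch.run("object-1"));     // [PROVISIONED, ACTIVE, DONE]
    }
}
```

Note that the checkpoint write and the stage's side effects are not atomic in this sketch, so each stage would need to be idempotent for a restart between the two to be safe.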

Is there a Temporal feature we missed that could solve our issue without having to reinvent versioning ourselves?

Do you test your changes to workflow code using the workflow replayer? It should give you a pretty good indication of whether your changes would break determinism. You can also make it part of your test suite to catch code that is already non-deterministic and could trigger an NDE in certain situations even without any changes.
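To illustrate what a replay test checks, here is a toy model in plain Java. It does not use the Temporal SDK; the `ReplayCheckSketch` class and its command strings are made up. The idea is that a recorded history lists the commands a workflow issued, and new code replayed against it must produce the same commands in the same order. In the Java SDK the actual tool for this is `WorkflowReplayer` in `io.temporal.testing`, which replays a recorded history against your workflow implementation classes.

```java
import java.util.List;

// Toy model of a replay determinism check; NOT the Temporal replayer.
final class ReplayCheckSketch {
    // Returns the index of the first command that diverges from the recorded history,
    // or -1 if the replay is consistent with it.
    static int firstDivergence(List<String> recorded, List<String> replayed) {
        int common = Math.min(recorded.size(), replayed.size());
        for (int i = 0; i < common; i++) {
            if (!recorded.get(i).equals(replayed.get(i))) {
                return i;
            }
        }
        // New code may continue past the recorded history (the workflow is still running),
        // but issuing fewer commands than were recorded is also a mismatch.
        return replayed.size() < recorded.size() ? common : -1;
    }

    public static void main(String[] args) {
        // Hypothetical history of a workflow that is still in progress.
        List<String> history = List.of("activity:validate", "timer:1d");
        // Unchanged code reproduces the history, then continues.
        List<String> sameCode = List.of("activity:validate", "timer:1d", "activity:activate");
        // A change inserted a new step at the start without a versioning branch.
        List<String> changedCode = List.of("activity:audit", "activity:validate", "timer:1d");

        System.out.println(firstDivergence(history, sameCode));    // -1
        System.out.println(firstDivergence(history, changedCode)); // 0
    }
}
```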

Could you explain why detecting the issue and fixing it would take so much longer than detecting it and then trying to reset? Either way you would have to restart your workers to apply the fix.

Also, which Temporal SDK are you using?

Hello, thank you for the reply.

Running tests for code changes using the replayer was something we considered. However, we’re not very confident it can detect every case that can go wrong, which is why we are trying to come up with a design that avoids non-deterministic errors altogether.

Our approach is not to detect and reset. What we’re trying to do is write the workflow from the start so that we manage replaying workflow “steps” ourselves rather than relying on Temporal to do it for us. Once a new worker is up and running, existing workflows that Temporal would normally replay instead disregard their history and restart. However, rather than having the restart re-execute all the steps that were completed before the worker went down, we rely on progress information stored in the database and let our code skip the steps that already ran. This entails storing extra information in the database, in addition to adding extra code.

Could you advise on the right practices for writing replayer-based tests that avoid non-deterministic errors?
Our main concern is that we’re not very confident it will detect everything that can go wrong in running workflows, and we cannot afford that, which is why we’re trying to avoid the problem altogether. Maybe if we use it right, we can be sure to avoid the issue and avoid reinventing what Temporal already provides. Note that we have multiple environments (dev, staging, and production), and a change that causes no non-deterministic error for any existing workflow on dev might still break production.

We’re also using the Java SDK.