Unexpected error causing Out of Memory while running lifecycle workflows

Hi Team,

I have a set of workflows that are designed to live for a long time, until they expire. The workflow logic is placed inside a while loop that continues/retries in case of exceptions or errors.

When we deploy new code changes to the workflow, the lifecycle workflows that were already created fail with the error below. This results in infinite retries that pile up history events, and the worker eventually runs out of memory.

Caused by: io.temporal.internal.replay.NonDeterminisicWorkflowError: Unknown CommandId{commandTarget=TIMER, commandEventId=17}. The possible causes are a nondeterministic workflow definition code or an incompatible change in the workflow definition.
at io.temporal.internal.replay.CommandHelper.getCommand(CommandHelper.java:698)
at io.temporal.internal.replay.CommandHelper.handleTimerStarted(CommandHelper.java:379)
at io.temporal.internal.replay.ReplayWorkflowExecutor.processEvent(ReplayWorkflowExecutor.java:248)
at io.temporal.internal.replay.ReplayWorkflowExecutor.handleWorkflowTaskImpl(ReplayWorkflowExecutor.java:474)
at io.temporal.internal.replay.ReplayWorkflowExecutor.handleWorkflowTask(ReplayWorkflowExecutor.java:403)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:168)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowTaskHandler.java:145)
at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:104)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:301)
at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:273)
at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)

At that point the worker processes only retries, and no new requests are processed by Temporal until we manually terminate the workflows and continue as new, which loses the internal state of the workflow.

We are not able to accomplish our business case with the design described above. Please advise.

Any code changes to the workflow that change the order of operations should be protected by versioning. See the Workflow.getVersion API; its documentation explains how it should be used.
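As a rough illustration of the versioning advice above, here is a minimal sketch using the Temporal Java SDK. The interfaces, the activity, and the change id "addGracePeriodTimer" are hypothetical names made up for this example, not anything from the original workflow. It assumes the code change being deployed is a newly added timer:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

// Hypothetical activity interface, used only for illustration.
@ActivityInterface
interface LifecycleActivities {
  void notifyExpiry(String entityId);
}

// Hypothetical workflow interface, used only for illustration.
@WorkflowInterface
interface LifecycleWorkflow {
  @WorkflowMethod
  void run(String entityId);
}

class LifecycleWorkflowImpl implements LifecycleWorkflow {
  private final LifecycleActivities activities =
      Workflow.newActivityStub(
          LifecycleActivities.class,
          ActivityOptions.newBuilder().setStartToCloseTimeout(Duration.ofMinutes(1)).build());

  @Override
  public void run(String entityId) {
    // "addGracePeriodTimer" is a made-up change id. When replaying history that was
    // recorded before this change, getVersion returns DEFAULT_VERSION. When this point
    // is reached for the first time, it records version 1 in the history and returns 1.
    int version = Workflow.getVersion("addGracePeriodTimer", Workflow.DEFAULT_VERSION, 1);
    if (version != Workflow.DEFAULT_VERSION) {
      // New behavior: a timer that the originally deployed code did not create.
      Workflow.sleep(Duration.ofHours(1));
    }
    // Behavior shared by both versions.
    activities.notifyExpiry(entityId);
  }
}
```

With a guard like this, executions whose history predates the change keep following the old path during replay, so the new timer should not show up as an unknown command for them.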

Also, make sure to call continue as new periodically to limit workflow execution history growth. See the periodic workflow sample.
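A minimal sketch of the periodic pattern, assuming the state that has to survive can be passed as workflow arguments. The interface, the iteration limit of 100, and the 10-minute sleep are illustrative assumptions rather than values taken from the periodic workflow sample:

```java
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

// Hypothetical workflow interface, used only for illustration.
@WorkflowInterface
interface PeriodicLifecycleWorkflow {
  @WorkflowMethod
  void run(String entityId, long expiryEpochMillis);
}

class PeriodicLifecycleWorkflowImpl implements PeriodicLifecycleWorkflow {
  // Assumed limit; pick a value that keeps a single run's history small.
  private static final int MAX_ITERATIONS_PER_RUN = 100;

  @Override
  public void run(String entityId, long expiryEpochMillis) {
    for (int i = 0; i < MAX_ITERATIONS_PER_RUN; i++) {
      // Workflow.currentTimeMillis() is the deterministic, replay-safe clock.
      if (Workflow.currentTimeMillis() >= expiryEpochMillis) {
        return; // The lifecycle has expired, so the workflow completes.
      }
      // Per-iteration business logic would go here.
      Workflow.sleep(Duration.ofMinutes(10));
    }
    // Close this run and start a fresh one with an empty history, carrying the
    // state forward as arguments so it is not lost.
    Workflow.continueAsNew(entityId, expiryEpochMillis);
  }
}
```

Because the new run begins with an empty history, anything the workflow needs to remember has to be handed over through the arguments. In this sketch that is only the entity id and the expiry time.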

In the event of a programming error like this one, which caused nondeterministic behaviour, what would be a reasonable way of handling it?
Workflow.retry vs. workflow continue as new? Or is it best to let such workflows fail/cancel/terminate and possibly resubmit new ones?

By design, nondeterministic behavior blocks workflow execution. So the simplest solution is to revert the code deployment to the previous version, which unblocks all the blocked workflows.