Trouble with non-determinism

I’m having some trouble understanding why I am receiving workflow panics for non-determinism and hoping someone can help. Here is the situation:

The existing code executes a workflow with three activities (A, B, C).

I wanted to make a change to add a workflow sleep in activity A. To do that, I used this code:

	vers := workflow.GetVersion(ctx, "sleep1", workflow.DefaultVersion, 1)

	if vers != workflow.DefaultVersion {
		if errSleep := workflow.Sleep(ctx, 2*time.Second); errSleep != nil {
		logger.Errorf("unable to sleep workflow: %v", errSleep)
		}
	}

This works fine for new workflows.
But what I am seeing is that for existing running workflows that are in a retry state on activity C, I am getting the workflow panic for non-deterministic behavior.

I looked at the docs and it does say that a change in sleep duration can cause this.

So my question is - how can activity C cause a panic when the code to do the sleep is in Activity A and not even executed since the workflow is in Activity C? Also, the code is properly wrapped with versioning (I think) so there should not be a panic. Hoping someone knows the answer as we need to use versioning more going forward. Thank you.

Can you share the workflow history for this execution please?
tctl wf show -w <wfid> -r <runid> --of myhistory.json (and share the json)

"how can activity C cause a panic when the code to do the sleep is in Activity A"

When you restarted your workers, their in-memory execution cache was cleared. To continue your execution, the workers need to fetch the full workflow history recorded so far and replay your workflow code against it. See this forum post and this video for more info.
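To make the replay idea concrete, here is a minimal sketch in plain Go. It is a toy model, not the SDK's implementation: the workflowV1, workflowV2 and replay names are made up for illustration, and real histories contain many more event types. It shows why an execution that is already retrying activity C can still trip over a timer added before activity A.

```go
package main

import "fmt"

// workflowV1 is the original (hypothetical) workflow: three activities, no timer.
func workflowV1() []string {
	return []string{"ActivityA", "ActivityB", "ActivityC"}
}

// workflowV2 adds an unconditional timer before activity A.
func workflowV2() []string {
	return []string{"Timer", "ActivityA", "ActivityB", "ActivityC"}
}

// replay walks the recorded history from the start and checks that the
// current code produces the same step at each position.
func replay(history, commands []string) error {
	for i, recorded := range history {
		got := "nothing"
		if i < len(commands) {
			got = commands[i]
		}
		if got != recorded {
			return fmt.Errorf("non-deterministic: history step %d is %s, code produced %s",
				i+1, recorded, got)
		}
	}
	return nil
}

func main() {
	// An execution that started on V1 and is now retrying activity C.
	history := []string{"ActivityA", "ActivityB", "ActivityC"}

	fmt.Println(replay(history, workflowV1())) // replays cleanly
	fmt.Println(replay(history, workflowV2())) // mismatch at the very first step
}
```

The mismatch is found at the first step of the history, long before the replay gets anywhere near activity C, because replay always starts from the beginning of the workflow function.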

Thanks for this. I had reviewed both of those items before writing the code. In fact, the code largely came from the article. I do understand that Temporal has to re-create the workflow history when it starts again. But I have a check for the version in there so it should not have tried to do the sleep/timer function.

In the attached log, the workflow started a few days ago, and the beginning activities worked fine. The last step is to deliver a webhook, which failed. We retry that 7 times over a few days. It's in this retry that it had the panic. The code was moved in during that span of time, but that's what we're trying to prove: that the currently running workflows will not panic, which is why we have the version check in there.

Here is the JSON, I just picked a random one that was panicking as there are many of these.
Thanks

Could you share your full workflow code please (DM me if you cannot share it in this post)?

Is the code you pasted in the first post the first thing that's executed in your workflow function?
Is your sleep time hard-coded to 2 seconds as shown or do you calculate the sleep duration somehow?

I will send the code via DM.

  1. yes, that is the first thing executed
  2. It is hard-coded to 2 seconds

hi @tihomir , any input on the code?

Yes, and I'm a bit confused. You mentioned that you are adding the workflow.Sleep via versioning, and I assume that the workflow history you shared is from one of the executions that was already running before you added that change. Can you please confirm?

Looking at the workflow history you have:

"eventId": "5",
"eventType": "TimerStarted",

which according to your code is

	vers := workflow.GetVersion(ctx, "sleep1", workflow.DefaultVersion, 1)
	if vers != workflow.DefaultVersion {
		if errSleep := workflow.Sleep(ctx, 2*time.Second); errSleep != nil {
			logger.Errorf("unable to sleep workflow: %v", errSleep)
		}
	}

And there is no MarkerRecorded event in the history.
To me this looks like the workflow code that this worker first ran (in this case “identity”: “1@internal-webhooks-7748bcc4b5-6v5zf@”) when execution was started was

workflow.Sleep(ctx, 2*time.Second);

and there was no versioning directive when it ran it.

My guess is that you might have had one worker at least that had this workflow.Sleep in code before you made the change with versioning.
From the history it seems that your workflow first ran with workflow.Sleep for 2 seconds. Then this worker might have crashed or been shut down, and the workflow execution was picked up by the worker
“identity”: “1@internal-webhooks-8596fd8967-zpsfs@”
which did have the updated code. It had to replay the workflow history against the updated code and failed, because the history was on the default version (there was no marker event to move it to version 1) but workflow.Sleep now only runs on version 1.

My guess is also that if you update your code to:

	if vers != workflow.DefaultVersion {
		if errSleep := workflow.Sleep(ctx, 2*time.Second); errSleep != nil {
			logger.Errorf("unable to sleep workflow: %v", errSleep)
		}
	} else {
		// Default version: match the timer already recorded in histories
		// that have no version marker.
		workflow.Sleep(ctx, 2*time.Second)
	}

and restart the workers, it would unblock this particular execution (the one from your shared workflow history).
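The reasoning above can be sketched with a small toy model in plain Go. Everything in it (the run type, getVersion, the event strings) is a hypothetical stand-in and only loosely mimics how workflow.GetVersion behaves on replay; it is not SDK code. It replays two kinds of history against the versioned code and against the "sleep in both branches" variant.

```go
package main

import "fmt"

const defaultVersion = -1 // stands in for workflow.DefaultVersion

// run is a toy stand-in for a workflow execution (not an SDK type).
type run struct {
	replaying bool
	history   []string // events recorded by an earlier run (replay only)
	events    []string // events the current code produces
}

// getVersion loosely mimics workflow.GetVersion: a new execution records a
// marker and gets maxSupported; a replay returns the default version unless
// the history already contains the marker.
func (r *run) getVersion(changeID string, maxSupported int) int {
	marker := fmt.Sprintf("MarkerRecorded:%s=%d", changeID, maxSupported)
	if r.replaying {
		found := false
		for _, e := range r.history {
			if e == marker {
				found = true
			}
		}
		if !found {
			return defaultVersion
		}
	}
	r.events = append(r.events, marker)
	return maxSupported
}

func (r *run) sleep()               { r.events = append(r.events, "TimerStarted") }
func (r *run) activity(name string) { r.events = append(r.events, "Activity:"+name) }

// unconditional sleeps with no versioning at all.
func unconditional(r *run) { r.sleep(); r.activity("A") }

// versioned has the shape of the code from the first post.
func versioned(r *run) {
	if r.getVersion("sleep1", 1) != defaultVersion {
		r.sleep()
	}
	r.activity("A")
}

// bothBranches is the suggested unblock: sleep on either version.
func bothBranches(r *run) {
	if r.getVersion("sleep1", 1) != defaultVersion {
		r.sleep()
	} else {
		r.sleep()
	}
	r.activity("A")
}

// execute runs wf as a brand-new execution and returns its history.
func execute(wf func(*run)) []string {
	r := &run{}
	wf(r)
	return r.events
}

// replay reports whether wf reproduces the given history, event by event.
func replay(wf func(*run), history []string) bool {
	r := &run{replaying: true, history: history}
	wf(r)
	if len(r.events) < len(history) {
		return false
	}
	for i := range history {
		if r.events[i] != history[i] {
			return false
		}
	}
	return true
}

func main() {
	timerNoMarker := execute(unconditional) // like the shared history
	noTimer := []string{"Activity:A"}       // started before any sleep existed

	fmt.Println(replay(versioned, timerNoMarker))    // false
	fmt.Println(replay(bothBranches, timerNoMarker)) // true
	fmt.Println(replay(versioned, noTimer))          // true
	fmt.Println(replay(bothBranches, noTimer))       // false
}
```

In this model, sleeping in both branches only rescues histories that already contain the timer. Executions that started before any sleep existed replay cleanly with the original versioned code but would mismatch with the both-branches variant, which is consistent with it being a fix for this particular execution rather than a general one.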

Also, I think your activity options need to be looked over, specifically that you set ScheduleToCloseTimeout and also limit retries via maximumAttempts.
If you wish, DM me and we can find some time to go over these on Zoom if you think that would be helpful. I would also watch this video on activity timeouts.

I will definitely dm you later today to see if we can go over these items. My first impression is that you are right. Here is what I now believe happened:

  1. Workflows were running with no sleep timer (Group A)
  2. I added an unconditional two-second sleep timer
  3. This caused workflows from Group A that were in a retry state to fail with non-deterministic behavior
  4. I changed the code to add versioning and only sleep if not on the default version
  5. But Group A workflows still in retry were now two versions behind, so the if statement executed the Sleep for them, continuing the non-deterministic behavior.

My main questions will be:

  1. If the above is true, then the versioning logic should work in production because we never moved the unconditional sleep to production due to the panics
  2. Workflows having non-deterministic behavior are still running and not failing well past their timeout period
  3. The activity option point you raised above

Thank you, Tihomir. I will ping you later.

For WorkflowReplayer see test here.
Workflow Check tool here.

Thanks again, Tihomir. We moved the code to production and it worked fine. Thanks for the personalized help.