Workflow Versioning - Unknown DecisionId

We’ve run into the same issue twice now with workflow versioning and have not been able to figure out why it happens. This is the error we receive:

Unknown DecisionId{decisionTarget=ACTIVITY, decisionEventId=24}. The possible causes are a nondeterministic workflow definition code or an incompatible change in the workflow definition.

This happens when we deploy workers with new workflow code. We have been using Workflow.getVersion(...) successfully most of the time, but sometimes it causes most of our workflows to fail with the above error. I have read the documentation here, and it seems like we are doing everything correctly.

The one thing I noticed is that each time this has happened, we deleted a previous, unused getVersion call and added a new one in the same deployment. The previous call was left over from an update to the workflow a couple of weeks earlier, and we had just not swept it yet. By then, no branching logic used the version ID returned from that previous call.

Could this be happening because we are sweeping one getVersion call while adding another in the same update? Is it a best practice to do those separately? From the documentation above:

This usually means workflow history is corrupted due to some bug. For example, the same activity can be scheduled and differentiated by activityID. So ActivityIDs for different activities are supposed to be unique in workflow history. If however we have an ActivityID collision, replay will run into this error.

I can see why adding a new activity without the getVersion call would result in an ActivityId collision when replaying. However, could removing the getVersion call result in a collision as well?

Example of previous code:

activity1();
activity2();
Workflow.getVersion("change-id-1", 1, 1);
activity3();
activity4();
...

New code:

final int versionId2 = Workflow.getVersion("change-id-2", Workflow.DEFAULT_VERSION, 1);
if (versionId2 != Workflow.DEFAULT_VERSION) {
    activity0();
}
activity1();
activity2();
if (versionId2 == Workflow.DEFAULT_VERSION) {
    activity3();
} else {
    activity35();
}
activity4();
...
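
To make the replay reasoning above concrete, here is a toy model in plain Java. It is a sketch under simplifying assumptions, not the Cadence implementation: history is reduced to an ordered list of activity names, and ReplaySketch, OLD_HISTORY, newCodeUnguarded, newCodeGuarded and firstMismatch are hypothetical names invented for illustration. It only covers the case already understood in this post (adding an activity with and without a version guard). It says nothing about the effect of removing a getVersion call, which is the open question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReplaySketch {
    // Toy stand-in for "no version marker was recorded in history" (assumption).
    static final int DEFAULT_VERSION = -1;

    // Activities recorded by a run that executed the previous code.
    static final List<String> OLD_HISTORY =
            Arrays.asList("activity1", "activity2", "activity3", "activity4");

    // New code with no version guard: every execution, including replay,
    // schedules activity0 first and activity35 instead of activity3.
    static List<String> newCodeUnguarded() {
        return Arrays.asList("activity0", "activity1", "activity2", "activity35", "activity4");
    }

    // New code guarded by a version value, mirroring the branching in the post.
    static List<String> newCodeGuarded(int version) {
        List<String> scheduled = new ArrayList<>();
        if (version != DEFAULT_VERSION) {
            scheduled.add("activity0");
        }
        scheduled.add("activity1");
        scheduled.add("activity2");
        scheduled.add(version == DEFAULT_VERSION ? "activity3" : "activity35");
        scheduled.add("activity4");
        return scheduled;
    }

    // Returns the index of the first recorded activity that the replayed code
    // does not reproduce, or -1 if the code matches the whole history.
    static int firstMismatch(List<String> history, List<String> replayed) {
        for (int i = 0; i < history.size(); i++) {
            if (i >= replayed.size() || !history.get(i).equals(replayed.get(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Unguarded new code diverges from the old history.
        System.out.println(firstMismatch(OLD_HISTORY, newCodeUnguarded()));
        // Guarded new code replaying an old run takes the default branch.
        System.out.println(firstMismatch(OLD_HISTORY, newCodeGuarded(DEFAULT_VERSION)));
        // A fresh run on the new version takes the new branch.
        System.out.println(newCodeGuarded(1));
    }
}
```

In this model, the unguarded change diverges from the recorded history at the very first activity, while the guarded change reproduces the old history when the version resolves to the default value.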

What SDK version are you using?

We are using the Java client version 2.7.8.

So it is the Cadence Java Client.
I’m not able to reproduce the problem using the Temporal Java SDK. I believe that Cadence had bugs around removing getVersion calls that were fixed as part of the Temporal SDK internals rewrite.

Sorry, I should have specified that in the original post. Do you recall what the bugs were? Would it be better to post in the Cadence Slack?

AFAIK there were a few edge cases around adding and removing getVersion calls. I don’t think we fixed them in the legacy Cadence SDK.

Would it be better to post in the Cadence Slack?

Yes, at this point the Cadence team is the only one who can help with this issue.

Got it, thank you Maxim!