Other Possible Causes of Non-Deterministic Error

Hi, I have read about some causes of non-deterministic errors (NDE) in the article "Versioning | Legacy documentation for Temporal SDKs".
I'm wondering whether there are other causes that are not listed there.

  1. I tested some scenarios and found that changing an activity's response type (struct to string, or vice versa) can cause NDE.

  2. When we have a timer, can NDE occur if we increase or decrease the timer duration? For example, a timer that had not fired under the old version might fire under the new version, or vice versa. If the timer fires, the workflow executes activity A. The selector is also waiting for a signal.

  3. Can adding, removing, or reordering workflow.UpsertSearchAttributes(ctx, attributes) cause NDE?

  4. Previously, the timer always fired at 00.00. In the new version, I added a variable TestFlag (read from an environment variable) that determines the timer duration: either until 00.00 or x minutes (also read from an environment variable). If the timer fires, the workflow executes activity A. The selector is also waiting for a signal. Can adding the flag cause NDE?

  5. I have error handling as shown below. Inside HandleActivityError(), the workflow loops and re-executes the activity based on user input (retry/skip).


    In the newer workflow version, if I disable the activity error handling (comment out line 44), can that cause NDE?

workflow.UpsertSearchAttributes(ctx, attributes) generates a command that is recorded in the workflow event history, so changing the order of these calls will cause non-determinism. We recently ran into an issue because of this.

IIUC, changing the timer value itself shouldn't cause NDE. A workflow that went to sleep before the timer duration change will honour the duration specified originally and wake up once it elapses. Workflow executions that go to sleep after the change will honour the new timer duration.

You can check the code that compares workflow commands here.

However, it would be better to get confirmation from the Temporal team on this.
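
To make the reasoning above concrete, here is a toy model in plain Go of how a replay check could behave. It is only a sketch under assumptions, not the SDK's actual code: the `Command` type and the `firstMismatch` function are hypothetical names, and the model assumes that replay compares each command's kind and identifier in order but ignores values such as the timer duration.

```go
package main

import "fmt"

// Command is a hypothetical, simplified stand-in for a workflow command and
// the history event it produced. It is not a Temporal SDK type.
type Command struct {
	Kind     string // e.g. "StartTimer", "ScheduleActivity", "UpsertSearchAttributes"
	ID       string // timer ID or activity type; compared in this model
	Duration int    // timer duration in seconds; deliberately NOT compared in this model
}

// firstMismatch returns the index of the first position where the replayed
// commands diverge from the recorded history, or -1 if they agree.
// Assumption: only Kind and ID are compared, and a missing or extra command
// counts as a mismatch.
func firstMismatch(history, replay []Command) int {
	n := len(history)
	if len(replay) > n {
		n = len(replay)
	}
	for i := 0; i < n; i++ {
		if i >= len(history) || i >= len(replay) {
			return i
		}
		if history[i].Kind != replay[i].Kind || history[i].ID != replay[i].ID {
			return i
		}
	}
	return -1
}

func main() {
	// What the old code recorded.
	history := []Command{
		{Kind: "UpsertSearchAttributes"},
		{Kind: "StartTimer", ID: "1", Duration: 60},
		{Kind: "ScheduleActivity", ID: "ActivityA"},
	}

	// New code: same order, only the timer duration differs.
	longerTimer := []Command{
		{Kind: "UpsertSearchAttributes"},
		{Kind: "StartTimer", ID: "1", Duration: 300},
		{Kind: "ScheduleActivity", ID: "ActivityA"},
	}

	// New code: the upsert was moved after the timer.
	reordered := []Command{
		{Kind: "StartTimer", ID: "1", Duration: 60},
		{Kind: "UpsertSearchAttributes"},
		{Kind: "ScheduleActivity", ID: "ActivityA"},
	}

	fmt.Println(firstMismatch(history, longerTimer)) // -1: no divergence in this model
	fmt.Println(firstMismatch(history, reordered))   // 0: diverges at the first command
}
```

In this model the changed duration goes unnoticed because it is never compared, while the reordered upsert is flagged at the first position. Whether the real SDK compares more fields than this is something to verify against the linked code.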

I see. But changing the attribute values (adding, removing, or updating key-value attributes) should be safe, right?

Yes, you can check the code for that here.

Thank you for the reference.
I have another question: is it safe to change the order of workflow.GetVersion() invocations?

After deploying version 2, an existing transaction gets a non-deterministic error.

It says ‘unknown command CommandType: Timer’

Timer 219 is a sleep operation:

I don’t understand why I’m getting this error, because I did not change the order of execution in any way that could cause NDE.