Consider this workflow (I’m using Python, but I don’t think that should matter, so I’m just writing pseudo-code):
do-activity-1
raise Exception('foo')
do-activity-2
So obviously, this is going to finish activity-1 and result in WorkflowTaskFailed, with the reason being that exception foo was raised. The workflow task will keep getting retried.
Now change the code to:
do-activity-1
raise Exception('bar')
do-activity-2
and redeploy the worker. From the worker logs I can see that it’s now raising the new exception, but the UI doesn’t update the event history: it remains frozen at WorkflowTaskFailed, with the reason that exception foo occurred, which is no longer accurate. This is just an example, but it makes troubleshooting by looking things up in the UI a bit difficult. It’s as if the worker were running stale code and hadn’t been updated.
Then go a step further and introduce non-determinism by changing the code to the following and redeploying the (only) worker:
do-activity-3
raise Exception('bar')
do-activity-2
Again, from the worker logs I can see it quits execution as soon as it sees the divergence (from the event history it expects activity-1 to be completed, but finds activity-3 in its place in the new code). So it immediately detects non-determinism and quits the current execution, but the UI for the workflow remains frozen with just the original error that the workflow task failed due to exception foo.
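To make the divergence concrete, here is a toy model in plain Python of how I understand the replay check. It is only a sketch, not the Temporal SDK's actual implementation: `replay`, `NonDeterminismError`, and the lists of activity names are hypothetical names made up for illustration, and the exception raised inside the workflow is left out to keep it short.

```python
class NonDeterminismError(Exception):
    """Hypothetical error: replayed code diverged from recorded history."""


def replay(history, workflow_steps):
    """Check the activities the code schedules against recorded history.

    history: names of activities already recorded, in order.
    workflow_steps: names of activities the current code schedules, in order.
    Returns the steps still left to run once history is exhausted.
    """
    for index, recorded in enumerate(history):
        if index >= len(workflow_steps):
            raise NonDeterminismError(
                f"history has {recorded!r} at position {index}, "
                "but the code schedules nothing there"
            )
        scheduled = workflow_steps[index]
        if scheduled != recorded:
            raise NonDeterminismError(
                f"history has {recorded!r} at position {index}, "
                f"but the code schedules {scheduled!r}"
            )
    return workflow_steps[len(history):]


# Recorded history after the first run: only activity-1 completed.
history = ["activity-1"]

original_code = ["activity-1", "activity-2"]
changed_code = ["activity-3", "activity-2"]

# The original code replays cleanly and only activity-2 is left to run.
remaining = replay(history, original_code)

# The changed code diverges at the very first recorded event.
try:
    replay(history, changed_code)
    diverged = False
    message = ""
except NonDeterminismError as err:
    diverged = True
    message = str(err)
```

In this model the changed code fails before any new work is scheduled, which matches what I see in the worker logs: the mismatch at activity-1 vs activity-3 is detected immediately.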
Of course, if you now revert to the first snippet and remove the exception, the worker successfully completes the workflow, finishing activity-2 too, and the UI updates with all of that and finally shows the workflow as completed.
However, in the meantime the lack of updates makes troubleshooting a bit difficult. Is this expected behaviour?