A single worker, with a workflow called Foo and an activity called a0 registered. The workflow body looks like:
# the activity keeps polling some external state, so it keeps timing out and getting retried
# (a start_to_close_timeout is required by the SDK; the value here is illustrative)
await workflow.execute_activity(a0, 'x', start_to_close_timeout=timedelta(seconds=5))
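For concreteness, a minimal sketch of that setup with the Python SDK (the timeout value and the sleep standing in for the external poll are my assumptions, not the real code):

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def a0(arg: str) -> str:
    # Stand-in for polling external state: it sleeps longer than the
    # start-to-close timeout below, so each attempt times out and is retried.
    await asyncio.sleep(60)
    return arg


@workflow.defn
class Foo:
    @workflow.run
    async def run(self) -> str:
        # The whole workflow is this one activity, retried until the
        # external state is ready.
        return await workflow.execute_activity(
            a0,
            "x",
            start_to_close_timeout=timedelta(seconds=5),
        )
```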
Next, I delete the entire workflow from the code and deploy, which restarts the worker container. My expectation is that the “new” worker will poll the task queue, pick up the scheduled activity, realise it doesn’t know the event history, fetch it from the cluster, and replay it (skipping completed activities/timers/etc. as usual) to reconcile up to the current point, and only then start this activity. Had it done that, it would immediately see that the event history doesn’t match anything it knows about (in this extreme example it no longer recognises the workflow at all, since the workflow was deleted from the code).
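Roughly, the redeployed worker looks like this (the task-queue name and client address are assumptions; `a0` is the activity from the sketch above):

```python
import asyncio

from temporalio.client import Client
from temporalio.worker import Worker


async def main() -> None:
    client = await Client.connect("localhost:7233")  # assumed local cluster
    worker = Worker(
        client,
        task_queue="foo-task-queue",  # assumed task-queue name
        workflows=[],      # the Foo workflow class has been deleted from the code
        activities=[a0],   # ...but the activity function is still registered
    )
    await worker.run()


asyncio.run(main())
```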
Instead, what happens is that the worker blindly executes the activity (since the activity function is still registered with it), and only after the activity succeeds does it try to reconcile with the event history, at which point it fails with an unrecognised-workflow error (or, in other cases, with non-determinism errors if the workflow code was changed, and so on).
Is this expected? It makes it difficult to reason about resets in related cases. E.g. suppose the workflow was activity-0 -> activity-1, it was currently retrying the pending activity-1, and a new version of the workflow, activity-0 -> activity-2 -> activity-1, was deployed (sketched below). I would want the non-determinism error to pop up immediately so that I can reset the run back to the point just after activity-0, letting it perform activity-2 -> activity-1 and potentially pass, because, say, activity-2 was the bug fix. With the current behaviour, however, the worker ignores all of that, just blindly keeps trying activity-1 from the task queue, and never recovers.
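To make that concrete, here is a rough sketch of the two versions of such a workflow (the names, timeouts and stub activity bodies are mine, purely for illustration; both versions are shown in one listing only for comparison):

```python
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def activity_0() -> None: ...

@activity.defn
async def activity_1() -> None: ...

@activity.defn
async def activity_2() -> None: ...  # the bug fix


# Old version: activity-0 -> activity-1; activity-1 is currently stuck being retried.
@workflow.defn
class Bar:
    @workflow.run
    async def run(self) -> None:
        await workflow.execute_activity(
            activity_0, start_to_close_timeout=timedelta(minutes=1)
        )
        await workflow.execute_activity(
            activity_1, start_to_close_timeout=timedelta(minutes=1)
        )


# New version, deployed while activity-1 is still pending:
# activity-0 -> activity-2 -> activity-1. The hope is that replay surfaces a
# non-determinism error right away, the run gets reset to just after
# activity-0, and it then executes activity-2 -> activity-1.
@workflow.defn
class Bar:  # same workflow type, new code
    @workflow.run
    async def run(self) -> None:
        await workflow.execute_activity(
            activity_0, start_to_close_timeout=timedelta(minutes=1)
        )
        await workflow.execute_activity(
            activity_2, start_to_close_timeout=timedelta(minutes=1)
        )
        await workflow.execute_activity(
            activity_1, start_to_close_timeout=timedelta(minutes=1)
        )
```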