Not showing Non-Determinism error because worker blindly runs the activity without event-history reconciliation

A single worker, with a workflow called Foo and an activity called a0 registered. The workflow looks like:

# activity keeps polling some external state, so it keeps start-to-close timing out and getting retried
await workflow.execute_activity(a0, 'x', start_to_close_timeout=timedelta(seconds=5))

Next, delete the entire workflow from the code and deploy, causing the worker container to restart. My expectation is that the “new” worker will poll the task queue, get the scheduled activity, realise it doesn’t know the event history, ask the cluster for it, and replay it (skipping all the completed activities/timers/etc. as usual) to reconcile up to the end before starting this activity. If it did that, it would immediately know that the event history is totally different from what it knows about the workflow (in this extreme example it no longer even recognises the workflow, as it was deleted from the code).

Instead, what happens is that it blindly executes the given activity (since the activity functions are still registered with it), and only if this activity succeeds does it try to reconcile with the event history, at which point it fails with an unrecognised-workflow error (or, in other cases, non-determinism errors if the code was changed, and so on).

Is this expected? It makes it difficult to reason about resets in other related cases. E.g. if the workflow was activity-0 -> activity-1 and it was currently retrying the pending activity-1, and a new workflow with activity-0 -> activity-2 -> activity-1 was deployed, I would want the non-determinism error to pop up immediately so that I could then reset it back to the point just after activity-0, letting it perform activity-2 -> activity-1 and potentially pass, because say activity-2 was the bug fix. However, with the current behaviour it will ignore all that, just blindly retry activity-1 from the task queue, and never recover.
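To make my expectation concrete, here’s a toy sketch in plain Python (not the Temporal SDK; all names are made up) of the reconciliation I imagined: replay compares the commands the current code would issue against the recorded history and flags the mismatch immediately, before running anything.

```python
# Toy model of event-history reconciliation (NOT Temporal's implementation).
# Recorded history: the workflow scheduled activity-0 then activity-1.
# Newly deployed code: activity-0 -> activity-2 -> activity-1.

class NonDeterminismError(Exception):
    pass

def reconcile(history, new_code_commands):
    """Compare recorded commands against what the new code would emit."""
    for recorded, emitted in zip(history, new_code_commands):
        if recorded != emitted:
            raise NonDeterminismError(
                f"history has {recorded!r} but code now emits {emitted!r}"
            )

history = ["schedule:activity-0", "schedule:activity-1"]
new_code = ["schedule:activity-0", "schedule:activity-2", "schedule:activity-1"]

try:
    reconcile(history, new_code)
except NonDeterminismError as e:
    print(e)  # mismatch detected at the second command, before any execution
```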

Hi @ustulation

Activity execution doesn’t need the event history or workflow context; it only needs the activity input, which is recorded in the ActivityTaskScheduled event.
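In other words, the activity worker receives something like the following (a made-up, simplified shape; the real ActivityTaskScheduled event attributes carry more fields such as workflow id, timeouts, and headers):

```python
# Hypothetical, simplified view of what an activity worker needs to run a task.
# The real Temporal task carries more metadata; this is illustrative only.
activity_task = {
    "activity_type": "a0",  # which registered function to run
    "input": ["x"],         # the arguments recorded in ActivityTaskScheduled
}

def run_activity(task, registry):
    # Look up the registered activity function and invoke it with the
    # recorded input -- no workflow history needed at all.
    fn = registry[task["activity_type"]]
    return fn(*task["input"])

result = run_activity(activity_task, {"a0": lambda s: s.upper()})
print(result)  # X
```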

Next delete the entire workflow from code and deploy causing the worker container to restart.

At which point of the execution did you change the code and restart the worker?

Instead, what happens is that it blindly executes the given activity (since the activity functions are still registered with it), and only if this activity succeeds does it try to reconcile with the event history, at which point it fails with an unrecognised-workflow error (or, in other cases, non-determinism errors if the code was changed, and so on).
Is this expected?

I think so. After the activity completes, the server will put a WorkflowTaskScheduled event in the history, which will make the worker/SDK run the workflow code. If the worker does not have the execution in cache, it will need to replay the history before continuing, and if the code is non-deterministic you will get an NDE.

E.g. if the workflow was activity-0 -> activity-1 and it was currently retrying the pending activity-1, and a new workflow with activity-0 -> activity-2 -> activity-1 was deployed, I would want the non-determinism error to pop up immediately so that I could then reset it back to the point just after activity-0, letting it perform activity-2 -> activity-1 and potentially pass, because say activity-2 was the bug fix. However, with the current behaviour it will ignore all that, just blindly retry activity-1 from the task queue, and never recover.

In this case, you will get the NDE after activity-1 completes/fails, for the same reason I mentioned above.

If you need to stop the currently running activities before resetting the workflow, you can heartbeat from your activities and issue a cancellation request. There is an example here: Cancellation - Python SDK feature guide | Temporal Documentation
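As a rough sketch of that pattern in plain Python (NOT the Temporal SDK; in a real activity you would use the SDK’s heartbeat call and let it deliver cancellation, as the linked guide shows), the idea is a polling loop that heartbeats on each iteration and bails out when cancellation is requested:

```python
import threading
import time

# Generic sketch of a heartbeating, cancellable polling loop. In Temporal,
# heartbeating is what gives the server a chance to deliver the cancellation.
def polling_activity(done_check, cancelled: threading.Event, heartbeats: list):
    while not done_check():
        heartbeats.append(time.monotonic())  # "heartbeat": report liveness
        if cancelled.is_set():               # cancellation noticed on heartbeat
            return "cancelled"
        time.sleep(0.01)                     # poll interval
    return "done"

cancelled = threading.Event()
cancelled.set()  # simulate the server requesting cancellation
print(polling_activity(lambda: False, cancelled, []))  # cancelled
```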

Antonio

@antonio.perez ta! OK so it’s an expected behaviour then.

I just started with Temporal yesterday, so I read a few docs, watched some videos, and thought that every time the worker restarts (so, clean cache) it would ask for the entire event history, irrespective of whether it plucked an activity or a workflow task from the queue. It seems it only does that for the latter, not the former. Is there any rationale for this? I think reconciling with the event history no matter what (when you don’t have it cached for the given activity/workflow task) is better and more intuitive. Then, in the case above, it would immediately error out with a non-determinism error: either because the entire workflow was gone, so there’s no point executing a pending activity of that workflow only to realise afterwards that things are broken; or, in the next example, because activity-2 was introduced, so it should flag non-determinism immediately instead of first finishing activity-1 (which may take a long time anyway) and only then flagging it. The worker has restarted anyway, and so will the pending activity-1, so there’s no need to actually execute it before reconciling with the event history, because an activity outside of a workflow (i.e. seen in isolation) is, IMO, meaningless for a “workflow” orchestrator.

At which point of the execution did you change the code and restart the worker?

So in the first example I gave, the workflow has one activity, a0, which is being executed but doesn’t finish, as it’s just polling for something external (say a file) that is not yet available. It just keeps start-to-close timing out and getting retried. At this point I changed the code to delete the workflow and its registration with the worker, and redeployed the container (causing it to restart and lose all its cache, etc.). I expected that the “new” (and only) worker would realise there’s a pending activity to execute but that it knows nothing about its workflow’s history, so it would ask for it, try to reconcile, and immediately throw an unknown-workflow error. Instead it executes the activity, and only once that activity completes successfully (e.g. once I make the external file the activity is polling for available) does it try to make sense of the event history and then throw an error.

It’s OK if that’s how it’s designed; I just got a different impression reading the docs (and that impression made more sense to me, for the reasons mentioned above).

it would ask for the entire event history, irrespective of whether it plucked an activity or a workflow task from the queue.

That is impossible, as an activity can be executed on a worker that doesn’t know about the workflow that invoked it.

Also, it would be a performance hit. For example, this activity could take hours to execute. Delivering history and replaying it just to perform the determinism check on each activity retry is not practical.

I would recommend investing in testing infrastructure to avoid situations when nondeterministic changes are deployed to production.

That is impossible, as an activity can be executed on a worker who doesn’t know about the workflow that invoked that activity.

What’s the use of this, though? It’ll need the history as soon as it’s done with the activity anyway. At that point it’ll fail, and a worker that knows about the workflow will need to pick it up. Why not just have the worker that knows about the workflow pick it up in the first place? Why is it impossible? As I said, an activity outside the context of a workflow doesn’t sound meaningful anyway; wouldn’t you agree?

Also, it would be a performance hit. For example, this activity could take hours to execute. Delivering history and replaying it just to perform the determinism check on each activity retry is not practical.

This I understand, and I had it at the back of my mind. But how bad a hit is it? The replay should be fairly quick (mostly reconciliation) and, in the vast majority of cases (not 100% sure, but IMO), a tiny/insignificant fraction of the activity’s duration. Workflow tasks seem to have a very low tolerance for delays anyway (the UI starts complaining if the worker takes more than 2 seconds, saying the worker seems to have blocked the async loop when using Python). Do we frequently encounter workflow tasks that don’t mostly complete in milliseconds?
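With made-up numbers (both figures are assumptions for illustration, not measurements of Temporal), the fraction I have in mind looks like this:

```python
# Back-of-envelope: replay/reconciliation cost relative to a long-polling
# activity's duration. Both numbers are assumptions, not measurements.
replay_ms = 50                  # assumed time to replay a short history
activity_ms = 60 * 60 * 1000    # assumed 1-hour polling activity
overhead = replay_ms / activity_ms
print(f"{overhead:.4%}")        # a tiny fraction of the activity's duration
```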

And you don’t have to do it on each retry. You do it on a retry only when you have lost your cache, which happens when a new deployment is done, which in turn is a signal that it’s a good time to look at what’s new or changed, because that’s the point of a deployment anyway: it’s a conscious decision on the part of whoever deploys the changeset.

I would recommend investing in testing infrastructure to avoid situations when nondeterministic changes are deployed to production.

That was not my point. There’s no non-determinism from a dev’s perspective; I’m using the term as Temporal coins it. When I uploaded the new code in the example above, it might very well have been deliberate, because an activity was missed and needs to be plugged in and executed before the activity currently being retried, as that’s “part of the fix”; otherwise the current activity never succeeds. Once deployed, I would want the UI to shout back at me about all the “non-determinism” failures (assuming that workflow is being run under thousands of workflow IDs) so that I can reset them all to appropriate points, instead of hunting for them after the deployment.

In any case, all of those would fail anyway if they did manage to complete the current activity. Why not just fail on the next retry of the activity, given that a new deployment was done and all workers have had their cache reset (so it’s a good time to ask for history)? Or have a feature switch that lets devs make that choice.


Probably needless to say, but I’m just brainstorming given my limited understanding of Temporal and its design decisions, just in case it comes off otherwise.

Imagine a workflow that starts 500 activities in parallel, executing on 100 different machines. Do you want to keep the workflow loaded on each of these machines?

I personally don’t support the approach of “deploying breaking changes without thinking and then fixing the breakage.” I would rather devise a way to make all the changes so that nothing is broken in production.

Right, I got a chance to play with Temporal a bit more over the weekend, and I think I understand better now. What I thought was happening was that every worker that runs an activity would, after running it, need to fetch the event history for reconciliation before running the next activity, which is why I was confused about why it doesn’t just ask for the history straight away instead.

But now I see that the workflow event history is actually held by only one worker while the others execute activities. The event history (and workflow tasks) are delivered via a separate task queue, distinct from the main queue, until that worker crashes or doesn’t respond to a scheduled workflow task. So from the logs I can see that even though I have 5 parallel workers executing many activities, there’s only one worker dealing with the event history and workflow tasks at any given time.
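To check my understanding, here’s a toy model of that split (my own sketch in plain Python, not Temporal’s actual dispatch logic): workflow tasks keep going to the one worker holding the cached history, while activity tasks fan out across all workers.

```python
# Toy model of what the logs suggest: one worker holds the cached history and
# gets the workflow tasks; activity tasks are spread over all workers.
from itertools import cycle

workers = ["w1", "w2", "w3", "w4", "w5"]
history_holder = "w1"         # the single worker with the cached history
activity_rr = cycle(workers)  # activities can land on any worker

def dispatch(task_type):
    if task_type == "workflow":   # workflow tasks go to the cache holder
        return history_holder
    return next(activity_rr)      # activity tasks are load-balanced

assignments = [dispatch(t) for t in
               ["activity", "activity", "workflow", "activity", "workflow"]]
print(assignments)  # both workflow tasks land on w1
```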

If that’s true, then I agree that giving activity-workers the event history would be an overhead.

I have also come across another “hidden” UI, switched on by the “labs-on” button at the bottom left. It appears to address some of the update-and-visibility problems I was struggling with in Temporal, in that it shows an auto-updating timeline graph with all the activities (successful, pending, erroring, etc.) bundled in with different colours.

With these I think I should be good for now. The only thing I currently miss is a Dagster-like flow visualisation in Temporal (the whole “flowchart” drawn up front, and then, as the workflow progresses, branches that were taken coloured green and those skipped coloured grey, etc.). But then, Temporal has no idea about the end-to-end workflow graph and only learns about it as it processes it, so (unless there are still things I don’t know, or other UI buttons I’ve missed) I don’t think there’s a way, or ever will be, to have such a UI. Anyway, this thread wasn’t about that, so OK to ignore this.