Understanding how workflow "Replay" works

Hello,
I am learning Temporal and using the Java SDK.

I am trying to understand how a workflow is “replayed” when a worker continues the workflow execution.

I read the “The SDK and Temporal Cluster relationship” section of the docs. If I got it right, after a successful execution of an activity by a worker, the results are stored, a new event is created in the task queue, and a new available worker is allocated to continue the workflow execution.

For that worker: what will be the “entry point” (in Java code) for the execution? Will it call the workflow method (i.e. the method that calls activities and glues things together)? Is it correct to assume that “replay” means that if an activity was previously executed, the SDK is smart enough to take the stored results?

I tried to verify the idea (using “Money Transfer” sample) by adding a log print to the 1st line:

@Override
public void transfer(String fromAccountId, String toAccountId, String referenceId, double amount) {
    System.out.println("Hello World!");
    account.withdraw(fromAccountId, referenceId, amount);
    account.deposit(toAccountId, referenceId, amount);
}

When running the worker in my environment, I saw only one “Hello World!” log, so it seems the method is executed only once.

Am I missing something essential? Or is this behavior specific to the development server? Can anyone provide an in-depth execution flow of the app code when replay is performed?

Thanks in advance.

In our tests, when the worker currently running the execution is terminated, the workflow is replayed from its event history.

In such cases, the replay runs on the same or another worker, and the workflow method is called again. However, any steps in this method that are already captured in the event history do not reach the actual activity or any other external Temporal operation. Replay fast-forwards through the recorded steps to the first line of Java code that is not yet captured in the event history, and execution continues from there.

For instance, if an activity is already captured in the event history, the workflow method code that calls it will just receive the output of the activity captured in the history.

If the code is changed and the execution path no longer matches the stored event history, the replay fails with an error stating that the workflow code is inconsistent with the history (a non-determinism error).
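The replay behaviour described above can be pictured with a small plain-Java toy model. This is only a sketch under simplifying assumptions, not how the Temporal SDK is implemented: `ReplaySketch`, `Context`, `ActivityEvent` and `callActivity` are hypothetical names invented for the illustration, activities run inline instead of being scheduled through the service, and a plain `IllegalStateException` stands in for the real non-determinism error.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of event-history replay; NOT how the Temporal SDK is implemented. */
class ReplaySketch {

    /** One recorded activity: its name and the result stored in the history. */
    static final class ActivityEvent {
        final String name;
        final String result;

        ActivityEvent(String name, String result) {
            this.name = name;
            this.result = result;
        }
    }

    /** Hypothetical stand-in for what workflow code uses to call activities. */
    static final class Context {
        private final List<ActivityEvent> history;
        private int position = 0; // index of the next history event to match
        final List<String> reallyExecuted = new ArrayList<>();

        Context(List<ActivityEvent> recordedHistory) {
            this.history = new ArrayList<>(recordedHistory);
        }

        String callActivity(String name) {
            if (position < history.size()) {
                ActivityEvent recorded = history.get(position++);
                if (!recorded.name.equals(name)) {
                    // The code no longer follows the path stored in the history.
                    throw new IllegalStateException(
                        "code called " + name + " but history recorded " + recorded.name);
                }
                return recorded.result; // replayed: the activity itself is not invoked
            }
            // Past the end of the history: this call is new progress.
            reallyExecuted.add(name);
            String result = name + "-done"; // placeholder for a real activity result
            history.add(new ActivityEvent(name, result));
            position++;
            return result;
        }
    }

    /** Same shape as transfer() in this thread: withdraw, then deposit. */
    static void transfer(Context ctx) {
        ctx.callActivity("withdraw");
        ctx.callActivity("deposit");
    }

    public static void main(String[] args) {
        // The history already contains a completed withdraw.
        List<ActivityEvent> history = List.of(new ActivityEvent("withdraw", "withdraw-done"));
        Context ctx = new Context(history);
        transfer(ctx); // the method runs from its first line again
        System.out.println("executed for real: " + ctx.reallyExecuted);
    }
}
```

In this toy run the method starts from its first line, `withdraw` is answered from the recorded history, and only `deposit` is actually executed. Swapping the order of the two calls in `transfer` would make the first call mismatch the history and throw.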

Hello, thanks for your help!
Unfortunately, I still have a fundamental gap to cover.

From the documentation I mentioned, it appears that when a worker executing a workflow needs to call an activity, an event is added to the task queue and the current worker is terminated. Once the activity has executed successfully, [potentially] another worker continues the workflow and, as you write, “the workflow method will be called again” [in my specific case, the workflow method is ‘transfer’].

If that’s the case, the log statement on the 1st line of the workflow method would be executed again after the 1st activity [in my case: withdraw] completes and before the 2nd activity [in my case: deposit] starts. However, in the test, I see only one log message. It looks like the workflow method [in my case: transfer] continues from the place where it stopped, rather than being called again by another worker.

What am I missing?

Thanks
Max.

Hi.
I have now learned that there is a concept of “Sticky execution”, an optimization that prefers the same worker (if available) to resume a workflow execution.
Is it correct to say that in the case of “Sticky execution” the workflow code continues from the place it was blocked, while if the worker is not the “sticky one” - the execution begins from the 1st line of the workflow method?
If so, “Sticky execution” explains why the method is called only once, so the log prints out only once.

It is highly optimized. I did a lot of debugging and validated that the workflow method thread is not killed on each activity call. It is only killed if there is a need to free resources, such as when the workflow method hits Temporal sleep for a long time or encounters other similar events.

Also, use the Temporal SLF4J logger wrapper for Java: it is replay-aware and does not emit the log statements again when the workflow is replayed.

For instance: final Logger log = Workflow.getLogger(MyWorkflowImpl.class);
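To make the idea of not logging during replay concrete, here is a minimal plain-Java sketch. It is not the SDK's logger: `ReplayAwareLogSketch` and its `replaying` probe are hypothetical names used only for illustration, and the replay flag is flipped by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Toy replay-aware logger; illustrates the idea only, not the SDK's logger. */
class ReplayAwareLogSketch {
    private final BooleanSupplier replaying;        // hypothetical "am I replaying?" probe
    final List<String> emitted = new ArrayList<>(); // what actually reached the log

    ReplayAwareLogSketch(BooleanSupplier replaying) {
        this.replaying = replaying;
    }

    void info(String message) {
        if (replaying.getAsBoolean()) {
            return; // this line was already logged during the original execution
        }
        emitted.add(message);
        System.out.println(message);
    }

    public static void main(String[] args) {
        boolean[] replayFlag = {false};
        ReplayAwareLogSketch log = new ReplayAwareLogSketch(() -> replayFlag[0]);
        log.info("Hello World!"); // first execution: printed
        replayFlag[0] = true;     // pretend the workflow is now being replayed
        log.info("Hello World!"); // same line during replay: suppressed
        System.out.println("lines emitted: " + log.emitted.size());
    }
}
```

A plain `System.out.println` has no such check, so it prints every time the line runs, including during replay.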

Hi @max001
Interesting question. I had the same question, and this is my understanding after some investigation.
As you said, this is “Sticky execution” mode. Normally, when a worker receives the activity-completed event, it re-executes the workflow from scratch and checks the status of each activity before executing it.
In sticky execution mode, however, each workflow has in-memory state that holds many things. When the worker receives the activity completion, instead of re-executing the workflow it pushes the result to a channel (I’m using Golang). You can think of it as a way to notify the workflow that the activity has completed, which is why execution appears to resume where it was blocked rather than begin from the 1st line of the workflow.

Trying to go through the whole thread here

If I got it right, after a successful execution of an activity by a worker, the results are stored, a new event is created in the task queue, and a new available worker is allocated to continue the workflow execution.

Yes, the activity worker that executes your activity code reports the activity completion/failure to the service.
To deliver this completion/failure to your workflow worker (so it can continue the workflow execution), the service has to do it via a workflow task. The service first tries to deliver this workflow task to the same worker that has been processing the execution so far, that is, to its worker-specific (or sticky) task queue. So no, it would not be a new available worker: the service tries to send the task to that specific worker, and if that worker is no longer available, it forwards the workflow task to the non-sticky task queue so it can be dispatched to any available worker.
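The routing decision just described can be sketched as a toy model in plain Java. Everything here is an assumption made for illustration: `StickyDispatchSketch`, its fields and `dispatch` are hypothetical, and a simple set of live workers stands in for however the real service decides that a sticky worker is no longer available.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Toy model of sticky-first workflow task routing; not the service's real logic. */
class StickyDispatchSketch {
    final Map<String, String> stickyWorkerByWorkflow = new HashMap<>(); // workflow id -> worker
    final Set<String> liveWorkers = new HashSet<>();                    // assumed liveness info
    final Map<String, Queue<String>> stickyQueues = new HashMap<>();    // one queue per worker
    final Queue<String> sharedQueue = new ArrayDeque<>();               // the non-sticky queue

    /** Enqueues a workflow task and returns the name of the queue it was placed on. */
    String dispatch(String workflowId) {
        String worker = stickyWorkerByWorkflow.get(workflowId);
        if (worker != null && liveWorkers.contains(worker)) {
            // Preferred path: the worker that has been processing this execution.
            stickyQueues.computeIfAbsent(worker, w -> new ArrayDeque<>()).add(workflowId);
            return "sticky:" + worker;
        }
        // Fallback: any available worker may pick the task up (and will have to replay).
        stickyWorkerByWorkflow.remove(workflowId);
        sharedQueue.add(workflowId);
        return "shared";
    }

    public static void main(String[] args) {
        StickyDispatchSketch service = new StickyDispatchSketch();
        service.liveWorkers.add("worker-1");
        service.stickyWorkerByWorkflow.put("transfer-1", "worker-1");
        System.out.println(service.dispatch("transfer-1")); // goes to worker-1's sticky queue
        service.liveWorkers.remove("worker-1");             // worker-1 goes away
        System.out.println(service.dispatch("transfer-1")); // falls back to the shared queue
    }
}
```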

For that worker: what will be the “entry point” (in Java code) for the execution? Will it call the workflow method (i.e. the method that calls activities and glues things together)? Is it correct to assume that “replay” means that if an activity was previously executed, the SDK is smart enough to take the stored results?

As you correctly mentioned, workers cache workflow executions (workflow threads) in an in-memory cache. When a workflow task is dispatched to a worker, the worker checks whether it has this workflow execution in its cache (it could have been evicted even if this worker processed the execution before). If it does, it can just apply the events in the workflow task (such as activity results) and continue processing your workflow code. If it does not, then yes, the worker needs to re-instate the execution in order to apply the new events and continue. This is what’s called event history replay, and it is also where the worker checks whether your workflow code is deterministic.
The entry point is your workflow method, i.e. the beginning of your workflow code. You are also correct that the worker runs your workflow code from the beginning and checks the commands your business logic generates (commands like schedule activity execution, start timer, …) against what is in this execution’s event history. During replay, the worker does not re-execute activities or child workflows that have already completed or failed.

When running the worker in my environment, I saw only one “Hello World!” log, so it seems the method is executed only once.

Yes, this is expected behavior if your execution starts and completes on the same worker and this worker never had to evict the execution from its cache.
Otherwise you could see your “Hello World” log multiple times. You should use the workflow logger instead of System.out, as it’s “replay-safe”, just FYI.
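Why the first line may run once or several times can be shown with a small counting sketch in plain Java. It is a toy model under stated assumptions, not SDK code: `CacheVsReplaySketch`, `handleTask` and `evict` are hypothetical names, and the cached workflow state is reduced to a single counter.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model: the first line runs only when the execution is not in the worker cache. */
class CacheVsReplaySketch {

    /** In-memory state of a cached execution, reduced to the steps applied so far. */
    static final class CachedRun {
        int appliedSteps = 0;
    }

    final Map<String, CachedRun> cache = new HashMap<>();
    int firstLineExecutions = 0; // how often the "Hello World!" line would run

    /** Handles one workflow task; historySteps is the number of steps recorded so far. */
    void handleTask(String workflowId, int historySteps) {
        CachedRun run = cache.get(workflowId);
        if (run == null) {
            // Cache miss: the workflow method starts from its first line and replays.
            run = new CachedRun();
            cache.put(workflowId, run);
            firstLineExecutions++;
        }
        // Cache hit or miss: apply the recorded events and continue from there.
        run.appliedSteps = historySteps;
    }

    void evict(String workflowId) {
        cache.remove(workflowId);
    }

    public static void main(String[] args) {
        CacheVsReplaySketch sticky = new CacheVsReplaySketch();
        sticky.handleTask("transfer-1", 0); // workflow started
        sticky.handleTask("transfer-1", 1); // withdraw completed
        sticky.handleTask("transfer-1", 2); // deposit completed
        System.out.println("cached the whole time: " + sticky.firstLineExecutions);

        CacheVsReplaySketch evicting = new CacheVsReplaySketch();
        evicting.handleTask("transfer-1", 0);
        evicting.evict("transfer-1");
        evicting.handleTask("transfer-1", 1);
        evicting.evict("transfer-1");
        evicting.handleTask("transfer-1", 2);
        System.out.println("evicted between tasks: " + evicting.firstLineExecutions);
    }
}
```

With the execution cached throughout, the counter ends at 1, which matches the single log line seen in the test. With an eviction before each later task, it ends at 3.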

Can anyone provide an in-depth execution flow of the app code when replay is performed?

Normally, when a worker receives the activity-completed event, it re-executes the workflow from scratch and checks the status of each activity before executing it.

As mentioned, it does that only if it has to, meaning the worker does not have the workflow execution in its in-memory cache (so it cannot just continue it after applying the events in the task).

In addition to Tihomir’s explanation, I will mention that our Temporal 102 training covers this in some depth. We offer it for multiple SDKs, so you’ll see code examples in your preferred language. Like our other courses, there’s no cost for this training—it’s completely free.

The Understanding Event History chapter in Temporal 102 explains (and demonstrates) the relationship between your Workflow code, the commands that the Worker sends to the Temporal Service, and the Event History. The Understanding Workflow Determinism chapter builds on that to explain (and demonstrate) how replay works by illustrating how a Worker reconstructs application state following a crash.

While there’s a lot of helpful content throughout that course, even just those two chapters will give you a pretty solid understanding of how Temporal works and will accelerate any further exploration. For example, once you understand how replay works, you’ll be able to evaluate whether you need to use versioning to safely deploy a given change to production. Learning the techniques for versioning will be easier since you’ll understand the problem it’s meant to solve.