Long running ML workflow interruption and update

tmain · May 19, 2021, 8:48am

Hi,

I have a long-running (6+ hours) ML workflow made up of two activities, each training a machine learning model.

I would like to know how to best handle the following scenario:

1. A workflow is running. Activity 1 has completed, leading to model 1 being saved. Activity 2 starts running but gets interrupted because of a redeploy.
2. Activity 2 restarts, but the new deploy changed the activity 2 worker, leading to a different model being trained.
3. Once activity 2 completes, model 1 and model 2 are now incompatible.

What is the best way to detect this situation and then cancel and restart the workflow, so that model 1 and model 2 are both retrained with the latest changes?

EDIT

Just found out about Workflow versioning (https://docs.temporal.io/docs/go/versioning) → would that be the recommended way?

EDIT 2

A further complication in my case is that I will not be changing the Activities themselves (each activity calls a Python training script, and that script is what gets modified), so versioning might be difficult to use.

Thanks!

maxim · May 19, 2021, 3:41pm

Versioning solves a completely different problem: updating workflow code while long-running workflows are still open.

Temporal supports activity task routing to specific hosts. So you should be able to enforce the execution of activity 2 on the same host as activity 1. If the host goes down the whole sequence can be reexecuted on a different host. See the fileprocessing sample that demonstrates this pattern.

tmain · May 19, 2021, 4:15pm

Thanks, Maxim!

Sorry, I don’t think I depicted my issue correctly:

In my setup, it is not important that all activities be carried out on the same host (because the outputs of upstream activities are uploaded upon activity completion and downloaded by subsequent activities).
My problem is as follows: my workflow runs in a Docker image which contains the Python code called by the activity. Some workflow updates will mean that the Python code is updated in a non-backward-compatible way. This means that resuming previously running workflows with the updated Python code should be prevented.

My question is: how can I best make sure previously running jobs are interrupted and restarted with the updated version?
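
For illustration, one possible way to detect the mismatch described above is to stamp model 1 with a fingerprint of the training code and have the second step compare that stamp against its own code before training. This is only a minimal sketch in plain Python, not a Temporal API: the names `code_version`, `save_model_metadata`, `check_compatible`, `CodeVersionMismatch` and the `metadata.json` file are all hypothetical, and it assumes the training scripts are available as files inside the image.

```python
import hashlib
import json
from pathlib import Path


class CodeVersionMismatch(Exception):
    """Raised when a downstream step runs with different training code."""


def code_version(script_paths):
    """Return a short SHA-256 fingerprint of the training scripts' contents."""
    digest = hashlib.sha256()
    # Sort the paths so the fingerprint does not depend on argument order.
    for path in sorted(str(p) for p in script_paths):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()[:12]


def save_model_metadata(model_dir, version):
    """End of step 1: store the fingerprint next to model 1 (hypothetical layout)."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "metadata.json").write_text(json.dumps({"code_version": version}))


def check_compatible(model_dir, current_version):
    """Start of step 2: refuse to train against a model 1 built by other code."""
    metadata = json.loads((Path(model_dir) / "metadata.json").read_text())
    recorded = metadata["code_version"]
    if recorded != current_version:
        raise CodeVersionMismatch(
            f"model 1 was trained with {recorded}, worker now runs {current_version}"
        )
```

If activity 2 calls `check_compatible` first and reports the mismatch as a non-retryable failure, the workflow would fail instead of silently producing an incompatible model 2, and a fresh run could then retrain both models. How that failure is raised and how the restart is triggered depends on the SDK in use and is not shown here.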
