Long-running ML workflow interruption and update

Hi,

I have a long-running (6+ hours) ML workflow made up of two activities, each training a machine learning model.

I would like to know how to best handle the following scenario:

  • A workflow is running. Activity 1 has completed, leading to model 1 being saved. Activity 2 starts running but gets interrupted because of a redeploy
  • Activity 2 restarts, but the new deploy changed the activity 2 worker, leading to a different model being trained
  • Once activity 2 completes, model 1 and model 2 are now incompatible

What is the best way to detect this situation and then cancel and restart the workflow, so that model 1 and model 2 are both retrained with the latest changes?

EDIT

EDIT 2

  • A further complication in my case is that I will not be changing the activities themselves: each activity calls a Python training script, and that script is what gets modified. So versioning might be difficult to use.

Thanks!

Versioning solves a completely different problem: updating workflow code while long-running workflows are still open.

Temporal supports routing activity tasks to specific hosts, so you should be able to force activity 2 to execute on the same host as activity 1. If the host goes down, the whole sequence can be re-executed on a different host. See the fileprocessing sample, which demonstrates this pattern.

Thanks, Maxim!

Sorry, I don’t think I described my issue correctly:

  • In my setup, it is not important that all activities be carried out on the same host (because the outputs of upstream activities are uploaded upon activity completion and downloaded by subsequent activities).
  • My problem is as follows: my workflow runs in a Docker image which contains the Python code called by the activity. Some workflow updates change that Python code in a non-backward-compatible way, so previously running workflows must be prevented from resuming with the updated Python code.

My question is: how can I best make sure previously running jobs are interrupted and restarted with the updated version?
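
To make the detection part of the question concrete, here is a minimal sketch of one possible check. It is not a Temporal feature. It assumes the training scripts sit at known paths inside the image and that a fingerprint of them can be recorded when the run starts and handed to every later activity. The names `fingerprint`, `check_version`, `CodeVersionMismatch` and `train_model_2.py` are hypothetical and exist only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


class CodeVersionMismatch(Exception):
    """The training code on this worker differs from the code the run started with."""


def fingerprint(script_paths):
    """Return a SHA-256 hex digest of the training scripts' names and contents."""
    digest = hashlib.sha256()
    # Sort by file name so the result does not depend on argument order.
    for path in sorted((Path(p) for p in script_paths), key=lambda p: p.name):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def check_version(expected, script_paths):
    """Raise CodeVersionMismatch if the scripts changed since `expected` was recorded."""
    actual = fingerprint(script_paths)
    if actual != expected:
        raise CodeVersionMismatch(f"expected {expected[:12]}, found {actual[:12]}")
    return actual


if __name__ == "__main__":
    # Simulate a redeploy that modifies the (hypothetical) training script mid-run.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "train_model_2.py"
        script.write_text("print('training, v1')\n")
        started_with = fingerprint([script])   # recorded when the run starts
        check_version(started_with, [script])  # same code: passes silently
        script.write_text("print('training, v2')\n")
        try:
            check_version(started_with, [script])  # changed code: raises
        except CodeVersionMismatch as err:
            print("stale run detected:", err)
```

Under these assumptions, each activity would call `check_version` before it starts training. A mismatch would then have to be treated as a failure that is not retried, so that the whole workflow can be cancelled and started again against the new image.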