I have a long-running (6+ hours) ML workflow made up of two activities, each training a machine learning model.
I would like to know how to best handle the following scenario:
- A workflow is running. Activity 1 has completed and model 1 has been saved. Activity 2 starts running but is interrupted by a redeploy.
- Activity 2 restarts, but the new deploy changed the activity 2 worker, so a different model is trained than the one the original run would have produced.
- Once activity 2 completes, model 1 and model 2 are incompatible, because they were trained with different versions of the code.
What is the best way to detect this situation and then cancel and restart the workflow, so that model 1 and model 2 are both retrained with the latest changes?
- I just found out about Workflow versioning (https://docs.temporal.io/docs/go/versioning). Would that be the recommended way?
- A further complication in my case is that I will not be changing the Activities themselves. Each activity calls a Python training script, and that script is what gets modified, so versioning might be difficult to use.
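
To make the detection part concrete, here is a rough sketch of what I have in mind. It is plain Python and not tied to any Temporal API. The idea is that activity 1 would record a fingerprint of the training scripts alongside model 1, and activity 2 would recompute it before training and refuse to continue on a mismatch. The names `script_fingerprint`, `check_compatible` and `StaleModelError` are placeholders I made up for illustration, and I am assuming that hashing the script files is a good enough proxy for "the training code changed".

```python
import hashlib
from pathlib import Path


class StaleModelError(Exception):
    """Hypothetical error: model 1 was trained with different script contents."""


def script_fingerprint(paths):
    """Return a SHA-256 hex digest over the names and contents of the training scripts."""
    digest = hashlib.sha256()
    # Sort so the fingerprint does not depend on the order the paths are passed in.
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def check_compatible(recorded_fingerprint, current_fingerprint):
    """Raise StaleModelError if the scripts changed since model 1 was trained."""
    if recorded_fingerprint != current_fingerprint:
        raise StaleModelError(
            "training scripts changed since model 1 was trained: "
            f"recorded {recorded_fingerprint[:12]}, current {current_fingerprint[:12]}"
        )
```

In this sketch, activity 1 would return `script_fingerprint([...])` as part of its result, the workflow would pass that value to activity 2, and activity 2 would call `check_compatible` first. What I do not know is how the workflow should best react to that failure so that both models get retrained from scratch, which is really the heart of my question.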