Running ML training on temporary cloud instance

Hello everyone!

I’m just starting with Temporal (which looks awesome), and I’m trying to solve a specific use case for a POC.
We are running long ML training jobs in docker/python. I’d like to build a workflow that does the following (rough sketch after the list):
1 - Spawn a cloud instance (B)
2 - Start a docker/python job (activity?)
3 - Output job logs in the Temporal Web UI (activity?)
4 - Kill the instance
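
Roughly what I have in mind, as a very rough Go SDK sketch. The activity names (ProvisionInstance, RunTrainingJob, TerminateInstance) are placeholders I made up, not real APIs:

```go
package mlpoc

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// TrainingWorkflow: provision an instance, run the docker/python job on it,
// then always tear the instance down.
func TrainingWorkflow(ctx workflow.Context, jobSpec string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 24 * time.Hour, // long-running training job
		HeartbeatTimeout:    time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// 1 - Spawn a cloud instance (B)
	var instanceID string
	if err := workflow.ExecuteActivity(ctx, "ProvisionInstance", jobSpec).Get(ctx, &instanceID); err != nil {
		return err
	}

	// 4 - Kill the instance, even if the job fails or the workflow is cancelled
	defer func() {
		cleanupCtx, _ := workflow.NewDisconnectedContext(ctx)
		cleanupCtx = workflow.WithActivityOptions(cleanupCtx, ao)
		_ = workflow.ExecuteActivity(cleanupCtx, "TerminateInstance", instanceID).Get(cleanupCtx, nil)
	}()

	// 2/3 - Start the docker/python job and surface its logs/heartbeats
	return workflow.ExecuteActivity(ctx, "RunTrainingJob", instanceID, jobSpec).Get(ctx, nil)
}
```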

Context:

  • We use the Go SDK
  • We do not use Kubernetes yet …
  • We are using GCP

Hypothesis:

  • Temporal service would run on an instance A
  • Temporal main worker would run on instance A
  • I should dynamically start a Temporal worker on instance B to wrap/monitor the docker/python job locally rather than remotely over SSH. But I suspect there is something wrong with this idea, since I cannot guarantee the local job/worker binding when there are multiple jobs running in parallel.

Any ideas on the best practice to achieve this? (I might be approaching it wrong.)
Thank you very much, have a great day.

This is trivial to achieve with Temporal. The idea is to use an instance-specific task queue to route tasks to a specific worker. So when the worker starts on instance B, it listens on the “B-” task queue, and the workflow schedules activities on that task queue.
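
A rough sketch of that routing with the Go SDK (the “instance-” queue naming, the environment variables, and the RunTrainingJob activity are just placeholders, not something prescribed by Temporal):

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// RunTrainingJob wraps the local docker/python job (placeholder body):
// exec `docker run ...` here, stream logs, and heartbeat while it runs.
func RunTrainingJob(ctx context.Context, jobSpec string) error {
	return nil
}

// Worker process started on instance B right after it boots.
func main() {
	// Temporal service address (e.g. instance A) comes from the environment here.
	c, err := client.Dial(client.Options{HostPort: os.Getenv("TEMPORAL_ADDRESS")})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Instance-specific task queue, e.g. "instance-B": activities scheduled on
	// this queue can only run on this machine.
	taskQueue := "instance-" + os.Getenv("INSTANCE_NAME")
	w := worker.New(c, taskQueue, worker.Options{})
	w.RegisterActivity(RunTrainingJob)
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatal(err)
	}
}

// Workflow side (runs on the main worker on instance A): schedule the job
// activity on the instance-specific queue so it lands on instance B.
func runJobOnInstance(ctx workflow.Context, instanceTaskQueue, jobSpec string) error {
	ao := workflow.ActivityOptions{
		TaskQueue:           instanceTaskQueue, // e.g. "instance-B"
		StartToCloseTimeout: 24 * time.Hour,
		HeartbeatTimeout:    time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, RunTrainingJob, jobSpec).Get(ctx, nil)
}
```

The provisioning step can return the task-queue name it started the worker with, so the workflow always knows where to route each job, even with many instances running in parallel.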

Thanks Maxim. Indeed, as I understand Temporal more and more, this looks trivial with per-instance task queues.
Have a good day.
