Running ML training on temporary cloud instance

Hello everyone!

I’m just starting with Temporal (which looks awesome), and I’m trying to solve a specific use case for a POC.
We are running long ML training jobs in docker/python. I’d like to build a workflow that does the following (rough sketch after the list):
1 - Spawn a cloud instance (B)
2 - Start a docker/python job (activity?)
3 - Output job logs in the Temporal Web UI (activity?)
4 - Kill the instance
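
Roughly what I have in mind, as a very rough Go SDK sketch. The activity names (ProvisionInstance, RunTrainingJob, TerminateInstance) are placeholders I made up, not real APIs:

```go
package mlpoc

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// TrainingWorkflow: provision an instance, run the docker/python job on it,
// then always tear the instance down.
func TrainingWorkflow(ctx workflow.Context, jobSpec string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 24 * time.Hour, // long-running training job
		HeartbeatTimeout:    time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// 1 - Spawn a cloud instance (B)
	var instanceID string
	if err := workflow.ExecuteActivity(ctx, "ProvisionInstance", jobSpec).Get(ctx, &instanceID); err != nil {
		return err
	}

	// 4 - Kill the instance, even if the job fails or the workflow is cancelled
	defer func() {
		cleanupCtx, _ := workflow.NewDisconnectedContext(ctx)
		cleanupCtx = workflow.WithActivityOptions(cleanupCtx, ao)
		_ = workflow.ExecuteActivity(cleanupCtx, "TerminateInstance", instanceID).Get(cleanupCtx, nil)
	}()

	// 2/3 - Start the docker/python job and surface its logs/heartbeats
	return workflow.ExecuteActivity(ctx, "RunTrainingJob", instanceID, jobSpec).Get(ctx, nil)
}
```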

Context:

  • We use the Go SDK
  • We do not use Kubernetes yet …
  • We are using GCP

Hypothesis:

  • Temporal service would run on an instance A
  • Temporal main worker would run on instance A
  • I should dynamically start a Temporal worker on instance B to wrap/monitor the docker/python job locally rather than remotely over SSH. But I suspect there is something wrong with this idea, since I cannot guarantee the local job/worker binding when there are multiple jobs running in parallel.

Any ideas on the best practice to achieve this? (I might be approaching it wrong.)
Thank you very much, have a great day.

This is trivial to achieve with Temporal. The idea is to use an instance-specific task queue to route tasks to a specific worker. So when the worker starts on instance B, it listens on the “B-” task queue, and the workflow schedules activities on that task queue.
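
A rough sketch of that routing with the Go SDK (the “instance-” queue naming, the environment variables, and the RunTrainingJob activity are just placeholders, not something prescribed by Temporal):

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// RunTrainingJob wraps the local docker/python job (placeholder body):
// exec `docker run ...` here, stream logs, and heartbeat while it runs.
func RunTrainingJob(ctx context.Context, jobSpec string) error {
	return nil
}

// Worker process started on instance B right after it boots.
func main() {
	// Temporal service address (e.g. instance A) comes from the environment here.
	c, err := client.Dial(client.Options{HostPort: os.Getenv("TEMPORAL_ADDRESS")})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Instance-specific task queue, e.g. "instance-B": activities scheduled on
	// this queue can only run on this machine.
	taskQueue := "instance-" + os.Getenv("INSTANCE_NAME")
	w := worker.New(c, taskQueue, worker.Options{})
	w.RegisterActivity(RunTrainingJob)
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatal(err)
	}
}

// Workflow side (runs on the main worker on instance A): schedule the job
// activity on the instance-specific queue so it lands on instance B.
func runJobOnInstance(ctx workflow.Context, instanceTaskQueue, jobSpec string) error {
	ao := workflow.ActivityOptions{
		TaskQueue:           instanceTaskQueue, // e.g. "instance-B"
		StartToCloseTimeout: 24 * time.Hour,
		HeartbeatTimeout:    time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, RunTrainingJob, jobSpec).Get(ctx, nil)
}
```

The provisioning step can return the task-queue name it started the worker with, so the workflow always knows where to route each job, even with many instances running in parallel.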

Thanks Maxim. Indeed, as I understand Temporal more and more, this looks trivial with per-instance task queues.
Have a good day.
