Temporal for computationally intense burst workloads

Hi! I just found out about Temporal and have been pretty blown away. It will likely cause a paradigm shift in development over the coming years. I want to get its benefits at my current company, but I'm not sure Temporal is a good fit. We run highly GPU-intensive video transcoding and 3D modeling jobs at a customer's request (automated and triggered through a web API). Currently we are heavily invested in AWS Step Functions, AWS Lambda, and AWS Batch, and in serverless in general, to keep our costs down while being able to scale horizontally without limit. Provisioning many GPU workers for Temporal likely won't work for us and would likely be wasteful, given we don't need the compute running 24/7. I'm going to assume that for our use case Temporal isn't a good fit. Am I wrong? We have many steps in our Step Functions definitions, and managing the YAML is becoming too much for us, so Temporal would be a great solution. Are there other customers with more bursty, serverless workloads similar to ours using Temporal?

I think Temporal is a perfect fit for your use case.

Provisioning many GPU workers for Temporal likely won't work for us and would likely be wasteful, given we don't need the compute running 24/7.

Temporal supports routing activity tasks to specific pools of workers, or even to individual workers, via dedicated task queues. The way I would solve this problem is by having activities that provision the workers as part of your workflow definitions. Another option is to have a separate workflow that implements autoscaling of GPU workers.
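To make the routing idea concrete, here is a stdlib-only Python sketch of the pattern, not actual Temporal SDK code: every pool gets its own task queue, a small always-on worker runs the provisioning step, and the heavy activity is enqueued on the pool's dedicated queue. All names (`provision_gpu_pool`, `schedule_activity`, the queue names) are hypothetical.

```python
import queue

# Hypothetical in-memory stand-ins for Temporal task queues, one per worker pool.
task_queues: dict[str, queue.Queue] = {
    "control-plane": queue.Queue(),  # served by small, always-on workers
}

def provision_gpu_pool(pool: str) -> str:
    """Activity run on the control plane: start GPU instances (e.g. via your
    cloud API) that each host a worker polling the pool's dedicated queue."""
    task_queues.setdefault(pool, queue.Queue())
    # here you would call out to AWS to launch instances for `pool`
    return pool

def schedule_activity(task_queue: str, activity: str, payload: dict) -> None:
    """Route an activity task to a specific pool by choosing its queue."""
    task_queues[task_queue].put((activity, payload))

# Workflow logic, greatly simplified: provision first, then route the GPU task
# to the queue that only the freshly provisioned workers consume.
pool = provision_gpu_pool("gpu-transcode")
schedule_activity(pool, "transcode_video", {"job_id": "job-123"})
```

In the real SDK the routing decision is just the task-queue name you pass when scheduling the activity; the rest (polling, retries, timeouts) is handled by the service and the worker.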

Thanks for the response. I'm still trying to understand the model. From what I understand, the workers need to be persistent in order to poll the queue. Is the idea that I could use an always-running worker to spin up a GPU worker and add the work to a queue for it to pick up?

Correct. You still need some always-running workers to host the workflow code and the activities that control other worker processes. But I doubt these would be expensive to run.
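If you instead go the autoscaler-workflow route mentioned earlier, the core scaling decision is small enough to sketch. This is an illustrative stdlib-only function (the function name, throughput figure, and caps are all made up), not anything from the Temporal API:

```python
import math

def desired_gpu_workers(backlog: int, jobs_per_worker: int, max_workers: int) -> int:
    """How many GPU workers an autoscaling workflow would request, given the
    current task-queue backlog. Scales all the way to zero when idle, which is
    what keeps a bursty workload cheap."""
    if backlog <= 0:
        return 0
    return min(max_workers, math.ceil(backlog / jobs_per_worker))
```

An autoscaling workflow would periodically read the backlog, call something like this, and run provisioning or teardown activities to converge on the result.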

Got it. And just to be clear, the newly spun-up GPU instances can have Temporal code on them and will be tracked?

You would run so-called Temporal worker processes on these instances. A worker process is essentially a queue consumer: it listens on a task queue hosted inside the Temporal service using long-poll gRPC requests. So Temporal doesn't track those instances directly; they ask for work instead.
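Conceptually, a worker's polling loop is just the following (a stdlib Python sketch with invented names; the real SDK does this over gRPC with much longer poll timeouts, sticky queues, retries, and so on):

```python
import queue
import threading

def worker_loop(task_queue: queue.Queue, handlers: dict, stop: threading.Event) -> list:
    """Long-poll the queue: block up to a timeout waiting for a task, run its
    handler, and poll again. The server never contacts the worker; the worker
    always asks for work, which is why nothing needs to 'track' it."""
    results = []
    while not stop.is_set():
        try:
            name, payload = task_queue.get(timeout=0.1)  # the "long poll"
        except queue.Empty:
            continue  # an empty poll is normal; just poll again
        results.append(handlers[name](payload))
        task_queue.task_done()
    return results

# Demo: enqueue one task, let a worker thread consume it, then shut down.
q: queue.Queue = queue.Queue()
q.put(("transcode", {"job_id": "job-123"}))
stop = threading.Event()
handlers = {"transcode": lambda p: f"done:{p['job_id']}"}

out: list = []
t = threading.Thread(target=lambda: out.extend(worker_loop(q, handlers, stop)))
t.start()
q.join()    # wait until the task has been processed
stop.set()  # tell the worker to exit its poll loop
t.join()
```

This is also why scale-to-zero works: when no worker is polling a queue, tasks simply wait there (subject to their timeouts) until one comes up.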