Design for Many Short-Lived, Different-Sized Workers

Looking for advice on best practices around dynamically provisioned, short-lived workers.

We have many different workflows that need to run at the same time, but they have different resource requirements: some require a large amount of CPU, others a large amount of memory, some neither, etc.
The workflows are also based on changes a dev/user is making, so they need to be deployed for each run; they can't be long-lived workers built from a static image.

So instead of the normal long-lived worker pattern, which can be horizontally scaled to handle more load, we need many different-sized workers that are short-lived. Is there a best pattern for managing these types of resources? Should the workflow itself be making a request for the resource and self-managing? Should we deploy all the needed workers at the start, scale them down to 0, and scale back up when we see requests arriving on their task queues / activities? Not sure what other options there may be.

Thanks!

It is hard to give specific advice without a deeper understanding of the use case.

The general idea is that provisioning workers can be part of the workflow itself.

@fogerd, when you say “The workflows are also based on changes a dev / user is making”, do you mean that after a dev/user change, the workflows need to be running different source code?

Correct. So the workers need to be re-deployed to pick up these changes.

I would package the new worker code into a docker image. Then a workflow would start a new job with that image to provision those workers.
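For example (a rough sketch only, assuming Kubernetes and the Temporal Python SDK; the image name, namespace, and job naming are all made up), the provisioning step can be an ordinary activity that launches a Job running the freshly built worker image:

```python
# Sketch only: an activity that launches a Kubernetes Job running the newly
# built worker image. Image, namespace, and naming are placeholder values.
from kubernetes import client, config
from temporalio import activity


@activity.defn
def launch_worker_job(image: str, run_id: str) -> str:
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    job_name = f"temporal-worker-{run_id}"
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=job_name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(name="worker", image=image)],
                )
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
    return job_name
```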

Correct me if I’m wrong, but this sounds primarily like a question of how to deploy code changes to workflow workers. If you can do that, then you can simply have a task queue for each resource requirement: one task queue for large-CPU instances, another task queue for large-memory instances, etc.
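To make that concrete, here’s a minimal sketch with the Temporal Python SDK (the queue names and activities are invented, not anything from your setup): the workflow routes each activity to the task queue whose workers are sized for that kind of work.

```python
# Hypothetical sketch: routing activities to resource-specific task queues.
from datetime import timedelta
from temporalio import activity, workflow


@activity.defn
async def crunch_numbers(job_id: str) -> str:
    return f"cpu-result:{job_id}"  # CPU-heavy work, runs on workers polling "high-cpu-queue"


@activity.defn
async def build_index(job_id: str) -> str:
    return f"mem-result:{job_id}"  # memory-heavy work, runs on workers polling "high-memory-queue"


@workflow.defn
class ResourceRoutingWorkflow:
    @workflow.run
    async def run(self, job_id: str) -> None:
        # The task_queue argument sends each activity to the worker fleet
        # sized for that kind of work.
        await workflow.execute_activity(
            crunch_numbers,
            job_id,
            task_queue="high-cpu-queue",
            start_to_close_timeout=timedelta(hours=1),
        )
        await workflow.execute_activity(
            build_index,
            job_id,
            task_queue="high-memory-queue",
            start_to_close_timeout=timedelta(hours=1),
        )
```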

Consider a fleet of worker instances pulling work from a particular task queue. To make a code change, we can do a rolling update by starting new worker instances running the new code, while shutting down old instances running the old code.

For long-running workflows that we wanted to start using new code, we’d need to migrate the workflow by patching. Since you don’t have long-running workflows, it’s sufficient for the new code to use a new workflow type; that is, each change the dev/user makes maps to a different workflow definition name.

The new deployment can keep supporting the old workflow definitions under their old type names until all the old workflow executions have been archived. Then the code for those old definitions can be removed.
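Sketched with the Python SDK (the type names are invented for illustration), that might just mean the worker registers both definitions side by side until the old executions drain:

```python
# Hypothetical sketch: old and new workflow definitions registered together.
from temporalio import workflow


@workflow.defn(name="BuildPipeline-v41")
class BuildPipelineV41:
    @workflow.run
    async def run(self, job_id: str) -> None:
        ...  # logic from the previous dev/user change


@workflow.defn(name="BuildPipeline-v42")
class BuildPipelineV42:
    @workflow.run
    async def run(self, job_id: str) -> None:
        ...  # logic from the latest dev/user change


# Worker registration (assumed queue name):
# Worker(client, task_queue="high-cpu-queue",
#        workflows=[BuildPipelineV41, BuildPipelineV42], activities=[...])
```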

A constraint is that if a worker pulls a task from the queue and it turns out to be a workflow type that it doesn’t know about, it’ll fail that task. (What is a Temporal Worker? | Temporal Documentation)

I haven’t tested this, but since workflow tasks are simply logic, I imagine shutting down a workflow worker would normally be quick. You might find that a rolling update completes fast enough that you can deploy and start using the new code right away.

If not, perhaps you might be able to use worker versioning to avoid having the old workers pull tasks for the new code.

Worker versioning isn’t available on Temporal Cloud yet, so finally you might consider using blue-green task queues (two queues for each resource type). Initially, all tasks for a resource type go to its blue queue, and all workers of that resource type pull from the blue queue. When you have new code, you spin up worker instances listening to the green queue and start new workflow executions on the green queue. Once all the old workflows using the blue queue have completed, you can shut down the blue workers. (On the next code change it would be the green queue running the old code, and the blue queue that would spin up to run the new code.)
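A tiny sketch of the routing side, assuming the Python SDK and a made-up config source for the active color per resource type:

```python
# Hypothetical sketch: blue-green task queues per resource type. The "active
# color" might live in real config or a small service; here it is a plain dict.
import uuid
from temporalio.client import Client

ACTIVE_COLOR = {"high-cpu": "blue", "high-memory": "green"}  # assumed config


def queue_for(resource_type: str) -> str:
    # e.g. "high-cpu-blue" or "high-cpu-green"
    return f"{resource_type}-{ACTIVE_COLOR[resource_type]}"


async def start_run(client: Client, job_id: str) -> None:
    # New executions always go to the currently active color; old workers keep
    # draining the other color's queue until their workflows complete.
    await client.start_workflow(
        "ResourceRoutingWorkflow",  # workflow type name (string form); any definition of yours
        job_id,
        id=f"run-{uuid.uuid4()}",
        task_queue=queue_for("high-cpu"),
    )
```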

Oh wait, that doesn’t make sense. Why would you need to run high-CPU or high-memory workflow workers? I assume you’re referring to running activity tasks on high-CPU or high-memory instances? That would knock out the “just restart the workers quickly” approach, since activities can be longer-running.

@awwx I think you missed the core point of the question: I’m looking for help with short-lived workers, not “a question of how to deploy code changes to workflow workers”. I don’t want two queues of large CPU/memory boxes up all the time, since that costs $$$. I want the workflow to spin up the necessary resources at runtime.
And yes, this is for high-resource activities, not workflows.

OK, let me know if I’m still misunderstanding here… now I’m imagining something like:

  1. A dev/user triggers an execution by making a code change.
  2. You spin up an expensive resource for that execution, such as a large CPU or high memory instance.
  3. You run the user code in that instance.
  4. When the code is finished, you shut down the expensive resource.

Alright, so say you have some control code which does steps 2-4. E.g., if you’re spinning up EC2 instances, the code is making AWS API calls to launch and shut down the instance. Etc.

Now, to run this with Temporal, you put the control code in an activity worker. The temporary expensive instance running the user code doesn’t have a Temporal activity worker running on it. Instead, the activity worker is in charge of spinning up the expensive instance, invoking the user code, and shutting down the instance. (Or however you need the process to work).

Now the activity worker is cheap. It doesn’t need to run on an expensive instance itself; all it’s doing is making API calls. The activity worker’s code doesn’t change when the user code changes; all you need is to pass it a parameter saying which user code to run. The activity worker can be run as a long-lived worker, as is typical for Temporal.
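As a hedged sketch (Python SDK plus boto3; the AMI, instance type, and the “run the user code” step are all placeholders), the control code might look roughly like this:

```python
# Hypothetical sketch: the control code lives in ordinary (cheap) activities.
from datetime import timedelta

import boto3
from temporalio import activity, workflow


@activity.defn
def launch_instance(params: dict) -> str:
    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId=params["ami_id"],              # image containing the user code
        InstanceType=params["instance_type"],  # e.g. a high-CPU or high-memory type
        MinCount=1,
        MaxCount=1,
    )
    return resp["Instances"][0]["InstanceId"]


@activity.defn
def run_user_code(instance_id: str) -> None:
    ...  # e.g. SSM/SSH into the instance and invoke the user's entry point


@activity.defn
def terminate_instance(instance_id: str) -> None:
    boto3.client("ec2").terminate_instances(InstanceIds=[instance_id])


@workflow.defn
class ExpensiveRunWorkflow:
    @workflow.run
    async def run(self, params: dict) -> None:
        instance_id = await workflow.execute_activity(
            launch_instance, params, start_to_close_timeout=timedelta(minutes=10)
        )
        try:
            await workflow.execute_activity(
                run_user_code, instance_id, start_to_close_timeout=timedelta(hours=2)
            )
        finally:
            # Always tear the expensive instance down, even if the run fails.
            await workflow.execute_activity(
                terminate_instance, instance_id, start_to_close_timeout=timedelta(minutes=5)
            )
```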

I agree with @awwx.

An additional option is to run a worker on the “expensive instance.” This worker starts when the instance starts, which allows sending activities to it in a flow-controlled manner.
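A sketch of that variant, assuming the Python SDK, a per-run task queue name passed to the instance at startup, and a placeholder server address:

```python
# Hypothetical sketch: the expensive instance runs its own worker on a
# dedicated task queue, so the control workflow can send it activities with
# normal Temporal flow control, retries, and heartbeats.
import asyncio

from temporalio import activity
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def heavy_task(job_id: str) -> str:
    ...  # the actual CPU/memory-heavy user code runs here, on this instance
    return f"done:{job_id}"


async def main() -> None:
    client = await Client.connect("temporal-frontend:7233")  # address assumed
    worker = Worker(
        client,
        task_queue="run-12345-queue",  # per-run queue, passed in at instance startup
        activities=[heavy_task],
    )
    await worker.run()  # poll until the control workflow shuts this instance down


if __name__ == "__main__":
    asyncio.run(main())
```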