Design for Many Short Lived Different Size Workers

Looking for advice on best practices around dynamically provisioned, short-lived workers.

We have many different workflows that need to run at the same time, but they have different resource requirements. Some require a large amount of CPU, others a large amount of memory, some neither, etc.
The workflows are also based on changes a dev / user is making, so the workers need to be deployed for each run; they cannot be long-lived workers built from a static image.

So instead of the normal long-lived worker pattern, which can be horizontally scaled to handle more load, we need many different-sized workers that are short-lived. Is there a best pattern for managing these types of resources? Should the workflow itself be making a request for the resource and self-managing? Should we deploy all the needed workers at the start, scale them down to 0, and scale up when we see requests to their task queues or activities starting? Not sure what other options there may be.


It is hard to give specific advice without a deeper understanding of the use case.

The general idea is that provisioning workers can be part of the workflow itself.

@fogerd, when you say “The workflows are also based on changes a dev / user is making”, do you mean that after a dev/user change, the workflows need to be running different source code?

Correct. So the workers need to be re-deployed to pick up these changes.

I would package the new worker code into a docker image. Then a workflow would start a new job with that image to provision those workers.

Correct me if I’m wrong, but this sounds primarily like a question of how to deploy code changes to workflow workers. If you can do that, then you can simply have a task queue for each resource requirement: one task queue for large CPU instances, another task queue for large memory instances, etc.
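As a rough illustration of the queue-per-resource idea, here is a minimal Python sketch that picks a task queue name from a resource request. The queue names, the `ResourceRequest` type, and the thresholds are all hypothetical and not from this thread; nothing here calls the Temporal SDK, it only shows the routing decision.

```python
from dataclasses import dataclass

# Hypothetical task queue names; pick whatever naming fits your deployment.
HIGH_CPU_QUEUE = "activities-high-cpu"
HIGH_MEM_QUEUE = "activities-high-mem"
DEFAULT_QUEUE = "activities-default"


@dataclass(frozen=True)
class ResourceRequest:
    """What a piece of work needs (hypothetical shape)."""
    cpu_cores: int
    memory_gb: int


def task_queue_for(req: ResourceRequest,
                   cpu_threshold: int = 16,
                   mem_threshold_gb: int = 64) -> str:
    """Map a resource request to the task queue served by matching workers.

    The thresholds are assumptions for illustration only.
    Memory is checked first, so a request that is large in both
    dimensions lands on the high-memory queue.
    """
    if req.memory_gb >= mem_threshold_gb:
        return HIGH_MEM_QUEUE
    if req.cpu_cores >= cpu_threshold:
        return HIGH_CPU_QUEUE
    return DEFAULT_QUEUE
```

The caller would then pass the returned name as the task queue when scheduling the work, and only workers running on matching hardware would poll that queue.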

Consider a fleet of worker instances pulling work from a particular task queue. To make a code change, we can do a rolling update by starting new worker instances running the new code, while shutting down old instances running the old code.

For long running workflows that we wanted to start using new code, we’d need to migrate the workflow by patching. Since you don’t have long running workflows, it’s sufficient for the new code to use a new workflow type; that is, each change the dev/user makes maps to a different workflow definition name.

The new code can continue to support the old code for the old workflows under the old workflow definition type name until all the old workflows have been archived. Then the code for the old workflow definitions can be removed.

A constraint is that if a worker pulls a task from the queue and it turns out to be a workflow type that it doesn’t know about, it’ll fail that task. (What is a Temporal Worker? | Temporal Documentation)

I haven’t tested this, but since workflow tasks are simply logic, I imagine shutting down a workflow worker would normally be quick. You might find that performing a rolling update can be done fast enough that you could deploy and start using the new code.

If not, perhaps you might be able to use worker versioning to avoid having the old workers pull tasks for the new code.

Worker versioning isn’t available on Temporal Cloud yet, so finally you might consider using blue-green task queues (two queues for each resource type): initially all tasks for a resource type go to its blue queue, and all workers of that resource type pull from that blue queue. When you have new code, you can spin up worker instances listening to the green queue and start new workflow executions on the green queue. Once all the old workflows using the blue queue have completed, you can shut down the blue workers. (On the next code change, the green queue would be running the old code and the blue queue would spin up to run the new code.)
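To make the blue-green bookkeeping concrete, here is a small Python sketch, assuming one pair of queues per resource type. The `BlueGreenQueues` class and the `<resource>-<color>` naming are hypothetical; it only tracks which queue new executions should use and does not talk to Temporal.

```python
class BlueGreenQueues:
    """Tracks which of two task queues is active for one resource type."""

    def __init__(self, resource_type: str) -> None:
        self.resource_type = resource_type
        self._active = "blue"  # all work starts on the blue queue

    def _name(self, color: str) -> str:
        # Hypothetical naming scheme: "<resource_type>-<color>"
        return f"{self.resource_type}-{color}"

    @property
    def active_queue(self) -> str:
        """Queue that new workflow executions should be started on."""
        return self._name(self._active)

    @property
    def idle_queue(self) -> str:
        """Queue that the next code version's workers will listen on."""
        return self._name("green" if self._active == "blue" else "blue")

    def flip(self) -> str:
        """Switch new executions to the other queue; return its name.

        Call this after workers running the new code are polling the
        idle queue. Old workers can be shut down once the executions
        on the previously active queue have completed.
        """
        self._active = "green" if self._active == "blue" else "blue"
        return self.active_queue
```

For example, `BlueGreenQueues("high-cpu")` starts with `high-cpu-blue` active; after one `flip()` new executions go to `high-cpu-green`, and a second `flip()` returns to blue.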

Oh wait, that doesn’t make sense. Why would you need to run high-CPU or high-memory workflow workers? I assume you’re referring to running activity tasks in high-CPU or high-memory instances? That would knock out the “just restart the workers quickly” approach, since activities can be longer running.

@awwx I think you missed the core point of the question - I’m looking for help on short lived workers. Not “a question of how to deploy code changes to workflow workers”. I don’t want to have two queues with large boxes of CPU / MEM up all the time since that costs $$$. I want to have the workflow spin up the necessary resources at runtime.
And yes this is for high resource activities, not workflows.

OK, let me know if I’m still misunderstanding here… now I’m imagining something like:

  1. A dev/user triggers an execution by making a code change.
  2. You spin up an expensive resource for that execution, such as a large CPU or high memory instance.
  3. You run the user code in that instance.
  4. When the code is finished, you shut down the expensive resource.

Alright, so say you have some control code which does steps 2-4. E.g., if you’re spinning up EC2 instances, the code is making AWS API calls to launch and shut down the instance.

Now, to run this with Temporal, you put the control code in an activity worker. The temporary expensive instance running the user code doesn’t have a Temporal activity worker running on it. Instead, the activity worker is in charge of spinning up the expensive instance, invoking the user code, and shutting down the instance. (Or however you need the process to work).

Now the activity worker is cheap. It doesn’t need to be running on an expensive instance itself. All it’s doing is making API calls. The activity worker’s code doesn’t change when the user code changes; all you need is to pass it a parameter saying which user code to run. The activity worker can be run as a long-lived worker, as is typical for Temporal.
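A minimal Python sketch of that control code, under stated assumptions: `InstanceProvider`, `run_user_code`, and `FakeProvider` are hypothetical names, and the provider methods stand in for whatever cloud API calls you would actually make. In a Temporal setup the body of `run_user_code` would live in an activity on the cheap worker; the SDK wiring is omitted here.

```python
from typing import Protocol


class InstanceProvider(Protocol):
    """Hypothetical interface over the cloud API used to manage instances."""

    def launch(self, instance_type: str) -> str: ...
    def run(self, instance_id: str, code_ref: str) -> str: ...
    def terminate(self, instance_id: str) -> None: ...


def run_user_code(provider: InstanceProvider,
                  instance_type: str,
                  code_ref: str) -> str:
    """Steps 2-4: launch the instance, run the user code, always tear down.

    code_ref is the parameter saying which user code to run
    (for example a commit or image tag; the format is an assumption).
    """
    instance_id = provider.launch(instance_type)
    try:
        return provider.run(instance_id, code_ref)
    finally:
        # Runs on success and on failure, so the expensive
        # instance is not left running if the user code raises.
        provider.terminate(instance_id)


class FakeProvider:
    """In-memory stand-in used only to exercise the control flow."""

    def __init__(self) -> None:
        self.events = []  # ordered record of calls made

    def launch(self, instance_type: str) -> str:
        self.events.append(f"launch:{instance_type}")
        return "i-fake"

    def run(self, instance_id: str, code_ref: str) -> str:
        self.events.append(f"run:{code_ref}")
        return f"ran {code_ref} on {instance_id}"

    def terminate(self, instance_id: str) -> None:
        self.events.append(f"terminate:{instance_id}")
```

The `try`/`finally` is the important part: whatever happens to the user code, the teardown call is still made, which is what keeps the expensive resource short-lived.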

I agree with @awwx.

An additional option is to run a worker on the “expensive instance.” This worker starts when the instance starts, which allows sending activities to it in a flow-controlled manner.