Job queuing design

I am adopting Temporal to manage long-running jobs submitted to an HPC cluster. Roughly, I have a basic workflow that runs for each batch job submission and does:

  1. submit the job to the HPC cluster
  2. watch for completion
  3. if successful, record details in an external database
  4. if failed, resubmit the job

That little workflow runs for each job.
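The per-job loop above can be sketched in plain Go, with the Slurm interactions injected as functions (in a real Temporal workflow these would be activities; all names here are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// JobDeps holds the operations the per-job workflow needs. In a real
// Temporal workflow these would be activity invocations.
type JobDeps struct {
	Submit func(jobID string) error         // submit the job to the Slurm cluster
	Await  func(jobID string) (bool, error) // block until done; true = success
	Record func(jobID string) error         // record details in the external DB
}

// RunJob submits the job, watches for completion, records success, and
// resubmits on failure, up to maxAttempts times.
func RunJob(jobID string, deps JobDeps, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := deps.Submit(jobID); err != nil {
			return fmt.Errorf("submit failed: %w", err)
		}
		ok, err := deps.Await(jobID)
		if err != nil {
			return fmt.Errorf("watch failed: %w", err)
		}
		if ok {
			return deps.Record(jobID) // step 3: record details
		}
		// step 4: job failed; loop around and resubmit
	}
	return errors.New("job failed after all attempts")
}

func main() {
	attempts := 0
	deps := JobDeps{
		Submit: func(id string) error { attempts++; return nil },
		Await:  func(id string) (bool, error) { return attempts >= 2, nil }, // fail once, then succeed
		Record: func(id string) error { fmt.Println("recorded", id); return nil },
	}
	fmt.Println(RunJob("job-42", deps, 3))
}
```

In the Temporal version, the retry-on-failure loop could also be expressed through activity retry policies rather than an explicit loop; the sketch just makes the control flow visible.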

Now, I want to layer on the ability to manage a job queue that can:

  1. accept new job requests (in the form of a unique business id)
  2. avoid running a job that has been run before
  3. be able to report on the number of running, queued, and completed jobs
  4. be able to limit the total number of running jobs (based on querying about available resources)
  5. start new and get status of currently running jobs

The number of queued jobs could be large (500k), but each job is long-running, so Temporal is unlikely to be stressed in any major way. Assuming that I am using a parent-child relationship and that the child works, any suggestions on what the parent would look like? I’m assuming that I need a continue_as_new approach? And how should I manage the state of the queue, particularly since resource limits should be respected?

Thanks for any thoughts.

How many jobs per second maximum do you expect to start?

Thanks, @maxim. The rate-limiting step is the external resource: a traditional Slurm cluster where latency and job resources will lead to probably fewer than 10 jobs/minute.

Maybe to crystallize a bit further: I’m looking to be able to see metrics of job status and, potentially, overall system performance (jobs/hour, etc.). I know I can capture all of this with an external state store. I’m really interested in the extent to which it is feasible and practical to do so with Temporal alone.

I don’t think you want to use the parent child relationship for this. A single Temporal workflow instance cannot support 500k simultaneous children.

So you have to start all the children as independent workflows. Each of these workflows would signal a “semaphore workflow” to get permission to run. Upon receiving a response signal granting permission, the rest of the job executes.

The semaphore workflow would receive the request signals and grant permissions with reply signals. As the state is too large to keep inside the workflow, it would need an external DB to store statistics and the request queue, accessed via an activity. Then use another activity to decide which job should be granted permission based on the DB data.
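The semaphore's bookkeeping, separated from the Temporal and DB plumbing, could look something like this (plain Go; the in-memory maps and queue are illustrative stand-ins for the external-DB activities described above, and all names are hypothetical):

```go
package main

import "fmt"

// Semaphore tracks job state and grants run permission up to a limit.
// In the real design the queue and counters would live in an external DB
// accessed via activities; this in-memory version only shows the logic.
type Semaphore struct {
	limit     int
	running   map[string]bool
	queued    []string
	seen      map[string]bool
	completed int
}

func NewSemaphore(limit int) *Semaphore {
	return &Semaphore{limit: limit, running: map[string]bool{}, seen: map[string]bool{}}
}

// Request enqueues a job by business id and returns any jobs now granted
// permission. A business id that has been seen before is rejected
// (requirement 2: avoid re-running a job).
func (s *Semaphore) Request(jobID string) []string {
	if s.seen[jobID] {
		return nil
	}
	s.seen[jobID] = true
	s.queued = append(s.queued, jobID)
	return s.grant()
}

// Release marks a job complete, freeing a slot; returns newly granted jobs.
func (s *Semaphore) Release(jobID string) []string {
	delete(s.running, jobID)
	s.completed++
	return s.grant()
}

// grant moves queued jobs into running while capacity remains. In the
// Temporal version, this is where the workflow would send reply signals.
func (s *Semaphore) grant() []string {
	var granted []string
	for len(s.running) < s.limit && len(s.queued) > 0 {
		id := s.queued[0]
		s.queued = s.queued[1:]
		s.running[id] = true
		granted = append(granted, id)
	}
	return granted
}

// Stats reports running, queued, and completed counts (requirement 3).
func (s *Semaphore) Stats() (running, queued, completed int) {
	return len(s.running), len(s.queued), s.completed
}

func main() {
	sem := NewSemaphore(2)
	fmt.Println(sem.Request("a")) // granted immediately
	fmt.Println(sem.Request("b")) // granted immediately
	fmt.Println(sem.Request("c")) // limit reached: stays queued
	fmt.Println(sem.Release("a")) // slot freed: c is granted
	fmt.Println(sem.Stats())
}
```

The `limit` here could itself be refreshed periodically by an activity that queries the cluster for available resources (requirement 4), rather than being a fixed number.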


Thanks, Maxim! I am rewarded for asking a question with an unexpected approach that answers the question but also extends my understanding with another Temporal paradigm that I hadn’t recognized (your semaphore workflow). So elegant and easily implementable!

This can be used as an inspiration: samples-go/mutex at main · temporalio/samples-go · GitHub.
Note that a single workflow has limited throughput, so don’t assume that you can use this design for high throughput scenarios.