Job queuing design

I am adopting Temporal to manage long-running jobs submitted to an HPC cluster. Roughly, I have a basic workflow that runs for each batch job submission and does:

  1. submit the job to the HPC cluster
  2. watch for completion
  3. if successful, record details in an external database
  4. if failed, resubmit the job

That little workflow runs for each job.
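The per-job loop above can be sketched in plain Go, with the Slurm interactions injected as functions (in a real Temporal workflow these would be activities; all names here are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// JobDeps holds the operations the per-job workflow needs. In a real
// Temporal workflow these would be activity invocations.
type JobDeps struct {
	Submit func(jobID string) error         // submit the job to the Slurm cluster
	Await  func(jobID string) (bool, error) // block until done; true = success
	Record func(jobID string) error         // record details in the external DB
}

// RunJob submits the job, watches for completion, records success, and
// resubmits on failure, up to maxAttempts times.
func RunJob(jobID string, deps JobDeps, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := deps.Submit(jobID); err != nil {
			return fmt.Errorf("submit failed: %w", err)
		}
		ok, err := deps.Await(jobID)
		if err != nil {
			return fmt.Errorf("watch failed: %w", err)
		}
		if ok {
			return deps.Record(jobID) // step 3: record details
		}
		// step 4: job failed; loop around and resubmit
	}
	return errors.New("job failed after all attempts")
}

func main() {
	attempts := 0
	deps := JobDeps{
		Submit: func(id string) error { attempts++; return nil },
		Await:  func(id string) (bool, error) { return attempts >= 2, nil }, // fail once, then succeed
		Record: func(id string) error { fmt.Println("recorded", id); return nil },
	}
	fmt.Println(RunJob("job-42", deps, 3))
}
```

In the Temporal version, the retry-on-failure loop could also be expressed through activity retry policies rather than an explicit loop; the sketch just makes the control flow visible.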

Now, I want to layer on the ability to manage a job queue that can:

  1. accept new job requests (in the form of a unique business id)
  2. avoid running a job that has been run before
  3. be able to report on the number of running, queued, and completed jobs
  4. be able to limit the total number of running jobs (based on querying about available resources)
  5. start new and get status of currently running jobs

The number of queued jobs could be large (500k), but each job is long-running, so Temporal is unlikely to be stressed in any major way. Assuming that I am using a parent-child relationship and that the child works, any suggestions on what the parent would look like? I’m assuming that I need a continue_as_new approach? And how should I manage the state of the queue, particularly since resource limits should be respected?

Thanks for any thoughts.

How many jobs per second maximum do you expect to start?

Thanks, @maxim. The rate-limiting step is the external resource: a traditional Slurm cluster where latency and job resources will lead to probably fewer than 10 jobs/minute.

Maybe to crystallize a bit further: I’m looking to be able to see metrics of job status and, potentially, overall system performance (jobs/hour, etc.). I know I can capture all of this with an external state store. I’m really interested in the extent to which it is feasible and practical to do so with Temporal alone.

I don’t think you want to use the parent child relationship for this. A single Temporal workflow instance cannot support 500k simultaneous children.

So you have to start all the children as independent workflows. Each of these workflows would signal a “semaphore workflow” to get permission to run. Upon receiving a response signal granting permission, the rest of the job executes.

The semaphore workflow would receive the request signals and grant permissions with reply signals. As the state is too large to keep inside the workflow, it would need an external DB to store statistics and the request queue, accessed via an activity. Then use another activity to decide which job should be granted permission based on the DB data.
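The semaphore's bookkeeping, separated from the Temporal and DB plumbing, could look something like this (plain Go; the in-memory maps and queue are illustrative stand-ins for the external-DB activities described above, and all names are hypothetical):

```go
package main

import "fmt"

// Semaphore tracks job state and grants run permission up to a limit.
// In the real design the queue and counters would live in an external DB
// accessed via activities; this in-memory version only shows the logic.
type Semaphore struct {
	limit     int
	running   map[string]bool
	queued    []string
	seen      map[string]bool
	completed int
}

func NewSemaphore(limit int) *Semaphore {
	return &Semaphore{limit: limit, running: map[string]bool{}, seen: map[string]bool{}}
}

// Request enqueues a job by business id and returns any jobs now granted
// permission. A business id that has been seen before is rejected
// (requirement 2: avoid re-running a job).
func (s *Semaphore) Request(jobID string) []string {
	if s.seen[jobID] {
		return nil
	}
	s.seen[jobID] = true
	s.queued = append(s.queued, jobID)
	return s.grant()
}

// Release marks a job complete, freeing a slot; returns newly granted jobs.
func (s *Semaphore) Release(jobID string) []string {
	delete(s.running, jobID)
	s.completed++
	return s.grant()
}

// grant moves queued jobs into running while capacity remains. In the
// Temporal version, this is where the workflow would send reply signals.
func (s *Semaphore) grant() []string {
	var granted []string
	for len(s.running) < s.limit && len(s.queued) > 0 {
		id := s.queued[0]
		s.queued = s.queued[1:]
		s.running[id] = true
		granted = append(granted, id)
	}
	return granted
}

// Stats reports running, queued, and completed counts (requirement 3).
func (s *Semaphore) Stats() (running, queued, completed int) {
	return len(s.running), len(s.queued), s.completed
}

func main() {
	sem := NewSemaphore(2)
	fmt.Println(sem.Request("a")) // granted immediately
	fmt.Println(sem.Request("b")) // granted immediately
	fmt.Println(sem.Request("c")) // limit reached: stays queued
	fmt.Println(sem.Release("a")) // slot freed: c is granted
	fmt.Println(sem.Stats())
}
```

The `limit` here could itself be refreshed periodically by an activity that queries the cluster for available resources (requirement 4), rather than being a fixed number.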


Thanks, Maxim! I am rewarded for asking a question with an unexpected approach that answers the question but also extends my understanding with another Temporal paradigm that I hadn’t recognized (your semaphore workflow). So elegant and easily implementable!

This can be used as an inspiration: samples-go/mutex at main · temporalio/samples-go · GitHub.
Note that a single workflow has limited throughput, so don’t assume that you can use this design for high throughput scenarios.