Adding task orchestration functionality on top of workflows

I’m in the process of designing a SaaS that will allow users to run certain resource-intensive tasks. I’ve been reading through the Temporal docs to figure out how helpful it could be for me. Some of the functionality I need is supported by Temporal out of the box, but other stuff seems to require some work.

The requirements include:

  • The number of concurrent tasks is limited per user so that users aren’t starved by other users’ tasks.
  • Tasks can have priorities attached to them.
  • There are a few separate worker pools that can process different types of tasks. For example, certain tasks will need workers with more CPU power or GPU access, etc.
  • Worker pools need to scale up and down automatically.
  • “Tasks” will be implemented by different teams, possibly using different languages.

This is my current idea of how I could implement the above with Temporal:

  • Each “task” could be a Temporal workflow.
  • At the beginning of each workflow, each team will be required to start a certain child workflow (let’s call it Scheduler) with PARENT_CLOSE_POLICY_REQUEST_CANCEL, and then block waiting for a signal (let’s call the signal ReadyToGo). Conceptually, it will look like this:
    1. ExecuteChildWorkflow('Scheduler', PARENT_CLOSE_POLICY_REQUEST_CANCEL, ...)
    2. Wait for a ReadyToGo signal
    3. Run the rest of the task
  • The Scheduler workflow will:
    • Track the workflows that are currently being executed in a database.
    • Wait until the per-user task limit & priorities allow the workflow to be executed. When that happens, it will send the ReadyToGo signal.
    • When the parent workflow finishes and a cancellation is received, remove the workflow from the database of currently executing workflows.
    • Track and expose metrics needed for autoscaling different types of worker pools (I couldn’t find such metrics in Temporal).
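
To make the admission step concrete, here is a minimal sketch in plain Go of the decision the Scheduler would make before sending ReadyToGo. It is not Temporal SDK code: the Admitter type, its methods, and the "higher number means higher priority" rule are all assumptions made for illustration, and the in-memory map stands in for the database.

```go
package main

import (
	"fmt"
	"sort"
)

// Task is a hypothetical record kept for each workflow waiting on ReadyToGo.
type Task struct {
	WorkflowID string
	UserID     string
	Priority   int // assumption: a higher value is admitted first
}

// Admitter tracks running tasks per user and the queue of waiting tasks.
// In the real design this state would live in the database.
type Admitter struct {
	limit   int            // max concurrent tasks per user
	running map[string]int // userID -> number of running tasks
	waiting []Task
}

func NewAdmitter(limit int) *Admitter {
	return &Admitter{limit: limit, running: make(map[string]int)}
}

// Enqueue registers a task that is blocked waiting for ReadyToGo.
func (a *Admitter) Enqueue(t Task) { a.waiting = append(a.waiting, t) }

// Finish records that one of the user's running tasks has completed.
func (a *Admitter) Finish(userID string) {
	if a.running[userID] > 0 {
		a.running[userID]--
	}
}

// Admit returns the tasks that may start now, highest priority first,
// and marks them as running. The caller would signal ReadyToGo to each.
func (a *Admitter) Admit() []Task {
	sort.SliceStable(a.waiting, func(i, j int) bool {
		return a.waiting[i].Priority > a.waiting[j].Priority
	})
	var admitted, still []Task
	for _, t := range a.waiting {
		if a.running[t.UserID] < a.limit {
			a.running[t.UserID]++
			admitted = append(admitted, t)
		} else {
			still = append(still, t)
		}
	}
	a.waiting = still
	return admitted
}

func main() {
	a := NewAdmitter(1) // one concurrent task per user
	a.Enqueue(Task{WorkflowID: "wf-1", UserID: "alice", Priority: 1})
	a.Enqueue(Task{WorkflowID: "wf-2", UserID: "alice", Priority: 5})
	a.Enqueue(Task{WorkflowID: "wf-3", UserID: "bob", Priority: 1})

	for _, t := range a.Admit() {
		fmt.Println("ReadyToGo ->", t.WorkflowID)
	}
	a.Finish("alice") // alice's running task completed
	for _, t := range a.Admit() {
		fmt.Println("ReadyToGo ->", t.WorkflowID)
	}
}
```

With a limit of one task per user, wf-2 and wf-3 are admitted first, and wf-1 only after alice's running task finishes.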

My reasoning for why it should work is this:

  • The “tasks” (i.e., Temporal workflows) that cannot yet be executed won’t consume any resources besides some space in the DB/cache because they will be blocked on a signal.
  • The Scheduler workflow could wait on a signal, too. (If it waited on an activity instead, that activity would actively consume CPU/memory.) The signal waking up the Scheduler would be sent from an external system (not running within Temporal).
  • Tracking which “tasks” are executing happens through a child workflow. An alternative approach would be to use two activities (e.g., Begin and End) instead, where the End one would always need to be called before completing the workflow. However, someone could forget to call End at the end. With child workflows, we don’t have to worry about this.

Does the above look feasible? I’m not sure whether I’m misusing something or missing a better way to solve this. I’d really appreciate some advice.

I would reverse the parent and child. Make the parent perform resource-related actions and then invoke the child when the resource is granted.

Thanks!

I’ve been wondering whether the blocked workflows would actually consume workers’ memory. I’ve read here that this isn’t a problem, but that reply was referring to the Java SDK, and I’ll be using Go and C#.

I’m wondering how that could even work in Go: to release a goroutine that is running a workflow, the SDK would need to return a special error or panic inside ExecuteActivity and other blocking methods (and then catch that error).
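
To make the question concrete, here is a rough sketch of the mechanism I have in mind, in plain Go. It is only my guess, not how the Temporal Go SDK works: errEvicted, await, and runWorkflow are names I made up, and await stands in for a blocking call such as ExecuteActivity.

```go
package main

import (
	"errors"
	"fmt"
)

// errEvicted is a hypothetical sentinel used to unwind a blocked goroutine.
var errEvicted = errors.New("workflow evicted")

// await blocks until a result arrives, or panics with errEvicted when the
// (hypothetical) runtime closes evict to ask the goroutine to unwind.
func await(result <-chan string, evict <-chan struct{}) string {
	select {
	case r := <-result:
		return r
	case <-evict:
		panic(errEvicted)
	}
}

// runWorkflow runs body on its own goroutine and reports on the returned
// channel whether the goroutine ended because of an eviction.
func runWorkflow(body func()) <-chan bool {
	done := make(chan bool, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				if err, ok := r.(error); ok && errors.Is(err, errEvicted) {
					done <- true // unwound on request, stack released
					return
				}
				panic(r) // unrelated panic: re-raise it
			}
			done <- false // body returned normally
		}()
		body()
	}()
	return done
}

func main() {
	result := make(chan string) // never written to: the workflow stays blocked
	evict := make(chan struct{})

	done := runWorkflow(func() {
		fmt.Println("got", await(result, evict))
	})

	close(evict) // ask the blocked goroutine to release its resources
	fmt.Println("evicted:", <-done)
}
```

The blocked goroutine never receives a result, so closing evict is the only way out of the select, and the deferred recover turns the sentinel panic into a clean exit.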

When asked to release resources, a goroutine calls runtime.Goexit: