At my $DAYJOB, I am tasked with designing a distributed pixel streaming system in which GPU-bearing nodes act as the task runners and do the hard work. I decided to use a stateful, durable workflow engine like Temporal to get a fault-oblivious system, since our GPU nodes can fail at any time for reasons including, but not limited to, power loss, network and hardware failure, and planned OS and kernel upgrades. In short, our tasks must resume when these nodes come back up.
I am planning to have one workflow execution per task runner (which we add through an internal REST API), spawn the actual tasks as child workflows in response to signals (which we also send from our internal API), and then let the parent continue-as-new. In theory, this lets us manage the lifecycle of task runners (i.e. parents) together with their tasks (i.e. children), both from our company dashboard (e.g. stop a task runner, so the cancellation cascades and all related tasks are terminated too) and through automated means (e.g. if a task runner stops sending heartbeats to our internal API, trigger a cancellation of the parent workflow).
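To make the automated part concrete, here is a minimal sketch of the heartbeat check I have in mind on the internal API side. It is plain Python with no Temporal SDK calls; `HeartbeatWatchdog`, `parent_workflow_id`, the `task-runner-<id>` naming scheme, and the timeout value are all assumptions for illustration. The returned workflow IDs would then be handed to whatever client call requests cancellation of the parent.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HeartbeatWatchdog:
    """Tracks the last heartbeat per task runner and reports stale ones.

    Hypothetical helper living in the internal API, not part of Temporal.
    """

    timeout_seconds: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, runner_id: str, now: float) -> None:
        # Called whenever a task runner pings the internal API.
        self.last_seen[runner_id] = now

    def stale_runners(self, now: float) -> list[str]:
        # A runner is stale once its last heartbeat is older than the timeout.
        return sorted(
            runner_id
            for runner_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


def parent_workflow_id(runner_id: str) -> str:
    # Assumed naming scheme: one parent workflow per task runner.
    return f"task-runner-{runner_id}"


def workflows_to_cancel(watchdog: HeartbeatWatchdog, now: float) -> list[str]:
    # Parent workflow IDs for which a cancellation should be requested.
    return [parent_workflow_id(r) for r in watchdog.stale_runners(now)]
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the check easy to test.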
Do you think this is the way I should design it? I feel like I am overengineering this a little bit, so I am asking for general advice here.
Do you ever have situations where no tasks are running? Or is at least one task always running per GPU?
The caveat is that once the workflow calls continue-as-new, the abandoned children will not be canceled automatically when the parent completes. So you need to implement their cleanup yourself using signals.
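One way to picture that cleanup is the rough model below, written in plain Python rather than SDK code. The parent keeps the IDs of its live children in the state it passes to its next run, and on a stop request it signals each of them. `RunnerState`, `ParentRun`, the `"stop"` signal name, and the child ID format are hypothetical; in a real workflow the `sent_signals` list would be replaced by actual signal calls to the child workflows.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RunnerState:
    """State the parent hands to its next run on continue-as-new."""

    runner_id: str
    child_ids: list[str] = field(default_factory=list)


@dataclass
class ParentRun:
    """Models one run of the parent workflow (illustrative, not SDK code)."""

    state: RunnerState
    sent_signals: list[tuple[str, str]] = field(default_factory=list)

    def on_start_task(self, task_id: str) -> str:
        # Signal handler: remember the child so later runs can still reach it.
        child_id = f"{self.state.runner_id}/task-{task_id}"
        self.state.child_ids.append(child_id)
        return child_id

    def on_task_finished(self, child_id: str) -> None:
        # Forget children that have already completed.
        if child_id in self.state.child_ids:
            self.state.child_ids.remove(child_id)

    def continue_as_new(self) -> ParentRun:
        # The new run starts with a copy of the still-live child IDs.
        carried = RunnerState(self.state.runner_id, list(self.state.child_ids))
        return ParentRun(state=carried)

    def on_stop(self) -> list[str]:
        # Cleanup: signal every tracked child, including ones started by earlier runs.
        stopped = list(self.state.child_ids)
        for child_id in stopped:
            self.sent_signals.append((child_id, "stop"))
        self.state.child_ids.clear()
        return stopped
```

The point of the model is only that the list of child IDs has to survive each continue-as-new; otherwise a later run has no way of knowing which children to signal.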
Does this mean that with continue-as-new, the parent completes and its children are left in a detached state? Would skipping continue-as-new be an option here, given that we expect relatively low event traffic in Temporal and are therefore unlikely to hit its limits?
Yes, task runners can be completely idle, with no tasks scheduled on them. How would Temporal run out of history in that case, when the workflow is only waiting for a signal to arrive before it creates a child workflow?