Hi there. Thanks in advance for your help.
Our use case: We have very long-running jobs (1-12 hours) that we run as part of a larger workflow. The code for these jobs is entirely owned by us, so we have flexibility in how we manage the jobs.
We are thinking of using a very quick activity to trigger/launch the job, which is handled by a totally different process. Then the job itself will signal the workflow when it is done. I’ve seen this pattern mentioned in a few other places, so I think this is a good approach (but please correct me if I am wrong). [EDIT: after some more reading, maybe this is a prime use case for Async Activity Completion and I don’t need signals at all? It looks like I could use RecordActivityHeartbeat() to heartbeat from the spawned process.]
The complication: It is possible that this async job will crash, in which case we would want the workflow to relaunch the job. So, I want the workflow to be able to detect when/if the job has crashed.
I was thinking we could have the job’s code use Signaling as a sort of “heartbeat”. Basically, the job would say “I’m alive” once every 2 minutes or so. The workflow would select on the signal channel alongside a timer. If no signal arrives within e.g. 3 minutes, the timer fires, the workflow assumes the job has crashed, and it launches a new job (which will start signaling again).
Is this a valid approach? Are there any potential issues that I should look out for? I looked around for examples of using Signals as a kind of heartbeat but failed to find this use case (apologies if I missed documentation). Note that this would be simpler if the job were a long-running activity (we could just use Temporal’s built-in activity heartbeating), but I’ve seen advice to use quick activities plus signaling where possible.