Hi, I'd like some advice on how to make maximum use of our Temporal resources without overwhelming the system in the scenario below:
Our use case is sending emails to users. Each email campaign can target millions of users, and we’re going to create one workflow execution per campaign and user. In this workflow:
1. We first check whether now is a good time to send to this user. This is done by determining the next optimal send time from a number of data points fetched from other services.
2. Then we calculate `max(0, optimal send time - current timestamp)` as the sleep duration. This sleep duration can be up to 24h.
3. When the sleep duration is 0, we perform other checks (activities) and eventually send the email.
   → Additional throttling can happen in the middle.
4. Otherwise, the workflow execution sleeps for that duration and starts from #1 again to fetch the optimal send time.
   → This wake-up and re-execution can happen a few times until it reaches the max duration.
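To make the loop above concrete, here is a minimal Python sketch of the control flow only. It is not Temporal SDK code: `fetch_optimal_send_time`, `now`, `sleep`, `send`, and `max_duration` are hypothetical stand-ins passed in as plain callables and values, and the extra checks and throttling before sending are left out. In a real workflow the sleep would be a durable timer rather than an ordinary function call.

```python
from datetime import datetime, timedelta
from typing import Callable


def sleep_duration(optimal_send_time: datetime, now: datetime) -> timedelta:
    """max(0, optimal send time - current timestamp)."""
    return max(timedelta(0), optimal_send_time - now)


def run_for_user(
    fetch_optimal_send_time: Callable[[], datetime],  # hypothetical: calls other services
    now: Callable[[], datetime],                      # hypothetical clock
    sleep: Callable[[timedelta], None],               # hypothetical: a timer in a real workflow
    send: Callable[[], None],                         # hypothetical: checks + send activities
    max_duration: timedelta,                          # assumed overall time budget
) -> bool:
    """Return True if the email was sent, False if the time budget ran out."""
    started = now()
    while now() - started <= max_duration:
        wait = sleep_duration(fetch_optimal_send_time(), now())
        if wait == timedelta(0):
            send()           # sleep duration is 0: run the checks and send
            return True
        sleep(wait)          # otherwise sleep, then re-fetch the optimal send time
    return False
```

Injecting the clock and the sleep function keeps the sketch testable with a fake clock, which is roughly how the loop would be replayed deterministically anyway.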
Let’s say that, with the provisioned resources, our Temporal cluster can only handle roughly 1k concurrent workflow executions (this is not a strict limit, but an estimate based on the number of state transitions per workflow execution). If there are 2 email campaigns, each campaign gets a quota of 500 concurrent workflow executions.
Intuitively, we can record the current number of workflow executions per campaign in some data store, check the count before starting any new execution, and only start a new one (and increment the count) when the count is < 500, then decrement the count when the email is sent out.
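As a rough illustration of that counter, here is an in-memory Python sketch. The `CampaignQuota` class and its method names are made up for this example, and a plain dict behind a lock stands in for whatever external data store would actually hold the counts.

```python
import threading


class CampaignQuota:
    """Per-campaign counter of running executions (in-memory stand-in for a data store)."""

    def __init__(self, limit_per_campaign: int) -> None:
        self._limit = limit_per_campaign
        self._counts: dict[str, int] = {}
        self._lock = threading.Lock()

    def try_acquire(self, campaign_id: str) -> bool:
        """Check and increment atomically; call before starting a new execution."""
        with self._lock:
            current = self._counts.get(campaign_id, 0)
            if current >= self._limit:
                return False
            self._counts[campaign_id] = current + 1
            return True

    def release(self, campaign_id: str) -> None:
        """Decrement once the email has been sent."""
        with self._lock:
            current = self._counts.get(campaign_id, 0)
            if current > 0:
                self._counts[campaign_id] = current - 1

    def in_use(self, campaign_id: str) -> int:
        with self._lock:
            return self._counts.get(campaign_id, 0)
```

The check and the increment happen under the same lock, since a separate read followed by a write would let two starters both pass the `< 500` check.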
However, this is inefficient because sleeping executions don't consume resources. Ideally, if 250 executions are in the sleep state, we want to start another 250 executions instead of waiting for those 250 to finish first.
So here are a couple of thoughts:
1. Have some way to ‘release’ the quota when the workflow execution enters a ‘sleep’ state.
   - When it wakes up, does it need to re-acquire quota? Although we can add some jitter to prevent executions from waking up at the same time, there could still be millions (up to hundreds of millions) of executions concurrently running in the system.
2. Let Temporal return the count of non-sleeping executions.
   - This would be better than keeping the count in an external store ourselves, but I don’t think it’s available?
3. Use QPS instead of an absolute count.
   - New executions can still be started, but the concern is similar to the first one, as there could be a large number of workflow executions running in the system.
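To illustrate the first thought and the re-acquire question, here is a small single-threaded Python sketch in which only non-sleeping executions count against the limit. `SleepAwareQuota` and its methods are hypothetical names for this example, not anything Temporal provides, and the in-memory sets stand in for an external store.

```python
class SleepAwareQuota:
    """Only non-sleeping executions hold a quota slot."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.active: set[int] = set()
        self.sleeping: set[int] = set()

    def try_start(self, execution_id: int) -> bool:
        if execution_id in self.active or execution_id in self.sleeping:
            return False  # already known
        if len(self.active) >= self.limit:
            return False  # no free slot
        self.active.add(execution_id)
        return True

    def enter_sleep(self, execution_id: int) -> None:
        """Release the slot when the execution starts sleeping."""
        if execution_id in self.active:
            self.active.remove(execution_id)
            self.sleeping.add(execution_id)

    def try_wake(self, execution_id: int) -> bool:
        """Re-acquire a slot on wake-up; this can fail if new executions took it."""
        if execution_id not in self.sleeping or len(self.active) >= self.limit:
            return False
        self.sleeping.remove(execution_id)
        self.active.add(execution_id)
        return True

    def finish(self, execution_id: int) -> None:
        self.active.discard(execution_id)
        self.sleeping.discard(execution_id)


quota = SleepAwareQuota(limit=500)
started = sum(quota.try_start(i) for i in range(600))          # fill the quota
for i in range(250):
    quota.enter_sleep(i)                                       # 250 go to sleep
started_more = sum(quota.try_start(i) for i in range(600, 1000))
woke = quota.try_wake(0)                                       # slots are all taken again
```

With these numbers the 250 released slots get reused by new executions, and the first sleeper that wakes up cannot re-acquire a slot until something finishes, which is exactly the open question in thought 1.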
One additional note: we’re also evaluating a Cloud migration. Would any of these concerns no longer be a problem if we migrate to Cloud?
Any suggestions for this kind of scenario? Thanks in advance!