How to use Temporal resources efficiently without overwhelming the system when workflows sleep for a long time

Hi, I’d like some advice on how to make maximum use of our Temporal resources without overwhelming the system in the scenario below:

Our use case is sending emails to users. Each email campaign can target millions of users, and we’re going to create one workflow execution per campaign and user. In this workflow:

  1. We first check whether now is a good time to send to this user. We do this by determining the next optimal send time from a set of data points fetched from other services.

  2. Then we calculate max(0, optimal send time - current timestamp) as the sleep duration. This sleep duration can be up to 24h.

  3. When the sleep duration is 0, we perform other checks (activities) and eventually send the email.
    → Additional throttling can happen along the way.

  4. Otherwise, the workflow execution sleeps for that duration and starts again from #1 to re-fetch the optimal send time.
    → This wake-up and re-execution can happen a few times until it reaches the max duration.
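
For illustration, steps 1 to 4 can be sketched in plain Python as below. This is only a sketch and not Temporal SDK code: `sleep_duration`, `run_send_loop`, and the injected callables are hypothetical names, and `max_wakeups` is an assumed stand-in for the "max duration" limit. In a real workflow the sleep would be a durable timer from the SDK.

```python
from datetime import datetime, timedelta, timezone


def sleep_duration(optimal_send_time: datetime, now: datetime) -> timedelta:
    """Step 2: max(0, optimal send time - current timestamp)."""
    return max(timedelta(0), optimal_send_time - now)


def run_send_loop(fetch_optimal_send_time, now_fn, sleep_fn, send_fn, max_wakeups=5):
    """Steps 1-4. All callables are injected placeholders (assumptions).

    Returns True if the email was sent, False if the loop gave up
    after max_wakeups iterations (an assumed limit).
    """
    for _ in range(max_wakeups):
        # Step 1: re-fetch the optimal send time on every wake-up.
        wait = sleep_duration(fetch_optimal_send_time(), now_fn())
        if wait == timedelta(0):
            # Step 3: other checks would run here before sending.
            send_fn()
            return True
        # Step 4: sleep, then start again from step 1.
        sleep_fn(wait)
    return False
```

Passing the clock and the sleep function in as parameters keeps the loop testable without actually waiting.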

Let’s say that, with the provisioned resources, our Temporal cluster can handle roughly 1k concurrent workflow executions (this is not a strict limit, more of an estimate based on the number of state transitions per workflow execution). If there are 2 email campaigns, each campaign gets a quota of 500 concurrent workflow executions.

Intuitively, we could record the current number of workflow executions per campaign in some data store, check the count before starting any new execution, start a new one (and increment the count) only when the count is < 500, and decrement the count when the email is sent out.
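
A minimal sketch of that counter, assuming an in-memory dictionary as a stand-in for the external data store (a real deployment would need an atomic check-and-increment in whatever store is used). `CampaignQuota` and its method names are hypothetical.

```python
import threading
from collections import defaultdict


class CampaignQuota:
    """Per-campaign count of running executions (in-memory stand-in)."""

    def __init__(self, limit_per_campaign: int):
        self._limit = limit_per_campaign
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, campaign_id: str) -> bool:
        """Check the count and increment it in one atomic step."""
        with self._lock:
            if self._counts[campaign_id] >= self._limit:
                return False
            self._counts[campaign_id] += 1
            return True

    def release(self, campaign_id: str) -> None:
        """Decrement the count, e.g. once the email has been sent."""
        with self._lock:
            if self._counts[campaign_id] > 0:
                self._counts[campaign_id] -= 1

    def in_use(self, campaign_id: str) -> int:
        with self._lock:
            return self._counts[campaign_id]
```

A starter process would call `try_acquire` before starting an execution and skip or retry later when it returns `False`.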

However, this is not efficient, because a sleeping workflow is not consuming resources. Ideally, if 250 executions are in the sleep state, we want to start another 250 executions instead of waiting for those 250 to finish first.

So here are a couple of thoughts:

  • Have some way to ‘release’ the quota when the workflow execution enters a ‘sleep’ state.

    • When it wakes up, does it need to re-acquire quota? Although we can add some jitter to prevent executions from all being woken up at the same time, there could still be millions (up to hundreds of millions) of executions running concurrently in the system.
  • Let Temporal return the count of non-sleeping executions.

    • This would be better than keeping the count in an external store ourselves, but I don’t think it’s available?
  • Use QPS instead of absolute count.

    • New executions can still be started, but the concern is similar to the first one, as there could be a large number of workflow executions running in the system.
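
To make the QPS idea concrete, here is a minimal token-bucket sketch in plain Python. It is an illustration under assumptions, not a Temporal feature: `TokenBucket`, `rate_per_sec`, and `capacity` are hypothetical names, and the bucket lives in a single process.

```python
import time


class TokenBucket:
    """Allows roughly rate_per_sec starts per second, with bursts up to capacity."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock  # injectable so the bucket can be tested without waiting
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend tokens if enough are available."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Unlike an absolute count, the bucket never needs a release step: sleeping executions stop consuming tokens on their own, which is the appeal of this option.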

One additional note: we’re also evaluating a migration to Cloud. Would any of these concerns go away if we migrate?

Any suggestions for this kind of scenario? Thanks in advance!

IMHO, you are overthinking this. The scale of a Temporal cluster is proportional to the number of actions to be executed per second. The number of open workflows doesn’t really matter, assuming that they fit into the DB disk.

Temporal has protections against too many timers scheduled at the same time. But ideally, you want to spread them out. So avoid situations when millions of workflows schedule a timer to wake up at exactly the same moment. Otherwise, it should just work.
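
As a rough illustration of spreading timers out, the sleep duration can be padded with random jitter. This is a plain-Python sketch with hypothetical names (`jittered_sleep`, `max_jitter`); inside a workflow the random value would have to come from a deterministic source, or be computed in an activity, so that replay stays consistent.

```python
import random
from datetime import timedelta


def jittered_sleep(base: timedelta, max_jitter: timedelta, rng: random.Random) -> timedelta:
    """Return base plus a uniformly random delay in [0, max_jitter]."""
    jitter_seconds = rng.uniform(0, max_jitter.total_seconds())
    return base + timedelta(seconds=jitter_seconds)


# Example: a 1h sleep spread over an extra window of up to 10 minutes.
rng = random.Random()
duration = jittered_sleep(timedelta(hours=1), timedelta(minutes=10), rng)
```

The wider the jitter window relative to the number of workflows, the lower the peak rate of timers firing.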

Yes, our goal is to limit the number of actions per second. The reason we limit the number of open workflows is that it’s easier to control.

In our case, the traffic is dynamic and unpredictable. For cost reasons, our cluster is provisioned at a size that works for average load but not for peak load.

For example, the cluster can handle 50k actions/sec with no issues, but in the case of a large email send, theoretically 100M workflows could be woken up within 10 minutes (with some jitter), which is about 167k QPS. Wouldn’t that cause trouble for the cluster?
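
The arithmetic behind that estimate, as a small sketch. It assumes one action per wake-up, which is a simplification (a real wake-up likely triggers several actions), and the function names are hypothetical.

```python
def required_rate(wakeups: int, window_seconds: float) -> float:
    """Average wake-ups per second if spread evenly over the window."""
    return wakeups / window_seconds


def min_window_seconds(wakeups: int, capacity_per_sec: float) -> float:
    """Shortest even spread that stays at or below the given capacity."""
    return wakeups / capacity_per_sec


# 100M wake-ups over 10 minutes: about 166,667 per second.
rate = required_rate(100_000_000, 10 * 60)

# To stay at 50k per second, the same wake-ups need 2000 seconds (about 33 minutes).
window = min_window_seconds(100_000_000, 50_000)
```

So under this simplification, the jitter window would need to be a bit more than three times longer to fit within the 50k/sec figure.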