I’m trying to understand how Workflow.sleep works internally, and how Temporal wakes up a sleeping workflow at the scheduled time (or after a certain sleep duration). It seems to me that Temporal has some sort of scheduler/tick system built in to achieve that?
Say we have thousands or millions of sleeping workflows that are scheduled to wake up at the same time; can Temporal handle such cases? E.g. a simple workflow:
public void workflow() {
    // sleep until the target wall-clock time, measured against the workflow's deterministic clock
    long target = Instant.parse("2022-10-09T22:09:59Z").toEpochMilli();
    Workflow.sleep(Duration.ofMillis(target - Workflow.currentTimeMillis()));
    doSth();
}
I’d also like to hear your opinion on using Temporal versus a common distributed scheduler for such simple scheduling tasks.
Internally, Temporal relies on a durable timer queue abstraction. All of these sleeping workflows will have a timer task scheduled to be delivered at that timestamp. When the time comes, the task is delivered, the appropriate workflow history is updated with a TimerFired event, and a workflow task is put into the workflow task queue. Then your workflow worker picks it up, recovers the workflow to its last state, and executes the doSth() operation.
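To make that last step concrete, here is a minimal sketch of the worker side (Java SDK; a locally running Temporal service, the "my-task-queue" name, and MyWorkflowImpl are assumptions, not anything from this thread):

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class WorkerMain {
    public static void main(String[] args) {
        // Connect to the Temporal frontend and create a client.
        WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
        WorkflowClient client = WorkflowClient.newInstance(service);

        // The worker long-polls the workflow task queue; when a TimerFired event
        // produces a workflow task, the worker replays the workflow to its last
        // state and continues executing it (doSth() in the example above).
        WorkerFactory factory = WorkerFactory.newInstance(client);
        Worker worker = factory.newWorker("my-task-queue");
        worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
        factory.start();
    }
}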
The only problem with scheduling a very large number of workflows to wake up simultaneously is that other namespaces hosted by the same cluster might experience a slowdown in task processing while these millions of timers fire. In this case, we recommend jittering them over some reasonable period of time (say, firing during a 15-minute window). If your cluster is not multi-tenant, then I don’t see any issues.
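A minimal sketch of that jitter suggestion in workflow code (Java SDK; the 15-minute window is just the example figure from above), using Workflow.newRandom() so the randomness stays deterministic on replay:

// Spread wake-ups over a 15-minute window instead of firing all timers at once.
Duration window = Duration.ofMinutes(15);
long jitterMillis = (long) (Workflow.newRandom().nextDouble() * window.toMillis());
long target = Instant.parse("2022-10-09T22:09:59Z").toEpochMilli();
Workflow.sleep(Duration.ofMillis(target - Workflow.currentTimeMillis() + jitterMillis));
doSth();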
Thanks for explaining all the details. Just curious, are there any concerns (scaling issues) if those durable timers last for months or even years?
It is not about the number of timers, but about the disk space that all the open workflows consume. So if the DB disk is provisioned correctly, there shouldn’t be an issue.
I see that for every Workflow.sleep a thread is spawned (workflow-method-workflow_id*). Will it cause resource starvation if I schedule a large number of future workflows?
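For reference, in the Java SDK the number of cached workflows and workflow threads a worker keeps around is bounded by worker factory options; a hedged sketch of tuning those limits (the numbers here are arbitrary assumptions, not recommendations):

import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;

// Cached workflows keep their (suspended) threads; evicted workflows are
// replayed from history on the next workflow task, so these caps bound
// thread usage at the cost of extra replays.
WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
        .setWorkflowCacheSize(600)          // max workflows kept in the sticky cache
        .setMaxWorkflowThreadCount(800)     // max threads across all cached workflows
        .build();
WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);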
Hi @maxim, in my case it seems to be causing an issue. I’m not sure whether the cache is getting cleared or not, but around 67% of the threads are in a waiting state, and most of them are threads spawned by Workflow.sleep().
When we start the application, for the initial 30 minutes the processing rate is 4 to 5k workflows per minute, and I’m able to push workflows at the same rate. After that it decreases to only 500 to 650 per minute. I’m using a Kafka stream to start the workflows. Cassandra performance was similar: when we reached around 5k, about 600 threads were in a waiting state and the heap was reaching 800 MB, while I was using 20 threads to start workflows.
There can be many reasons for this application slowdown, for example not having enough threads allocated to activities. I don’t think workflow caching is the cause.
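As an illustration of the "not enough threads allocated to activities" point, here is a hedged sketch (Java SDK; the values, task queue name, and implementation class names are assumptions) of raising activity concurrency on a worker:

import io.temporal.worker.Worker;
import io.temporal.worker.WorkerOptions;

// Activities run on the worker's activity executor; if this limit is too low,
// workflow progress stalls waiting for activity slots even though workflow
// threads themselves sit idle in a waiting state.
WorkerOptions workerOptions = WorkerOptions.newBuilder()
        .setMaxConcurrentActivityExecutionSize(200)      // activity slots per worker
        .setMaxConcurrentWorkflowTaskExecutionSize(100)  // concurrent workflow tasks
        .build();
Worker worker = factory.newWorker("my-task-queue", workerOptions);
worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
worker.registerActivitiesImplementations(new MyActivitiesImpl());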