I am currently in the design phase of this setup and would appreciate any suggestions or feedback!
- 10,000 machines, each running a Python worker
- Each job is submitted to a specific machine, with a workload of several million jobs per day
- At any given time, each machine can run only one workflow/activity, with no concurrent jobs on any machine
How can I address the scalability bottleneck of having 10,000 dedicated task queues, each served by a single Python worker? What are my options?
Many thanks!
From the workers’ perspective, there’s really no issue with the scenario you describe, except that you’ll need to allow a minimum of two concurrent Workflow Tasks and one concurrent Activity Task per worker (the SDK polls the regular and sticky task queues separately, which is why two is the floor for Workflow Tasks). You will also want to set both Workflow and Activity Task pollers as low as possible (again, 2 and 1 respectively) to avoid putting too much polling pressure on your server.
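For example, here is a minimal sketch of those settings using the Python SDK (`temporalio`); the workflow, activity, and task-queue names are placeholders:

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def run_job(payload: str) -> str:
    # Placeholder for the actual per-machine job logic.
    return f"done: {payload}"


@workflow.defn
class JobWorkflow:
    @workflow.run
    async def run(self, payload: str) -> str:
        return await workflow.execute_activity(
            run_job, payload, start_to_close_timeout=timedelta(minutes=30)
        )


async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="machine-0001",  # hypothetical per-machine queue name
        workflows=[JobWorkflow],
        activities=[run_job],
        # Floor values: 2 Workflow Task slots/pollers (regular + sticky
        # queue), 1 Activity slot so each machine runs one job at a time.
        max_concurrent_workflow_tasks=2,
        max_concurrent_activities=1,
        max_concurrent_workflow_task_polls=2,
        max_concurrent_activity_task_polls=1,
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```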
Now, from the server’s perspective… those numbers are well within Temporal’s capabilities, but operating at this scale will indeed take some careful capacity planning and tuning work.
Are you planning on hosting this yourself, or would it be running in Temporal Cloud? Assuming you’d be self-hosting this, would it be a dedicated Temporal cluster, or are you aiming to add this to an existing Temporal deployment?
> Each job is submitted to a specific machine, with a workload of several million jobs per day
What do you call a “job” in this case? Do you mean millions of jobs per day per worker, or in total?
1 million jobs per day in total, with each job (workflow instance) taking on average 15 minutes to complete.
I see.
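As a quick sanity check on those numbers: 1,000,000 jobs per day is roughly 11.6 workflow starts per second, and at ~15 minutes (900 s) per job, that works out to about 11.6 × 900 ≈ 10,400 workflows running concurrently on average. That already saturates a fleet of 10,000 single-job machines, so you may want to leave some headroom for spikes and retries.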
Assuming you want to operate this on a self-hosted Temporal cluster, you will want to make sure to scale and tune each server node type independently.
It is difficult to predict just how large your Temporal Cluster would need to be, but a high number of task queues and pollers is likely to put pressure on the “matching” and “frontend” nodes, so expect to run more of those relative to the number of “history” or “worker” nodes. You will also want to configure the namespace to have only 1 partition per task queue (a sketch of that config follows below). And of course, you will need to scale the DB servers appropriately.
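For reference, a sketch of what that might look like in the server’s dynamic config (the namespace name is a placeholder; double-check the exact keys against the Temporal version you deploy):

```yaml
# Hypothetical dynamic config snippet: one read/write partition per
# task queue, scoped to the namespace hosting these workers.
matching.numTaskqueueReadPartitions:
  - value: 1
    constraints:
      namespace: "per-machine-jobs"  # placeholder namespace
matching.numTaskqueueWritePartitions:
  - value: 1
    constraints:
      namespace: "per-machine-jobs"
```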
A few more questions:
- How many “actions” per workflow do you expect on average? e.g. activities, timers, signals/updates/queries, etc.?
- Will you have long-running activities that will need to heartbeat? If so, how many heartbeats would you expect on average per workflow?
- Is there any security concern regarding the fact that workers may possibly “see” or “modify” each other’s task queues?
- According to your description, the 10K workers would be handling both Activities and Workflows. I can easily see reasons to run that many Activity workers, but do you really need to run Workflows that way too? Couldn’t you use a limited number of dedicated Workflow Workers on a shared task queue, and route only the Activities to the machines (see the sketch below)?
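To illustrate that last point, here is a minimal sketch (again assuming the Python SDK; the activity name, queue naming scheme, and `machine_id` parameter are hypothetical) of a Workflow whose Workflow Tasks run on a shared queue while each Activity is dispatched to a machine-specific queue:

```python
from datetime import timedelta

from temporalio import workflow


@workflow.defn
class RoutedJobWorkflow:
    @workflow.run
    async def run(self, machine_id: str, payload: str) -> str:
        # Workflow Tasks execute on the shared queue this workflow was
        # started on, served by a small pool of Workflow Workers; only
        # the Activity is sent to the target machine's dedicated queue.
        return await workflow.execute_activity(
            "run_job",  # activity registered by the machine's worker
            payload,
            task_queue=f"machine-{machine_id}",  # hypothetical naming scheme
            start_to_close_timeout=timedelta(minutes=30),
        )
```

That would cut the number of Workflow Task pollers from 10,000 down to however many dedicated Workflow Workers you choose to run.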