I am currently in the design phase of this setup and would appreciate any suggestions or feedback!
- 10,000 machines, each running a Python worker
- Each job is submitted to a specific machine, with a workload of several million jobs per day
- At any given time, each machine can run only one workflow/activity, with no concurrent jobs on any machine
How can I address the scalability bottleneck of having 10,000 dedicated task queues, each served by a single Python worker? What are my options?
Many thanks!
From the workers’ perspective, there’s really no issue with the scenario you describe, except that you’ll need to allow a minimum of two concurrent Workflow Tasks and one concurrent Activity Task per worker (the SDK polls the regular and sticky task queues separately, which is why two is the floor for Workflow Tasks). You will also want to set both Workflow and Activity Task pollers as low as possible (again, 2 and 1 respectively) to avoid putting too much polling pressure on your server.
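For example, here is a minimal sketch of those settings using the Python SDK (`temporalio`); the workflow, activity, and task-queue names are placeholders:

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def run_job(payload: str) -> str:
    # Placeholder for the actual per-machine job logic.
    return f"done: {payload}"


@workflow.defn
class JobWorkflow:
    @workflow.run
    async def run(self, payload: str) -> str:
        return await workflow.execute_activity(
            run_job, payload, start_to_close_timeout=timedelta(minutes=30)
        )


async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="machine-0001",  # hypothetical per-machine queue name
        workflows=[JobWorkflow],
        activities=[run_job],
        # Floor values: 2 Workflow Task slots/pollers (regular + sticky
        # queue), 1 Activity slot so each machine runs one job at a time.
        max_concurrent_workflow_tasks=2,
        max_concurrent_activities=1,
        max_concurrent_workflow_task_polls=2,
        max_concurrent_activity_task_polls=1,
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```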
Now, from the server’s perspective… those numbers are well within Temporal’s capabilities, but operating at this scale will indeed take some careful capacity planning and tuning work.
Are you planning on hosting this yourself, or would it be running in Temporal Cloud? Assuming you’d be self-hosting this, would it be a dedicated Temporal cluster, or are you aiming to add this to an existing Temporal deployment?
> Each job is submitted to a specific machine, with a workload of several million jobs per day
What do you call a “job” in this case? Do you mean millions of jobs per day per worker, or in total?
1 million jobs per day in total, with each job (workflow instance) taking on average 15 minutes to complete.
I see.
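As a quick sanity check on those numbers: 1,000,000 jobs per day is roughly 11.6 workflow starts per second, and at ~15 minutes (900 s) per job, that works out to about 11.6 × 900 ≈ 10,400 workflows running concurrently on average. That already saturates a fleet of 10,000 single-job machines, so you may want to leave some headroom for spikes and retries.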
Assuming you want to operate this on a self-hosted Temporal cluster, you will want to make sure to scale and tune each server node type independently.
It is difficult to predict just how large your Temporal Cluster would need to be, but a high number of task queues and pollers is likely to put pressure on the “matching” and “frontend” nodes, so expect to run more of those relative to the number of “history” or “worker” nodes. You will also want to configure the namespace to have only 1 partition per task queue (a sketch of that config follows below). And of course, you will need to scale the DB servers appropriately.
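For reference, a sketch of what that might look like in the server’s dynamic config (the namespace name is a placeholder; double-check the exact keys against the Temporal version you deploy):

```yaml
# Hypothetical dynamic config snippet: one read/write partition per
# task queue, scoped to the namespace hosting these workers.
matching.numTaskqueueReadPartitions:
  - value: 1
    constraints:
      namespace: "per-machine-jobs"  # placeholder namespace
matching.numTaskqueueWritePartitions:
  - value: 1
    constraints:
      namespace: "per-machine-jobs"
```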
A few more questions:
- How many “actions” per workflow do you expect on average? e.g. activities, timers, signals/updates/queries, etc.?
- Will you have long-running activities that will need to heartbeat? If so, how many heartbeats would you expect on average per workflow?
- Is there any security concern regarding the fact that workers may possibly “see” or “modify” each other’s task queues?
- According to your description, the 10K workers would be handling both Activities and Workflows. I can easily see reasons to run that many Activity workers, but do you really need to run Workflows that way too? Couldn’t you use a limited number of dedicated Workflow Workers on a shared task queue, and route only the Activities to the machines (see the sketch below)?
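To illustrate that last point, here is a minimal sketch (again assuming the Python SDK; the activity name, queue naming scheme, and `machine_id` parameter are hypothetical) of a Workflow whose Workflow Tasks run on a shared queue while each Activity is dispatched to a machine-specific queue:

```python
from datetime import timedelta

from temporalio import workflow


@workflow.defn
class RoutedJobWorkflow:
    @workflow.run
    async def run(self, machine_id: str, payload: str) -> str:
        # Workflow Tasks execute on the shared queue this workflow was
        # started on, served by a small pool of Workflow Workers; only
        # the Activity is sent to the target machine's dedicated queue.
        return await workflow.execute_activity(
            "run_job",  # activity registered by the machine's worker
            payload,
            task_queue=f"machine-{machine_id}",  # hypothetical naming scheme
            start_to_close_timeout=timedelta(minutes=30),
        )
```

That would cut the number of Workflow Task pollers from 10,000 down to however many dedicated Workflow Workers you choose to run.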