What are the recommended settings for workflow and activity pollers count?

Header Note:

Your workers concurrently poll the Temporal server for workflow and activity tasks (using long-polling). The number of pollers configured per worker is important to consider when you are fine-tuning the performance of your Temporal applications.
The settings we are looking to tune here are on the SDK side of your code, specifically the poller-count options in WorkerOptions.


This is a complex question and there is no easy answer.

It depends on how many workers you have, how many tasks you expect each worker to process, and how many concurrently running workflows each worker can handle.

To find the total number of workflow and activity tasks per second, first get the execution history of a representative workflow, for example with tctl:
tctl wf show -w <wfid> -r <runid> --output_filename myhistory.json
Then count how many workflow tasks and activity tasks the history contains, and multiply those counts by the number of workflows per second at peak load. That gives you the total workflow and activity tasks per second.
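
The counting and multiplication above can be sketched in Python. This is only an illustration under stated assumptions: it treats the history as a list of events that each carry an "eventType" field, and it matches both the "WorkflowTaskScheduled" and "EVENT_TYPE_WORKFLOW_TASK_SCHEDULED" spellings because the exact layout of the exported file may differ between tctl versions. The sample events and the rate of 50 workflows per second are made up.

```python
def _normalize(event_type):
    """Lower-case an event type name and drop underscores so both spellings match."""
    return event_type.replace("_", "").lower()


def count_scheduled_tasks(events):
    """Return (workflow_tasks, activity_tasks) scheduled in one execution history."""
    workflow_tasks = 0
    activity_tasks = 0
    for event in events:
        name = _normalize(str(event.get("eventType", "")))
        if name.endswith("workflowtaskscheduled"):
            workflow_tasks += 1
        elif name.endswith("activitytaskscheduled"):
            activity_tasks += 1
    return workflow_tasks, activity_tasks


def peak_tasks_per_second(tasks_per_workflow, peak_workflows_per_second):
    """Tasks per second at peak = tasks per workflow * workflows per second."""
    return tasks_per_workflow * peak_workflows_per_second


# Hypothetical, heavily shortened history; a real one comes from the exported file.
sample_events = [
    {"eventType": "WorkflowExecutionStarted"},
    {"eventType": "WorkflowTaskScheduled"},
    {"eventType": "WorkflowTaskStarted"},
    {"eventType": "WorkflowTaskCompleted"},
    {"eventType": "ActivityTaskScheduled"},
    {"eventType": "ActivityTaskStarted"},
    {"eventType": "ActivityTaskCompleted"},
    {"eventType": "EVENT_TYPE_WORKFLOW_TASK_SCHEDULED"},
]

wf_tasks, act_tasks = count_scheduled_tasks(sample_events)
peak_rate = 50  # assumed workflows per second at peak load
print("workflow tasks/s:", peak_tasks_per_second(wf_tasks, peak_rate))
print("activity tasks/s:", peak_tasks_per_second(act_tasks, peak_rate))
```

To use it on a real export, load the file with the json module and pass in whichever list holds the events; that part depends on the file layout, so it is left out here.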

Once you have that, let’s assume each poller at “full speed” is able to fetch 5 tasks per second. Dividing the total tasks per second by that rate gives a baseline for the total number of pollers needed to cover peak load.
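
As a rough sketch of that baseline, the arithmetic looks like this in Python. The 5 tasks per second figure is the assumption from the text, not a measured value, and the function names and the even split across workers are hypothetical choices made for illustration.

```python
import math


def baseline_pollers(tasks_per_second, tasks_per_poller_per_second=5.0):
    """Total pollers needed so that poll throughput covers the peak task rate."""
    return math.ceil(tasks_per_second / tasks_per_poller_per_second)


def pollers_per_worker(total_pollers, worker_count):
    """Spread the total evenly across workers, rounding up."""
    return math.ceil(total_pollers / worker_count)


total = baseline_pollers(100)  # 100 workflow tasks/s at peak (made-up number)
print("total pollers:", total)
print("pollers per worker with 4 workers:", pollers_per_worker(total, 4))
```

Run it once for workflow tasks and once for activity tasks, since the two poller counts are set separately.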

Then we also need to decide the max concurrent execution size for activity and workflow tasks.
If your activities and workflow tasks are all short-running, the concurrent execution size can stay small.
However, if your activities take longer, or your workflow tasks run local activities, each task occupies an execution slot for longer and the concurrent execution size can become a limitation.
Each concurrent execution takes CPU, so you cannot just set the concurrent execution size arbitrarily high.
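
One way to get a first estimate of the concurrent execution size is Little's law: the average number of tasks in flight is roughly the task arrival rate times the average time each task holds a slot. The sketch below is an assumption-laden starting point, not a rule from the SDK; the function name and the numbers are hypothetical, and the result still has to be checked against the CPU your workers actually have.

```python
import math


def concurrent_slots_needed(tasks_per_second, avg_task_seconds):
    """Little's law estimate: tasks in flight = arrival rate * time per task."""
    return math.ceil(tasks_per_second * avg_task_seconds)


# Short-running activities: few slots are busy at any moment.
print(concurrent_slots_needed(80, 0.125))
# Long-running activities at the same rate: many more slots are needed.
print(concurrent_slots_needed(80, 2.0))
```

The estimate is for the whole fleet; divide it by the number of workers to get a per-worker starting value.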

The next thing to do is to decide how many workers you want to run. You also need to decide how many concurrent workflows each worker can host at peak hours, because each worker process has a sticky cache. Each running workflow execution stays in that cache to support sticky execution.
Sticky execution increases performance on the worker side: when an execution is in the worker's cache, its full history does not have to be replayed by the worker before the fetched workflow task is evaluated.
If you have too many running workflows, the sticky cache will evict some of them, which forces their next workflow task to replay the history from the beginning.
So ideally, you want to make your total sticky cache size big enough to host your workflow executions at these peak hours.
See the worker tuning guide in the docs for more info on this.
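
A minimal sketch of that sizing, assuming running workflows spread evenly across workers. The function names are hypothetical, and the 200 KiB per cached workflow is a placeholder: the real figure depends on your workflow logic and has to be measured in testing.

```python
import math


def cache_size_per_worker(peak_running_workflows, worker_count):
    """Cache entries each worker needs to hold its share of running workflows."""
    return math.ceil(peak_running_workflows / worker_count)


def cache_memory_bytes(cache_size, bytes_per_cached_workflow):
    """Rough memory footprint of a full cache on one worker."""
    return cache_size * bytes_per_cached_workflow


size = cache_size_per_worker(10_000, 8)  # made-up peak and worker count
print("cache entries per worker:", size)
print("approx. bytes per worker:", cache_memory_bytes(size, 200 * 1024))
```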

Testing here is important in order to know the memory size of each running workflow, as this size depends heavily on your workflow logic.
The general rule of thumb is that you always want to have your pollers waiting for tasks, rather than having a task backlog build up.

One metric to monitor is the latency of the PollWorkflowTaskQueue/PollActivityTaskQueue operations.
You want to make sure that some requests for these two API calls always have a latency as close as possible to their timeout (like 60s). If you see your P95 poll latency start to drop, it means your pollers are not waiting for tasks, which means a task backlog is likely building up.
A Grafana query you could use for this is:

histogram_quantile(0.95, sum(rate(service_latency_bucket{operation=~"PollActivityTaskQueue|PollWorkflowTaskQueue"}[5m])) by (operation, le))

Footer note:

This content was created by @Yimin_Chen on Slack. I just formatted it for the forum and added a link or two.