I’m encountering an issue with activity execution performance on a Temporal worker and would appreciate some insights.
My Setup:
- I’m scheduling a large number of workflows, which in turn trigger child workflows and schedule numerous activities.
- Activities are distributed across two main task queues:
  - One task queue is for specific activities that are rate-limited externally (e.g., calls to an external API with quotas). This queue is functioning as expected, and we can disregard it for this problem.
  - A "default" task queue for all other activities, which is not intentionally rate-limited.
- My Temporal worker is running on a machine that shows low resource stress (CPU usage consistently below 30%).
- Activities are implemented as `async def` methods. Within these activities, I'm using `await loop.run_in_executor(None, blocking_function, *args)` to run blocking I/O-bound or CPU-bound tasks (a minimal sketch of this pattern follows the config snippet below).
- The worker is configured with a `ThreadPoolExecutor` for activities. Here's the relevant worker configuration from my `main.py`:
```python
from concurrent.futures import ThreadPoolExecutor

from temporalio.worker import PollerBehaviorAutoscaling, Worker

# ...
activity_executor = ThreadPoolExecutor(max_workers=100)
# ...

# Configuration for the "default" task queue worker
general_worker = Worker(
    client,
    task_queue=settings.temporal_task_queue,  # the default, non-rate-limited queue
    workflows=workflows_list,
    activities=activities_list,
    workflow_task_poller_behavior=PollerBehaviorAutoscaling(),
    activity_task_poller_behavior=PollerBehaviorAutoscaling(),
    max_concurrent_activities=50,
    activity_executor=activity_executor,  # shared with the rate-limited queue's worker
)
# ...
```
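For context, my activities follow roughly this pattern; the names and the 2-second sleep below are illustrative stand-ins, not my real code:

```python
import asyncio
import time

from temporalio import activity


def blocking_function(item_id: str) -> str:
    # Stand-in for the real blocking I/O- or CPU-bound work
    # (external calls, DB queries, file processing, etc.).
    time.sleep(2)
    return item_id


@activity.defn
async def process_item(item_id: str) -> str:
    loop = asyncio.get_running_loop()
    # Passing None uses the event loop's default executor, mirroring the
    # `await loop.run_in_executor(None, blocking_function, *args)` call described above.
    return await loop.run_in_executor(None, blocking_function, item_id)
```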
The Problem:
Activities on the default task queue are not executing as quickly as I would expect. These activities typically complete in a few seconds or less when run in isolation. However, under load, I’m observing:
- Increased Latency: The activities take significantly longer to complete.
- Start-to-Close Timeouts: Many activities are hitting their start-to-close timeout.
- Bottleneck at `run_in_executor`: It appears that when an activity task is picked up by the worker and its Python method is invoked, the call to `await loop.run_in_executor(...)` doesn't immediately execute the target function. Instead, these calls seem to queue up, waiting for a thread in the `ThreadPoolExecutor` (see the timing sketch after this list).
- Scaling `max_workers` in the `ThreadPoolExecutor`: I've tried increasing `max_workers` (e.g., from 50 to 100, and even higher), but this hasn't resolved the issue. The machine's CPU usage remains low, yet more activities seem to queue up internally before `run_in_executor` actually runs them. This might be somewhat expected, as with more available worker slots, more activities can be pulled from Temporal and begin processing, only to then wait on the `run_in_executor` call.
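One way to make the queuing visible would be to time the gap between submitting the call and the blocking function actually starting; something along these lines (illustrative only, `process_item_timed` and `blocking_function` are placeholder names):

```python
import asyncio
import time

from temporalio import activity


def blocking_function(item_id: str) -> str:
    time.sleep(2)  # stand-in for the real blocking work
    return item_id


@activity.defn
async def process_item_timed(item_id: str) -> str:
    loop = asyncio.get_running_loop()
    submitted_at = time.monotonic()

    def timed_call() -> tuple[float, str]:
        # How long the call sat waiting for a free executor thread.
        return time.monotonic() - submitted_at, blocking_function(item_id)

    queue_wait, result = await loop.run_in_executor(None, timed_call)
    activity.logger.info("executor queue wait: %.2fs", queue_wait)
    return result
```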
My Question(s):
Given that the worker machine's CPU usage is low (< 30%), I'm struggling to understand why the `ThreadPoolExecutor` would be causing such a bottleneck. If the threads were busy with CPU-bound work, I'd expect higher CPU usage. If they are primarily I/O-bound (which many of these tasks are), I thought the `ThreadPoolExecutor` (with a sufficient number of threads, e.g., 100 in my case) and `asyncio` would handle this efficiently by yielding control while waiting for I/O.
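To sanity-check that mental model outside of Temporal, a standalone toy like the following (purely illustrative) shows when a `ThreadPoolExecutor` starts queuing: each task records how long it waited for a thread, and the wait only grows once in-flight submissions exceed `max_workers`, while CPU stays low throughout:

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_TASKS = 200    # simulated concurrent activity calls
MAX_WORKERS = 100  # same order of magnitude as my activity_executor


def io_bound_task(submitted_at: float) -> float:
    queue_wait = time.monotonic() - submitted_at  # time spent waiting for a free thread
    time.sleep(2)  # stand-in for blocking I/O
    return queue_wait


with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(io_bound_task, time.monotonic()) for _ in range(NUM_TASKS)]
    waits = [f.result() for f in futures]

print(f"max queue wait: {max(waits):.2f}s")
```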
- Am I misunderstanding how `max_concurrent_activities` interacts with the `ThreadPoolExecutor`?
- Could sharing the `ThreadPoolExecutor` between the rate-limited queue's worker and the default queue's worker be a contributing factor, even if the rate-limited activities are infrequent? (A rough sketch of how the two workers share it is below.)
- Are there other Temporal worker or `ThreadPoolExecutor` configurations I should be looking into that might explain why tasks are queuing at `run_in_executor` despite low CPU utilization?
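For completeness, the rate-limited queue's worker reuses the same executor instance, roughly like this (simplified; `temporal_rate_limited_task_queue` and `rate_limited_activities_list` are placeholder names, not my real identifiers):

```python
# Simplified sketch of the second worker, sharing the executor with general_worker.
rate_limited_worker = Worker(
    client,
    task_queue=settings.temporal_rate_limited_task_queue,
    workflows=workflows_list,
    activities=rate_limited_activities_list,
    activity_executor=activity_executor,  # same ThreadPoolExecutor as the default-queue worker
)
```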
Any guidance or suggestions on what to investigate would be greatly appreciated!
Thank you!