Understanding Slow Activity Execution and Timeouts Despite Low Worker CPU Usage

I’m encountering an issue with activity execution performance on a Temporal worker and would appreciate some insights.

My Setup:

  • I’m scheduling a large number of workflows, which in turn trigger child workflows and schedule numerous activities.
  • Activities are distributed across two main task queues:
    • One task queue is for specific activities that are rate-limited externally (e.g., calls to an external API with quotas). This queue is functioning as expected, and we can disregard it for this problem.
    • A “default” task queue for all other activities, which is not intentionally rate-limited.
  • My Temporal worker is running on a machine that shows low resource stress (CPU usage consistently below 30%).
  • Activities are implemented as async def methods. Within these activities, I’m using await loop.run_in_executor(None, blocking_function, *args) to run blocking I/O-bound or CPU-bound tasks.
  • The worker is configured with a ThreadPoolExecutor for activities. Here’s a snippet of the relevant worker configuration from my main.py:
# ...
    activity_executor = ThreadPoolExecutor(max_workers=100)

# ...
    # Configuration for the "default" task queue worker
    general_worker = Worker(
        client,
        task_queue=settings.temporal_task_queue, # This is the default, non-rate-limited queue
        workflows=workflows_list,
        activities=activities_list,
        workflow_task_poller_behavior=PollerBehaviorAutoscaling(),
        activity_task_poller_behavior=PollerBehaviorAutoscaling(),
        max_concurrent_activities=50,
        activity_executor=activity_executor, # Shared with the rate-limited queue's worker
    )
# ...

The Problem:

Activities on the default task queue are not executing as quickly as I would expect. These activities typically complete in a few seconds or less when run in isolation. However, under load, I’m observing:

  1. Increased Latency: The activities take significantly longer to complete.
  2. Start-to-Close Timeouts: Many activities are hitting their start-to-close timeout.
  3. Bottleneck at run_in_executor: It appears that when an activity task is picked up by the worker and its Python method is invoked, the call to await loop.run_in_executor(...) doesn’t immediately execute the target function. Instead, these calls seem to be queued up, waiting for a thread in the ThreadPoolExecutor.
  4. Scaling max_workers in ThreadPoolExecutor: I’ve tried increasing the max_workers in the ThreadPoolExecutor (e.g., from 50 to 100, and even higher), but this hasn’t resolved the issue. The machine’s CPU usage remains low, but more activities seem to get queued up internally before run_in_executor actually processes them. This might be somewhat expected, as with more available worker slots in the executor, more activities can be pulled from Temporal and begin processing, only to then wait for the run_in_executor call.

My Question(s):

Given that the worker machine’s CPU usage is low (< 30%), I’m struggling to understand why the ThreadPoolExecutor would be causing such a bottleneck. If the threads were busy with CPU-bound work, I’d expect higher CPU usage. If they are primarily I/O-bound (which many of these tasks are), I thought the ThreadPoolExecutor (with a sufficient number of threads, e.g., 100 in my case) and asyncio would handle this efficiently by yielding control while waiting for I/O.

  • Am I misunderstanding how max_concurrent_activities interacts with the ThreadPoolExecutor?
  • Could the sharing of the ThreadPoolExecutor between the rate-limited queue worker and the default queue worker be a contributing factor, even if the rate-limited activities are infrequent?
  • Are there other Temporal worker or ThreadPoolExecutor configurations I should be looking into that might explain why tasks are queuing at run_in_executor despite low CPU utilization?

Any guidance or suggestions on what to investigate would be greatly appreciated!

Thank you!

Correct me if I misunderstand, but it sounds like the interface to Temporal is working OK as far as you can tell (workers are pulling work to do), but then in the execution of the activity you’re running into something with your own code (weirdness with Python’s ThreadPoolExecutor, not related to Temporal). My suggestion in that case is to bypass Temporal, and write a small program which runs the code implementing the activity, skipping the Temporal worker queue. Then it’s easier to try hitting your thread pool with a lot of concurrent requests and see if you can track down why it isn’t working as expected.

Hi awwx, thanks for your reply. I will definitely look into it and try to have a small test maybe without Temporal.

Meanwhile thank you very much for the feedback :slight_smile: