Activity poller becomes inactive, activities stuck in PENDING_ACTIVITY_STATE_SCHEDULED state

Hi everyone,

We have a workflow that periodically fetches a list of ids using an activity and starts a child workflow for each of them. The child workflows run a long-polling activity that sends periodic heartbeats. Child workflows also respond to updates by making a request through an activity and returning the response to the client.

Everything works fine initially after deployment (on ECS, with a single Temporal server and one worker instance). After a while, however, the main workflow stops executing activities, with the last activity stuck in PENDING_ACTIVITY_STATE_SCHEDULED state. The long-polling activities continue to run and send periodic heartbeats, which are visible in the UI. If a child workflow receives an update, though, it tries to start an activity that also gets stuck in a pending state.

In the Temporal UI, the worker is shown as active as a Workflow Task Handler but inactive as an Activity Handler. If I run temporal task-queue describe from the CLI immediately after deployment, I can see active pollers for both the workflow and activity task queues. After a while, the backlog begins to grow in the activity task queue and then the activity poller disappears.

The issue seems to resolve itself when the worker task is restarted. The worker then picks up all pending activities, and everything returns to normal for a while. I could not find any messages in the worker or Temporal server logs that would indicate an error around the time of the incident.

Has anyone experienced a similar problem, or have any ideas on what may be causing an activity poller to disappear? Would appreciate any advice on how to diagnose this further.

This is a common problem. The Temporal SDK only polls when there are free slots to execute activities. If all slots are taken, polling stops. Make sure that your activity implementations release their slots after completing. The problem can be exacerbated by activities running longer than the StartToClose timeout: the service sees the timeout and retries the activity according to the retry options, even if the original attempt is still running on a worker.
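To make the slot mechanics concrete, here is a minimal toy model in TypeScript. It is only a sketch of the idea described above, not the SDK's actual implementation; ToyWorker, tryStart, and the helper names are hypothetical and exist only for illustration.

```typescript
// Toy model of a worker with a fixed number of activity slots.
// NOT the Temporal SDK implementation; every name here is hypothetical.
class ToyWorker {
  private used = 0;
  private readonly maxSlots: number;

  constructor(maxSlots: number) {
    this.maxSlots = maxSlots;
  }

  get slotsAvailable(): number {
    return this.maxSlots - this.used;
  }

  // In this model the worker polls for new tasks only while a slot is free.
  get isPolling(): boolean {
    return this.slotsAvailable > 0;
  }

  // Returns false when no slot is free, which corresponds to the task
  // staying in the queue in the SCHEDULED state.
  tryStart(activity: () => Promise<unknown>): boolean {
    if (!this.isPolling) return false;
    this.used++;
    // The slot is released only when the activity's promise settles.
    const release = (): void => {
      this.used--;
    };
    activity().then(release, release);
    return true;
  }
}

// An activity whose promise never resolves or rejects: its slot is never freed.
const neverSettles = (): Promise<void> => new Promise<void>(() => {});

// A well-behaved activity that completes and gives its slot back.
const quick = (): Promise<void> => Promise.resolve();

const worker = new ToyWorker(2);
worker.tryStart(neverSettles); // occupies slot 1 forever
worker.tryStart(neverSettles); // occupies slot 2 forever
console.log(worker.isPolling); // false: no free slots, so polling stops
```

In this model, once every slot is held by a promise that never settles, the worker stops polling and new activity tasks wait in the queue, which matches the symptoms described in the first post.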

Hi Maxim, thank you for your response.

I’m not sure if this description is applicable to our case. The issue can be reproduced with only a single child workflow running. In this setup, there will be at most three activities running at the same time.

Screenshots below demonstrate how the problem typically manifests:

  • fetchAccounts is the activity that runs in the main workflow at regular intervals until it eventually hangs
  • statPolling is the long-polling activity in child workflows; it continues to run and send periodic heartbeats
  • getRawInventory is a short-lived activity started in response to updates sent to a child workflow


The long-polling activity is designed to run forever, sending periodic heartbeats and properly handling cancellation. The other activities have startToCloseTimeout set to a minute, but they usually finish in a couple of seconds and are not likely to run in parallel.

Given this context, could there be other factors at play here?

Have you checked worker_task_slots_available metric?

Thanks for the pointer, Maxim! Enabling the SDK metrics was an invaluable help in diagnosing the issue. As you suggested, I was able to observe the number of used slots growing steadily before the activities froze.

I narrowed the problem down to an activity in the main workflow that was meant to observe a child workflow’s completion after the main workflow was restarted as new.

The change that was needed was to handle activity cancellation as follows:

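import { cancelled } from '@temporalio/activity';

// Wait for the child workflow's result, but stop waiting as soon as the
// activity is cancelled: cancelled() returns a promise that rejects on cancellation.
await Promise.race([
  workflowClient.getHandle(workflowId).result(),
  cancelled(),
]);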
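The effect of the race can be reproduced without the SDK. The sketch below uses a hand-rolled stand-in for cancelled(); makeCancellation, waitOrCancel, and pendingForever are hypothetical names, and the real SDK promise rejects with its own failure type rather than a plain Error.

```typescript
// Hypothetical stand-in for the SDK's cancelled(): a promise that never
// resolves and only rejects once cancel() is called.
function makeCancellation(): {
  cancelled: Promise<never>;
  cancel: (reason: string) => void;
} {
  let cancel: (reason: string) => void = () => {};
  const cancelled = new Promise<never>((_resolve, reject) => {
    cancel = (reason: string) => reject(new Error(reason));
  });
  return { cancelled, cancel };
}

// Stand-in for a result that never arrives, e.g. waiting on a workflow
// that never completes.
const pendingForever: Promise<string> = new Promise<string>(() => {});

// Settles as soon as either the work finishes or cancellation is requested.
function waitOrCancel(
  work: Promise<string>,
  cancelled: Promise<never>
): Promise<string> {
  return Promise.race([work, cancelled]);
}

async function demo(): Promise<string> {
  const { cancelled, cancel } = makeCancellation();
  const pending = waitOrCancel(pendingForever, cancelled);
  cancel("activity cancelled");
  try {
    return await pending;
  } catch (err) {
    // The race rejected, so the function returns instead of hanging forever.
    return (err as Error).message;
  }
}

demo().then((outcome) => console.log(outcome)); // prints "activity cancelled"
```

Without the race, awaiting pendingForever alone would never return, so in the activity case the function would never complete and hand back its slot.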

I now have an idea of how activity slots remain occupied when an activity is retried: if an activity returns a promise that never resolves, its slot will never be freed.

I think this idea is wrong, though, because if the code above could simply be wrapped around every activity, that would be the default behavior. So there’s probably more going on under the hood that I’m missing.

Looking forward to finding more information on activity slots in the future. Thanks for all the pointers, Maxim.