Hey folks! Need some help with Temporal activities intermittently going into a pending state.

Context: we run self-hosted Temporal for a file ingestion use case in our system.
## Problem Summary

We are observing intermittent, multi-minute delays between `ACTIVITY_TASK_SCHEDULED` and `ACTIVITY_TASK_STARTED` for certain activities, even though:

- The correct task queue is used
- Workers are running and polling
- Task queue backlog is reported as zero
- CPU and memory usage on workers are low

Eventually, the delayed activities do start and complete successfully.
## Observed Behavior

- Each file ingestion runs as a Temporal workflow
- A dispatcher workflow periodically starts ingestion workflows
- Most workflows execute normally
- A subset of workflows get stuck at a specific activity, where:
  - `ACTIVITY_TASK_SCHEDULED` appears in history
  - `ACTIVITY_TASK_STARTED` does not appear for several minutes
  - No failures are recorded during the delay
- After a few minutes, the activity starts and proceeds normally
- In the Temporal UI, this appears as an activity remaining in a "pending" state for several minutes
## Evidence from Workflow History

For an affected workflow run, the history shows (timestamps abbreviated):

```
...
EVENT_TYPE_ACTIVITY_TASK_SCHEDULED  GetUploadedFileActivity  raw-ingest-queue
(no ACTIVITY_TASK_STARTED for several minutes)
```

Earlier activities in the same workflow run start within milliseconds of being scheduled, on the same task queue, using the same workers. This suggests the issue is not workflow-wide and not queue-wide, but intermittent and activity-specific.
## What We Have Verified / Ruled Out

We have explicitly ruled out the following:

- **Wrong task queue**: confirmed via workflow history (`activityTaskScheduledEventAttributes.taskQueue.name`)
- **Namespace mismatch**
- **Worker not polling**: `temporal task-queue describe` shows active pollers with a recent `lastAccessTime`
- **Task queue backlog**: `ApproximateBacklogCount = 0` while the activity is pending
- **Worker resource saturation**: CPU ~10%, memory stable, no spikes
- **Activity execution slowness**: the delay occurs before `ACTIVITY_TASK_STARTED`
- **Retry backoff**: the activity has not timed out or retried during the delay window
- **Build ID / Worker Versioning mismatch**: the task queue shows `UNVERSIONED` pollers only
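For reference, this is roughly the check we ran to confirm pollers and backlog (the namespace value here is a placeholder for our actual namespace; flags are from the current `temporal` CLI):

```shell
# Describe the activity task queue: lists active pollers (identity,
# lastAccessTime) and backlog stats for the queue the stuck activity uses.
temporal task-queue describe \
  --task-queue raw-ingest-queue \
  --task-queue-type activity \
  --namespace default
```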
## Worker Configuration (Go SDK)

- `MaxConcurrentActivityExecutionSize`: 500
- `MaxConcurrentActivityTaskPollers`: 10
- `MaxConcurrentWorkflowTaskExecutionSize`: 300
- `MaxConcurrentWorkflowTaskPollers`: 10
- `TaskQueueActivitiesPerSecond`: default (not overridden)

Multiple worker pods are running and polling the same task queue.
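As a concrete sketch, our worker setup looks roughly like this (client address, registration, and error handling simplified; the option names are the Go SDK's `worker.Options` fields listed above):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the self-hosted Temporal frontend (defaults to localhost:7233).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Worker on the ingestion task queue, with the limits described above.
	w := worker.New(c, "raw-ingest-queue", worker.Options{
		MaxConcurrentActivityExecutionSize:     500,
		MaxConcurrentActivityTaskPollers:       10,
		MaxConcurrentWorkflowTaskExecutionSize: 300,
		MaxConcurrentWorkflowTaskPollers:       10,
		// TaskQueueActivitiesPerSecond left at default (no rate limit).
	})

	// (workflow and activity registration elided)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```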
## Why This Is Confusing

From our understanding:

- if a task is scheduled, and the queue has active pollers and zero backlog,
- and workers are healthy and polling,
- then `ACTIVITY_TASK_STARTED` should occur almost immediately.

However, in our case, tasks appear to be neither backlogged nor started for minutes at a time.
## Questions

- Are there known scenarios where activities can experience long Schedule-to-Start latency even when:
  - backlog is zero
  - pollers are active
  - workers are healthy?
- Are there additional server-side or SDK-level throttles or limits (besides the documented concurrency options) that could cause this behavior?
- What additional signals or metrics (on the server or worker side) would you recommend collecting to pinpoint why a scheduled activity is not dispatched to a polling worker?
- Is this behavior consistent with poll RPC backoff, long-poll interruptions, or matching service behavior, and if so, what is the recommended way to detect or confirm it?
## Versions

- Temporal Server: 2.36.1
- Temporal Go SDK: `go.temporal.io/sdk v1.33.0`, `go.temporal.io/api v1.44.1`
- Deployment: Kubernetes, via Helm (self-hosted)
- Number of worker pods: 2
## Some Background

- File metadata goes into a DB (which acts as a queue)
- A Temporal cron workflow spins up every few minutes and picks the files up (this cron runs on a task queue named `task-queue`)
- The cron spins up one child workflow per file on the `raw-ingest-queue` task queue (there is a cap on this number; we are trying to bump it up, and when we do, we run into the issue described above more frequently)
- Each child workflow further has a bunch of activities and another child workflow
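A minimal sketch of that fan-out, assuming hypothetical names (`DispatcherWorkflow`, `IngestFileWorkflow`, `FileRef`, `maxInFlight`) since the real code isn't shown:

```go
package ingest

import (
	"go.temporal.io/sdk/workflow"
)

// FileRef is a stand-in for the file metadata rows pulled from the DB.
type FileRef struct {
	ID string
}

// IngestFileWorkflow is a placeholder for the per-file child workflow,
// which in our system runs several activities and another child workflow.
func IngestFileWorkflow(ctx workflow.Context, f FileRef) error { return nil }

// DispatcherWorkflow sketches the cron workflow: it fans out one child
// ingestion workflow per file onto raw-ingest-queue, up to a cap.
func DispatcherWorkflow(ctx workflow.Context, files []FileRef, maxInFlight int) error {
	if len(files) > maxInFlight {
		files = files[:maxInFlight] // the cap we are trying to raise
	}
	var futures []workflow.ChildWorkflowFuture
	for _, f := range files {
		cctx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
			WorkflowID: "ingest-" + f.ID,
			TaskQueue:  "raw-ingest-queue", // children run on the ingestion queue
		})
		futures = append(futures, workflow.ExecuteChildWorkflow(cctx, IngestFileWorkflow, f))
	}
	// Wait for all children so the cron run observes their results.
	for _, fu := range futures {
		if err := fu.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```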

