Temporal activities going into a pending state intermittently and eventually succeeding

Hey folks! Need some help with Temporal activities intermittently going into a pending state.

Context: We run a self-hosted Temporal cluster for a file ingestion use case in our system.

Problem Summary

We are observing intermittent, multi-minute delays between ACTIVITY_TASK_SCHEDULED and ACTIVITY_TASK_STARTED for certain activities, even though:

  • The correct task queue is used

  • Workers are running and polling

  • Task queue backlog is reported as zero

  • CPU and memory usage on workers are low

Eventually, the delayed activities do start and complete successfully.


Observed Behavior

  • Each file ingestion runs as a Temporal workflow

  • A dispatcher workflow periodically starts ingestion workflows

  • Most workflows execute normally

  • A subset of workflows gets stuck at a specific activity, where:

    • ACTIVITY_TASK_SCHEDULED appears in history

    • ACTIVITY_TASK_STARTED does not appear for several minutes

    • No failures are recorded during the delay

    • After a few minutes, the activity starts and proceeds normally

In the Temporal UI, this appears as an activity remaining in a “pending” state for several minutes.

Evidence from Workflow History

For an affected workflow run, the history shows (timestamps abbreviated):

... 
EVENT_TYPE_ACTIVITY_TASK_SCHEDULED   GetUploadedFileActivity   raw-ingest-queue
(no ACTIVITY_TASK_STARTED for several minutes)

Earlier activities in the same workflow run start within milliseconds of being scheduled, on the same task queue, using the same workers.

This suggests the issue is not workflow-wide or queue-wide, but intermittent and activity-specific.


What We Have Verified / Ruled Out

We have explicitly ruled out the following:

  • ❌ Wrong task queue
    Confirmed via workflow history (activityTaskScheduledEventAttributes.taskQueue.name)

  • ❌ Namespace mismatch

  • ❌ Worker not polling
    temporal task-queue describe shows active pollers with recent LastAccessTime

  • ❌ Task queue backlog
    ApproximateBacklogCount = 0 while the activity is pending

  • ❌ Worker resource saturation
    CPU ~10%, memory stable, no spikes

  • ❌ Activity execution slowness
    The delay occurs before ACTIVITY_TASK_STARTED

  • ❌ Retry backoff
    The activity has not timed out or retried during the delay window

  • ❌ Build ID / Worker Versioning mismatch
    Task queue shows UNVERSIONED pollers only


Worker Configuration (Go SDK)

  • MaxConcurrentActivityExecutionSize: 500

  • MaxConcurrentActivityTaskPollers: 10

  • MaxConcurrentWorkflowTaskExecutionSize: 300

  • MaxConcurrentWorkflowTaskPollers: 10

  • TaskQueueActivitiesPerSecond: default (not overridden)

Multiple worker pods are running and polling the same task queue.
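For completeness, the relevant part of our worker wiring looks roughly like this (a sketch, not the full code; everything other than the option names and the queue name is a placeholder):

```go
import (
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// runWorker sketches the worker setup described above. Registration
// calls are elided; identifiers other than the worker.Options fields
// and the task queue name are placeholders.
func runWorker(c client.Client) error {
	w := worker.New(c, "raw-ingest-queue", worker.Options{
		MaxConcurrentActivityExecutionSize:     500,
		MaxConcurrentActivityTaskPollers:       10,
		MaxConcurrentWorkflowTaskExecutionSize: 300,
		MaxConcurrentWorkflowTaskPollers:       10,
		// TaskQueueActivitiesPerSecond left at its default (not overridden).
	})
	// w.RegisterWorkflow(...) / w.RegisterActivity(...) elided.
	return w.Run(worker.InterruptCh())
}
```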


Why This Is Confusing

From our understanding:

  • If a task is scheduled and the queue has active pollers and zero backlog,

  • And workers are healthy and polling,

  • Then ACTIVITY_TASK_STARTED should occur almost immediately.

However, in our case, tasks appear to be neither backlogged nor started for minutes at a time.


Questions:

  1. Are there known scenarios where activities can experience long Schedule-to-Start latency even when:

    • backlog is zero

    • pollers are active

    • workers are healthy?

  2. Are there additional server-side or SDK-level throttles or limits (besides the documented concurrency options) that could cause this behavior?

  3. What additional signals or metrics (on the server or worker side) would you recommend collecting to pinpoint why a scheduled activity is not dispatched to a polling worker?

  4. Is this behavior consistent with poll RPC backoff, long-poll interruptions, or matching behavior, and if so, what is the recommended way to detect or confirm it?
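On question 3: one signal we have not wired up yet is the SDK's own Schedule-to-Start latency metric. In case it helps, our understanding is that metrics can be exposed roughly like this (a sketch assuming the Tally/Prometheus contrib package; the function name and variables are placeholders, and the ignored second return of NewRootScope is the scope's io.Closer):

```go
import (
	"time"

	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newMetricsClient dials Temporal with a Prometheus-backed metrics
// handler, so SDK metrics such as the activity schedule-to-start
// latency histogram become scrapeable from the worker.
func newMetricsClient(hostPort string) (client.Client, error) {
	reporter := prometheus.NewReporter(prometheus.Options{})
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter: reporter,
		Separator:      prometheus.DefaultSeparator,
	}, time.Second)
	return client.Dial(client.Options{
		HostPort:       hostPort,
		MetricsHandler: sdktally.NewMetricsHandler(scope),
	})
}
```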


Versions

  • Temporal Server: 2.36.1

  • Temporal Go SDK: go.temporal.io/sdk v1.33.0, go.temporal.io/api v1.44.1

  • Deployment: Kubernetes, via Helm (self-hosted)

  • Number of worker pods: 2

Some Background:

  1. File metadata goes into a DB (acts as a queue)
  2. A Temporal cron workflow spins up every few minutes and picks them up (this cron runs on a task queue named task-queue)
  3. This cron spins up one child workflow per file on the task queue raw-ingest-queue (there is a cap on this number; we are trying to bump it up and frequently run into the issue described above when we do)
    1. This child WF further has a bunch of activities and another child workflow.
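For clarity, the dispatcher/child structure looks roughly like this (a sketch only; the workflow and activity names and the cap value are placeholders, not our real identifiers):

```go
import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// DispatcherWorkflow sketches the cron workflow described above: it
// lists pending files (placeholder activity name) and fans out one
// child ingestion workflow per file onto raw-ingest-queue, up to a cap.
func DispatcherWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{StartToCloseTimeout: time.Minute}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var files []string
	if err := workflow.ExecuteActivity(ctx, "ListPendingFiles").Get(ctx, &files); err != nil {
		return err
	}

	const maxPerRun = 50 // illustrative cap; this is the number we are trying to raise
	var futures []workflow.ChildWorkflowFuture
	for i, f := range files {
		if i >= maxPerRun {
			break
		}
		cctx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
			TaskQueue: "raw-ingest-queue",
		})
		futures = append(futures, workflow.ExecuteChildWorkflow(cctx, "IngestFileWorkflow", f))
	}
	for _, fut := range futures {
		if err := fut.Get(ctx, nil); err != nil {
			return err // simplified; real code would collect errors
		}
	}
	return nil
}
```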

After further investigation on the Temporal server side, we found repeated errors in the History service that correlate with the periods where activities remain scheduled but are not started.

Specifically, the temporal-history pods frequently log persistence timeouts while reading workflow history:

operation: ReadHistoryBranch
error-type: context.DeadlineExceeded
error: context deadline exceeded

Example stack trace excerpt (abridged):

Operation failed with internal error.
operation: ReadHistoryBranch
...
persistence.ReadFullPageEvents
history/api.GetHistory
history/api/recordworkflowtaskstarted
historyEngineImpl.RecordWorkflowTaskStarted

These errors indicate that the History service is timing out while reading workflow history from persistence (Postgres). This happens on the critical path for progressing workflow tasks (e.g., when handling RecordWorkflowTaskStarted), and could plausibly stall workflow execution and downstream activity dispatch.

In addition, the Matching service shows intermittent errors such as:

sticky worker unavailable, please use original task queue
grpc_code: "Unavailable"

and persistence-related failures like:

Persistent store operation failure
store-operation: update-task-queue
error: context canceled

Frontend logs, by contrast, do not show corresponding errors during the same windows.

At the same time:

  • Activity task queues show active pollers and near-zero backlog
  • Workers are healthy and polling
  • Activities eventually do start once the system recovers

Taken together, this suggests the issue may be related to server-side persistence latency / timeouts (especially in History) rather than worker polling or task queue configuration.

If there are known scenarios where History persistence timeouts can manifest as intermittent Schedule→Start delays (without obvious backlog growth), or recommendations on sizing/tuning persistence for this case, guidance would be appreciated.

Screenshot of the Temporal DB usage: