Temporal activities going into a pending state intermittently and eventually succeeding

Hey folks! Need some help with Temporal activities intermittently going into a pending state.

Context: We run a self-hosted Temporal cluster for a file ingestion use case in our system.

Problem Summary

We are observing intermittent, multi-minute delays between ACTIVITY_TASK_SCHEDULED and ACTIVITY_TASK_STARTED for certain activities, even though:

  • The correct task queue is used

  • Workers are running and polling

  • Task queue backlog is reported as zero

  • CPU and memory usage on workers are low

Eventually, the delayed activities do start and complete successfully.


Observed Behavior

  • Each file ingestion runs as a Temporal workflow

  • A dispatcher workflow periodically starts ingestion workflows

  • Most workflows execute normally

  • A subset of workflows gets stuck at a specific activity, where:

    • ACTIVITY_TASK_SCHEDULED appears in history

    • ACTIVITY_TASK_STARTED does not appear for several minutes

    • No failures are recorded during the delay

    • After a few minutes, the activity starts and proceeds normally

In the Temporal UI, this appears as an activity remaining in a “pending” state for several minutes.

Evidence from Workflow History

For an affected workflow run, the history shows (timestamps abbreviated):

... 
EVENT_TYPE_ACTIVITY_TASK_SCHEDULED   GetUploadedFileActivity   raw-ingest-queue
(no ACTIVITY_TASK_STARTED for several minutes)

Earlier activities in the same workflow run start within milliseconds of being scheduled, on the same task queue, using the same workers.

This suggests the issue is not workflow-wide or queue-wide, but intermittent and activity-specific.


What We Have Verified / Ruled Out

We have explicitly ruled out the following:

  • ❌ Wrong task queue
    Confirmed via workflow history (activityTaskScheduledEventAttributes.taskQueue.name)

  • ❌ Namespace mismatch

  • ❌ Worker not polling
    temporal task-queue describe shows active pollers with recent LastAccessTime

  • ❌ Task queue backlog
    ApproximateBacklogCount = 0 while the activity is pending

  • ❌ Worker resource saturation
    CPU ~10%, memory stable, no spikes

  • ❌ Activity execution slowness
    The delay occurs before ACTIVITY_TASK_STARTED

  • ❌ Retry backoff
    The activity has not timed out or retried during the delay window

  • ❌ Build ID / Worker Versioning mismatch
    Task queue shows UNVERSIONED pollers only


Worker Configuration (Go SDK)

  • MaxConcurrentActivityExecutionSize: 500

  • MaxConcurrentActivityTaskPollers: 10

  • MaxConcurrentWorkflowTaskExecutionSize: 300

  • MaxConcurrentWorkflowTaskPollers: 10

  • TaskQueueActivitiesPerSecond: default (not overridden)

Multiple worker pods are running and polling the same task queue.
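For completeness, the relevant part of our worker wiring looks roughly like this (a sketch, not the full code; everything other than the option names and the queue name is a placeholder):

```go
import (
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// runWorker sketches the worker setup described above. Registration
// calls are elided; identifiers other than the worker.Options fields
// and the task queue name are placeholders.
func runWorker(c client.Client) error {
	w := worker.New(c, "raw-ingest-queue", worker.Options{
		MaxConcurrentActivityExecutionSize:     500,
		MaxConcurrentActivityTaskPollers:       10,
		MaxConcurrentWorkflowTaskExecutionSize: 300,
		MaxConcurrentWorkflowTaskPollers:       10,
		// TaskQueueActivitiesPerSecond left at its default (not overridden).
	})
	// w.RegisterWorkflow(...) / w.RegisterActivity(...) elided.
	return w.Run(worker.InterruptCh())
}
```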


Why This Is Confusing

From our understanding:

  • If a task is scheduled and the queue has active pollers and zero backlog,

  • And workers are healthy and polling,

  • Then ACTIVITY_TASK_STARTED should occur almost immediately.

However, in our case, tasks appear to be neither backlogged nor started for minutes at a time.


Questions:

  1. Are there known scenarios where activities can experience long Schedule-to-Start latency even when:

    • backlog is zero

    • pollers are active

    • workers are healthy?

  2. Are there additional server-side or SDK-level throttles or limits (besides the documented concurrency options) that could cause this behavior?

  3. What additional signals or metrics (on the server or worker side) would you recommend collecting to pinpoint why a scheduled activity is not dispatched to a polling worker?

  4. Is this behavior consistent with poll RPC backoff, long-poll interruptions, or matching behavior, and if so, what is the recommended way to detect or confirm it?
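On question 3: one signal we have not wired up yet is the SDK's own Schedule-to-Start latency metric. In case it helps, our understanding is that metrics can be exposed roughly like this (a sketch assuming the Tally/Prometheus contrib package; the function name and variables are placeholders, and the ignored second return of NewRootScope is the scope's io.Closer):

```go
import (
	"time"

	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newMetricsClient dials Temporal with a Prometheus-backed metrics
// handler, so SDK metrics such as the activity schedule-to-start
// latency histogram become scrapeable from the worker.
func newMetricsClient(hostPort string) (client.Client, error) {
	reporter := prometheus.NewReporter(prometheus.Options{})
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter: reporter,
		Separator:      prometheus.DefaultSeparator,
	}, time.Second)
	return client.Dial(client.Options{
		HostPort:       hostPort,
		MetricsHandler: sdktally.NewMetricsHandler(scope),
	})
}
```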


Versions

  • Temporal Server: 2.36.1

  • Temporal Go SDK: go.temporal.io/sdk v1.33.0, go.temporal.io/api v1.44.1

  • Deployment: Kubernetes, via Helm (self-hosted)

  • Number of worker pods: 2

Some Background:

  1. File metadata goes into a DB (acts as a queue)
  2. A Temporal cron workflow spins up every few minutes and picks them up (this cron runs on a task queue named task-queue)
  3. This cron spins up one child workflow per file on the task queue raw-ingest-queue (there is a cap on this number; we are trying to bump it up and frequently run into the issue described above when we do)
    1. This child WF further has a bunch of activities and another child workflow.
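For clarity, the dispatcher/child structure looks roughly like this (a sketch only; the workflow and activity names and the cap value are placeholders, not our real identifiers):

```go
import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// DispatcherWorkflow sketches the cron workflow described above: it
// lists pending files (placeholder activity name) and fans out one
// child ingestion workflow per file onto raw-ingest-queue, up to a cap.
func DispatcherWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{StartToCloseTimeout: time.Minute}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var files []string
	if err := workflow.ExecuteActivity(ctx, "ListPendingFiles").Get(ctx, &files); err != nil {
		return err
	}

	const maxPerRun = 50 // illustrative cap; this is the number we are trying to raise
	var futures []workflow.ChildWorkflowFuture
	for i, f := range files {
		if i >= maxPerRun {
			break
		}
		cctx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
			TaskQueue: "raw-ingest-queue",
		})
		futures = append(futures, workflow.ExecuteChildWorkflow(cctx, "IngestFileWorkflow", f))
	}
	for _, fut := range futures {
		if err := fut.Get(ctx, nil); err != nil {
			return err // simplified; real code would collect errors
		}
	}
	return nil
}
```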

After further investigation on the Temporal server side, we found repeated errors in the History service that correlate with the periods where activities remain scheduled but are not started.

Specifically, the temporal-history pods frequently log persistence timeouts while reading workflow history:

operation: ReadHistoryBranch
error-type: context.DeadlineExceeded
error: context deadline exceeded

Example stack trace excerpt (abridged):

Operation failed with internal error.
operation: ReadHistoryBranch
...
persistence.ReadFullPageEvents
history/api.GetHistory
history/api/recordworkflowtaskstarted
historyEngineImpl.RecordWorkflowTaskStarted

These errors indicate that the History service is timing out while reading workflow history from persistence (Postgres). This happens on the critical path for progressing workflow tasks (e.g., when handling RecordWorkflowTaskStarted), and could plausibly stall workflow execution and downstream activity dispatch.

In addition, the Matching service shows intermittent errors such as:

sticky worker unavailable, please use original task queue
grpc_code: "Unavailable"

and persistence-related failures like:

Persistent store operation failure
store-operation: update-task-queue
error: context canceled

Frontend logs, by contrast, do not show corresponding errors during the same windows.

At the same time:

  • Activity task queues show active pollers and near-zero backlog
  • Workers are healthy and polling
  • Activities eventually do start once the system recovers

Taken together, this suggests the issue may be related to server-side persistence latency / timeouts (especially in History) rather than worker polling or task queue configuration.

If there are known scenarios where History persistence timeouts can manifest as intermittent Schedule→Start delays (without obvious backlog growth), or recommendations on sizing/tuning persistence for this case, guidance would be appreciated.

Screenshot of the Temporal DB usage: