ListWorkflowExecutions hangs/timeouts for closed or mixed workflow queries, while ExecutionStatus="Running" is fast and Postgres query executes in <1ms

Hi Temporal team,

We are debugging very slow ListWorkflowExecutions calls on a self-hosted Temporal cluster using Postgres advanced visibility.

At first this looked like a Postgres/index issue, but after isolating the behavior, it seems more specific: listing running workflows is fast, but listing closed/mixed workflows hangs until the client deadline is hit, even though the corresponding visibility SQL query executes in under 1ms in Postgres.

Environment

  • Temporal Server: 1.29.1

  • Docker image: temporalio/server:1.29.1

  • Temporal UI: 2.34.0

  • Visibility store: Postgres 16, advanced visibility

  • Persistence store: Postgres 16

  • History shards: now 128

  • Namespace: production

  • Retention: 365 days

  • Custom search attributes:

    • WorkflowKind

    • EntityType

    • EntityId

  • Workload:

    • Poller-style workloads

    • workflows started every few seconds

    • short-lived workflows, mostly completed quickly

Symptom

Unfiltered workflow list calls fail from the local Temporal host, even when bypassing UI/reverse proxy and using 127.0.0.1:7233.

temporal workflow list \
  --address 127.0.0.1:7233 \
  --namespace production \
  --limit 30

Result:

failed listing workflows: stream terminated by RST_STREAM with error code: CANCEL

Increasing CLI command timeout did not help:

temporal \
  --command-timeout 60s \
  workflow list \
  --address 127.0.0.1:7233 \
  --namespace production \
  --limit 30

Still fails with:

failed listing workflows: stream terminated by RST_STREAM with error code: CANCEL

Sometimes the error is:

context deadline exceeded

Server logs show slow visibility/list requests around the same time:

"List query exceeded threshold","duration":10.204882112,"visibility-query":"","namespace":"production"
"Slow gRPC call","duration":10.204937273,"method":"/temporal.api.workflowservice.v1.WorkflowService/ListWorkflowExecutions"

These slow list warnings repeat. Example log excerpt also shows matching/history noise around the same time, such as Workflow task already started, but those do not appear to directly explain the visibility list stall.

Important finding: running-only query is fast

This query returns immediately:

temporal workflow list \
  --address 127.0.0.1:7233 \
  --namespace production \
  --limit 30 \
  --query 'ExecutionStatus="Running"'

Example output:

Status   WorkflowId                                          Type                       StartTime
Running  inventory-7648                                      inventoryWorkflow          2 seconds ago
Running  broadcast-poller-stubhub-workflow-2026-05-20T...    broadcastPollerWorkflow    8 seconds ago
Running  order-poller-stubhub-workflow-2026-05-20T...        orderPollerWorkflow        1 minute ago

But these fail/hang:

temporal workflow list \
  --address 127.0.0.1:7233 \
  --namespace production \
  --limit 30 \
  --query 'ExecutionStatus!="Running"'

Result:

failed listing workflows: stream terminated by RST_STREAM with error code: CANCEL

And:

temporal workflow list \
  --address 127.0.0.1:7233 \
  --namespace production \
  --limit 30 \
  --query 'StartTime > "2026-05-20T00:00:00Z"'

Result:

failed listing workflows: context deadline exceeded

So the behavior seems to be:

ExecutionStatus="Running"        fast
ExecutionStatus!="Running"       hangs / cancels
StartTime > today                hangs / deadline exceeded
empty/default list               hangs / cancels

This suggests the issue is not generic visibility, not the CLI, and not the UI. It seems specific to closed/mixed workflow visibility listing.

Postgres observations during the hang

While the request was hanging, pg_stat_activity showed the visibility query had already reached Postgres and the backend was idle ClientRead, not actively executing.

Relevant query seen in pg_stat_activity:

SELECT namespace_id, run_id, workflow_type_name, workflow_id, start_time, execution_time,
       status, close_time, history_length, history_size_bytes, execution_duration,
       state_transition_count, memo, encoding, task_queue, search_attributes,
       parent_workflow_id, parent_run_id, root_workflow_id, root_run_id, _version
FROM executions_visibility
WHERE namespace_id = $1 AND TemporalNamespaceDivision is null
ORDER BY coalesce(close_time, '9999-12-31 23:59:59') DESC, start_time DESC, run_id
LIMIT $2

Also observed near the same time:

SELECT status, COUNT(*)
FROM executions_visibility
WHERE (namespace_id = $1) AND TemporalNamespaceDivision is null
GROUP BY status

The pg_stat_activity output showed these as idle/client-read rather than actively running.

Exact Postgres query is fast

We then ran the equivalent query manually with the real namespace id.

Namespace id in visibility table:

52d6e22a-fa3f-48ae-aeeb-85b2714664c4

The table uses character(64), so the actual stored value is padded.

Manual EXPLAIN ANALYZE:

EXPLAIN (ANALYZE, BUFFERS)
SELECT namespace_id, run_id, workflow_type_name, workflow_id, start_time, execution_time,
       status, close_time, history_length, memo, encoding, task_queue, search_attributes
FROM executions_visibility
WHERE namespace_id = '52d6e22a-fa3f-48ae-aeeb-85b2714664c4'
  AND temporalnamespacedivision IS NULL
ORDER BY coalesce(close_time, '9999-12-31 23:59:59'::timestamp) DESC,
         start_time DESC,
         run_id
LIMIT 31;

Result:

Limit  (cost=0.27..17.92 rows=31 width=532) (actual time=0.103..0.415 rows=31 loops=1)
  Buffers: shared hit=72 dirtied=1
  ->  Index Scan using default_idx on executions_visibility
        (cost=0.27..140.33 rows=246 width=532)
        (actual time=0.102..0.402 rows=31 loops=1)
        Index Cond: (namespace_id = '52d6e22a-fa3f-48ae-aeeb-85b2714664c4'::bpchar)
        Filter: (temporalnamespacedivision IS NULL)
        Rows Removed by Filter: 3
        Buffers: shared hit=72 dirtied=1
Planning Time: 1.400 ms
Execution Time: 0.558 ms

So the exact SQL query appears to execute in ~0.5ms.

We also tried adding an expression index matching the default ordering:

CREATE INDEX CONCURRENTLY IF NOT EXISTS executions_visibility_default_order_idx
ON executions_visibility (
  namespace_id,
  temporalnamespacedivision,
  (coalesce(close_time, '9999-12-31 23:59:59'::timestamp)) DESC,
  start_time DESC,
  run_id
);

ANALYZE executions_visibility;

But the planner still uses default_idx, which is already very fast for this dataset.

Current visibility table size is tiny during this test:

production visibility rows: ~233
total executions_visibility rows: ~240

So this does not appear to be caused by a large visibility table.

What we ruled out

Suspect Evidence Current verdict
Temporal UI Local CLI against 127.0.0.1:7233 also fails Not UI
Reverse proxy / public network Same failure locally Not proxy
CLI command timeout --command-timeout 60s still fails Not enough / wrong timeout
Postgres query execution Exact SQL is ~0.5ms Not raw SQL execution
Missing default visibility index default_idx is used and fast Unlikely
Unfiltered query only ExecutionStatus!="Running" and StartTime > today also fail Not only empty query
General visibility failure ExecutionStatus="Running" returns immediately Not general visibility

Current hypothesis

The failure seems specific to closed or mixed workflow visibility listing.

Possibilities we are considering:

  1. Temporal server is blocked after SQL execution while decoding/scanning closed workflow visibility rows.

  2. Some closed workflow row has problematic memo or search_attributes.

  3. The SQL visibility path for closed/mixed queries has a bug/regression in Temporal Server 1.29.1.

  4. The server-side visibility manager timing includes work beyond SQL execution, and that part is hanging.

  5. A rate limiter, semaphore, or persistence wrapper is blocking only on closed/mixed query paths.

  6. The frontend gRPC deadline/internal deadline remains around 10s even when CLI command timeout is increased.