Hi Temporal team,
We are debugging very slow ListWorkflowExecutions calls on a self-hosted Temporal cluster using Postgres advanced visibility.
At first this looked like a Postgres/index issue, but after isolating the behavior, it seems more specific: listing running workflows is fast, but listing closed/mixed workflows hangs until the client deadline is hit, even though the corresponding visibility SQL query executes in under 1ms in Postgres.
Environment
-
Temporal Server:
1.29.1 -
Docker image:
temporalio/server:1.29.1 -
Temporal UI:
2.34.0 -
Visibility store: Postgres 16, advanced visibility
-
Persistence store: Postgres 16
-
History shards: now 128
-
Namespace:
production -
Retention: 365 days
-
Custom search attributes:
-
WorkflowKind -
EntityType -
EntityId
-
-
Workload:
-
Poller-style workloads
-
workflows started every few seconds
-
short-lived workflows, mostly completed quickly
-
Symptom
Unfiltered workflow list calls fail from the local Temporal host, even when bypassing UI/reverse proxy and using 127.0.0.1:7233.
temporal workflow list \
--address 127.0.0.1:7233 \
--namespace production \
--limit 30
Result:
failed listing workflows: stream terminated by RST_STREAM with error code: CANCEL
Increasing CLI command timeout did not help:
temporal \
--command-timeout 60s \
workflow list \
--address 127.0.0.1:7233 \
--namespace production \
--limit 30
Still fails with:
failed listing workflows: stream terminated by RST_STREAM with error code: CANCEL
Sometimes the error is:
context deadline exceeded
Server logs show slow visibility/list requests around the same time:
"List query exceeded threshold","duration":10.204882112,"visibility-query":"","namespace":"production"
"Slow gRPC call","duration":10.204937273,"method":"/temporal.api.workflowservice.v1.WorkflowService/ListWorkflowExecutions"
These slow list warnings repeat. Example log excerpt also shows matching/history noise around the same time, such as Workflow task already started, but those do not appear to directly explain the visibility list stall.
Important finding: running-only query is fast
This query returns immediately:
temporal workflow list \
--address 127.0.0.1:7233 \
--namespace production \
--limit 30 \
--query 'ExecutionStatus="Running"'
Example output:
Status WorkflowId Type StartTime
Running inventory-7648 inventoryWorkflow 2 seconds ago
Running broadcast-poller-stubhub-workflow-2026-05-20T... broadcastPollerWorkflow 8 seconds ago
Running order-poller-stubhub-workflow-2026-05-20T... orderPollerWorkflow 1 minute ago
But these fail/hang:
temporal workflow list \
--address 127.0.0.1:7233 \
--namespace production \
--limit 30 \
--query 'ExecutionStatus!="Running"'
Result:
failed listing workflows: stream terminated by RST_STREAM with error code: CANCEL
And:
temporal workflow list \
--address 127.0.0.1:7233 \
--namespace production \
--limit 30 \
--query 'StartTime > "2026-05-20T00:00:00Z"'
Result:
failed listing workflows: context deadline exceeded
So the behavior seems to be:
ExecutionStatus="Running" fast
ExecutionStatus!="Running" hangs / cancels
StartTime > today hangs / deadline exceeded
empty/default list hangs / cancels
This suggests the issue is not generic visibility, not the CLI, and not the UI. It seems specific to closed/mixed workflow visibility listing.
Postgres observations during the hang
While the request was hanging, pg_stat_activity showed the visibility query had already reached Postgres and the backend was idle ClientRead, not actively executing.
Relevant query seen in pg_stat_activity:
SELECT namespace_id, run_id, workflow_type_name, workflow_id, start_time, execution_time,
status, close_time, history_length, history_size_bytes, execution_duration,
state_transition_count, memo, encoding, task_queue, search_attributes,
parent_workflow_id, parent_run_id, root_workflow_id, root_run_id, _version
FROM executions_visibility
WHERE namespace_id = $1 AND TemporalNamespaceDivision is null
ORDER BY coalesce(close_time, '9999-12-31 23:59:59') DESC, start_time DESC, run_id
LIMIT $2
Also observed near the same time:
SELECT status, COUNT(*)
FROM executions_visibility
WHERE (namespace_id = $1) AND TemporalNamespaceDivision is null
GROUP BY status
The pg_stat_activity output showed these as idle/client-read rather than actively running.
Exact Postgres query is fast
We then ran the equivalent query manually with the real namespace id.
Namespace id in visibility table:
52d6e22a-fa3f-48ae-aeeb-85b2714664c4
The table uses character(64), so the actual stored value is padded.
Manual EXPLAIN ANALYZE:
EXPLAIN (ANALYZE, BUFFERS)
SELECT namespace_id, run_id, workflow_type_name, workflow_id, start_time, execution_time,
status, close_time, history_length, memo, encoding, task_queue, search_attributes
FROM executions_visibility
WHERE namespace_id = '52d6e22a-fa3f-48ae-aeeb-85b2714664c4'
AND temporalnamespacedivision IS NULL
ORDER BY coalesce(close_time, '9999-12-31 23:59:59'::timestamp) DESC,
start_time DESC,
run_id
LIMIT 31;
Result:
Limit (cost=0.27..17.92 rows=31 width=532) (actual time=0.103..0.415 rows=31 loops=1)
Buffers: shared hit=72 dirtied=1
-> Index Scan using default_idx on executions_visibility
(cost=0.27..140.33 rows=246 width=532)
(actual time=0.102..0.402 rows=31 loops=1)
Index Cond: (namespace_id = '52d6e22a-fa3f-48ae-aeeb-85b2714664c4'::bpchar)
Filter: (temporalnamespacedivision IS NULL)
Rows Removed by Filter: 3
Buffers: shared hit=72 dirtied=1
Planning Time: 1.400 ms
Execution Time: 0.558 ms
So the exact SQL query appears to execute in ~0.5ms.
We also tried adding an expression index matching the default ordering:
CREATE INDEX CONCURRENTLY IF NOT EXISTS executions_visibility_default_order_idx
ON executions_visibility (
namespace_id,
temporalnamespacedivision,
(coalesce(close_time, '9999-12-31 23:59:59'::timestamp)) DESC,
start_time DESC,
run_id
);
ANALYZE executions_visibility;
But the planner still uses default_idx, which is already very fast for this dataset.
Current visibility table size is tiny during this test:
production visibility rows: ~233
total executions_visibility rows: ~240
So this does not appear to be caused by a large visibility table.
What we ruled out
| Suspect | Evidence | Current verdict |
|---|---|---|
| Temporal UI | Local CLI against 127.0.0.1:7233 also fails |
Not UI |
| Reverse proxy / public network | Same failure locally | Not proxy |
| CLI command timeout | --command-timeout 60s still fails |
Not enough / wrong timeout |
| Postgres query execution | Exact SQL is ~0.5ms | Not raw SQL execution |
| Missing default visibility index | default_idx is used and fast |
Unlikely |
| Unfiltered query only | ExecutionStatus!="Running" and StartTime > today also fail |
Not only empty query |
| General visibility failure | ExecutionStatus="Running" returns immediately |
Not general visibility |
Current hypothesis
The failure seems specific to closed or mixed workflow visibility listing.
Possibilities we are considering:
-
Temporal server is blocked after SQL execution while decoding/scanning closed workflow visibility rows.
-
Some closed workflow row has problematic
memoorsearch_attributes. -
The SQL visibility path for closed/mixed queries has a bug/regression in Temporal Server
1.29.1. -
The server-side visibility manager timing includes work beyond SQL execution, and that part is hanging.
-
A rate limiter, semaphore, or persistence wrapper is blocking only on closed/mixed query paths.
-
The frontend gRPC deadline/internal deadline remains around 10s even when CLI command timeout is increased.