Hi,
I’ve deleted a namespace from our self-hosted cluster using the Temporal CLI. The namespace contained a lot of open executions for which no workers were listening.
After deleting the namespace, a new namespace marked as deleted was created, and a temporal-system workflow with ID temporal-sys-reclaim-namespace-resources-workflow/namespace-name-deleted-56035
ran for ~8 hours terminating the open executions. Unfortunately, that workflow is still running 8 days later: it is endlessly retrying a single child workflow run, which fails with this error:
operation ListClosedWorkflowExecutions encountered Operation timed out - received only 0 responses.
Cassandra reports timeouts on this query:
INFO [Native-Transport-Requests-1] 2024-03-06 12:01:29,233 NoSpamLogger.java:105 - "Operation timed out - received only 0 responses." while executing SELECT encoding, execution_time, history_length, memo, start_time, status, task_queue, workflow_id, workflow_type_name FROM temporal_visibility_production.closed_executions WHERE namespace_id = 56035528-643f-4a17-b3e4-47fb5a65a331 AND namespace_partition = 0 AND close_time < 2024-01-22T20:57:48.693Z AND close_time >= 1970-01-01T00:00:00.000Z AND run_id > 98b943cb-9256-4074-929a-db21302c44b5 LIMIT 1000 ALLOW FILTERING
I don’t see any errors in our Cassandra logs; the only other relevant bit of logging I can find is this warning:
WARN [CompactionExecutor:5009] 2024-03-06 08:19:18,894 BigTableWriter.java:274 - Writing 965316 tombstones to temporal_visibility_production/open_executions:56035528-643f-4a17-b3e4-47fb5a65a331:0 in sstable /bitnami/cassandra/data/data/temporal_visibility_production/open_executions-77a307e006e311eeb01af334a97e1db2/nb-25060-big-Data.db
Interestingly enough, when I change the close_time constraint in that query to 2024-01-22T22:57:48.693Z (two hours later), I get a lot of results back fairly quickly.
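For reference, this is the variant that comes back quickly when run directly in cqlsh (same query as in the log above, with only the upper close_time bound moved and the timestamps quoted):

SELECT encoding, execution_time, history_length, memo, start_time, status, task_queue, workflow_id, workflow_type_name FROM temporal_visibility_production.closed_executions WHERE namespace_id = 56035528-643f-4a17-b3e4-47fb5a65a331 AND namespace_partition = 0 AND close_time < '2024-01-22T22:57:48.693Z' AND close_time >= '1970-01-01T00:00:00.000Z' AND run_id > 98b943cb-9256-4074-929a-db21302c44b5 LIMIT 1000 ALLOW FILTERING;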
We did have an incident a few months back where some workflow executions were lost but remained in the visibility table; I’m not sure if that is the cause of this particular query timeout.
We are not seeing timeouts on any other queries, latencies for the rest of the workload are OK, and the database’s resources are not saturated.
Is it safe to terminate the system workflow execution?
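If it is, I assume I’d do it with something along these lines (assuming the reclaim workflow lives in the temporal-system namespace; exact flags may differ depending on CLI version):

temporal workflow terminate \
  --namespace temporal-system \
  --workflow-id "temporal-sys-reclaim-namespace-resources-workflow/namespace-name-deleted-56035" \
  --reason "stuck retrying ListClosedWorkflowExecutions"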
Do we need to fix the system workflow being stuck?
We’d like to start migrating to Elasticsearch for advanced visibility, but we’re not sure if we can do that while the system workflow is stuck.
Any advice is appreciated.