Hi,
I’ve deleted a namespace from our self-hosted cluster using the Temporal CLI. The namespace contained a lot of open executions for which no workers were listening.
After deleting the namespace, a new namespace marked as deleted was created, and a temporal-system workflow with ID temporal-sys-reclaim-namespace-resources-workflow/namespace-name-deleted-56035
ran for ~8 hours terminating the open executions. Unfortunately, that workflow is still running 8 days later: it is endlessly retrying a single child workflow run, which fails with this error:
operation ListClosedWorkflowExecutions encountered Operation timed out - received only 0 responses.
Cassandra reports timeouts on this query:
INFO [Native-Transport-Requests-1] 2024-03-06 12:01:29,233 NoSpamLogger.java:105 - "Operation timed out - received only 0 responses." while executing SELECT encoding, execution_time, history_length, memo, start_time, status, task_queue, workflow_id, workflow_type_name FROM temporal_visibility_production.closed_executions WHERE namespace_id = 56035528-643f-4a17-b3e4-47fb5a65a331 AND namespace_partition = 0 AND close_time < 2024-01-22T20:57:48.693Z AND close_time >= 1970-01-01T00:00:00.000Z AND run_id > 98b943cb-9256-4074-929a-db21302c44b5 LIMIT 1000 ALLOW FILTERING
I don’t see any errors in our Cassandra logs; the only other relevant bit of logging I can find is this warning:
WARN [CompactionExecutor:5009] 2024-03-06 08:19:18,894 BigTableWriter.java:274 - Writing 965316 tombstones to temporal_visibility_production/open_executions:56035528-643f-4a17-b3e4-47fb5a65a331:0 in sstable /bitnami/cassandra/data/data/temporal_visibility_production/open_executions-77a307e006e311eeb01af334a97e1db2/nb-25060-big-Data.db
Interestingly enough, when I change the close_time constraint in that query to 2024-01-22T22:57:48.693Z (two hours later), I get a lot of results back fairly quickly.
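For reference, this is the variant that comes back quickly when run directly in cqlsh (same query as in the log above, with only the upper close_time bound moved and the timestamps quoted):

SELECT encoding, execution_time, history_length, memo, start_time, status, task_queue, workflow_id, workflow_type_name FROM temporal_visibility_production.closed_executions WHERE namespace_id = 56035528-643f-4a17-b3e4-47fb5a65a331 AND namespace_partition = 0 AND close_time < '2024-01-22T22:57:48.693Z' AND close_time >= '1970-01-01T00:00:00.000Z' AND run_id > 98b943cb-9256-4074-929a-db21302c44b5 LIMIT 1000 ALLOW FILTERING;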
We did have an incident a few months back where some workflow executions were lost but remained in the visibility table; I’m not sure if that is the cause of this particular query timeout.
We are not seeing timeouts on any other queries, latencies for the rest of the workload are OK, and the database’s resources are not saturated.
Is it safe to terminate the system workflow execution?
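If it is, I assume I’d do it with something along these lines (assuming the reclaim workflow lives in the temporal-system namespace; exact flags may differ depending on CLI version):

temporal workflow terminate \
  --namespace temporal-system \
  --workflow-id "temporal-sys-reclaim-namespace-resources-workflow/namespace-name-deleted-56035" \
  --reason "stuck retrying ListClosedWorkflowExecutions"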
Do we need to fix the system workflow being stuck?
We’d like to start migrating to Elasticsearch for advanced visibility, but we’re not sure if we can do that while the system workflow is stuck.
Any advice is appreciated.