Workflow stuck in limbo state

Hi All,

We are running self-hosted, open-source Temporal 1.22.2 and have hit an issue where a workflow can neither be terminated nor replaced by a new workflow with the same workflowId.

The first symptom is that trying to load the workflow in the Temporal UI gives me errors.

[screenshot of the UI errors not available]

If I try to describe the workflow programmatically using the client, I get a WorkflowNotFoundError: operation GetWorkflowExecution encountered not found

If I try to spin up a duplicate workflow with the same ID, I get: WorkflowExecutionAlreadyStartedError: Workflow execution already started

Is there a way I can force-terminate the old workflow so that I can start the new one? I assume this is caused by a desync in the persistence layer.

We use workflowIds in the application layer to keep track of which workflows to signal or cancel, so it's imperative that we start the new workflow with the same workflowId.

Also some weird behavior with the temporal CLI:

➜  task-pipeline git:(master) ✗ temporal workflow list --query "WorkflowId='n-6oktExPAl9X2nuh5vmo8i'"
  Status         WorkflowId                Type          StartTime
  Running  n-6oktExPAl9X2nuh5vmo8i  executePipelineNode  1 day ago
➜  task-pipeline git:(master) ✗ temporal workflow terminate --workflow-id='n-xyb8tp8BdTqibWQdzNO9H'
time=2024-12-03T12:38:11.567 level=ERROR msg="failed to terminate workflow: operation GetWorkflowExecution encountered not found"
➜  task-pipeline git:(master) ✗ temporal workflow list --query "WorkflowId='n-xyb8tp8BdTqibWQdzNO9H'"
  Status         WorkflowId                Type           StartTime
  Running  n-xyb8tp8BdTqibWQdzNO9H  executePipelineNode  4 weeks ago

In the first command I list an unproblematic workflow (to verify that I am correctly connected to the cluster).

In the second command I try to terminate the problematic workflow, which fails.

In the third command I list the problematic workflow, and that command runs correctly.

Also some weird behavior with the temporal CLI

This just means that the workflow execution with workflow ID n-xyb8tp8BdTqibWQdzNO9H is still in the visibility store with status Running, but is no longer in your primary persistence (the temporal db).
There was probably an issue moving the visibility task to the visibility store. Which visibility store do you use (Elasticsearch or SQL)?
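
The mismatch described above (a record that visibility still lists but that the primary store can no longer describe) can be checked for in bulk. Here is a minimal Python sketch of that comparison. It makes no real Temporal calls: `NotFound`, `find_orphaned_visibility_records`, and the `describe` callable are hypothetical names for illustration, and you would need to wire `describe` to whatever describe call your SDK client provides.

```python
from typing import Callable, Iterable, List


class NotFound(Exception):
    """Hypothetical stand-in for the SDK's 'workflow not found' error."""


def find_orphaned_visibility_records(
    listed_ids: Iterable[str],
    describe: Callable[[str], object],
) -> List[str]:
    """Return workflow IDs that visibility lists but describe() cannot find.

    listed_ids: IDs returned by a visibility query (e.g. a list command).
    describe:   caller-supplied function that raises NotFound when the
                primary persistence has no record of the workflow.
    """
    orphaned = []
    for workflow_id in listed_ids:
        try:
            describe(workflow_id)
        except NotFound:
            # Listed in visibility, missing in primary persistence.
            orphaned.append(workflow_id)
    return orphaned


if __name__ == "__main__":
    # Fake primary store that only knows the healthy workflow.
    primary_store = {"n-6oktExPAl9X2nuh5vmo8i"}

    def fake_describe(workflow_id: str) -> dict:
        if workflow_id not in primary_store:
            raise NotFound(workflow_id)
        return {"workflow_id": workflow_id}

    listed = ["n-6oktExPAl9X2nuh5vmo8i", "n-xyb8tp8BdTqibWQdzNO9H"]
    print(find_orphaned_visibility_records(listed, fake_describe))
```

Passing the describe step in as a callable keeps the check independent of any particular SDK, so it can be tried against a fake store before pointing it at a real cluster.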

If I try to spin up a duplicate workflow with the same ID, I get: WorkflowExecutionAlreadyStartedError: Workflow execution already started

This, I think, is the problem. Can you also try this using the CLI (start a workflow execution with workflow ID n-xyb8tp8BdTqibWQdzNO9H from the CLI) and show the output, please?

Does this issue (not being able to start a workflow with an ID after describing it, while the service responds to describe with not found) happen intermittently or consistently?

We are using MySQL 8 for visibility and ScyllaDB for persistence.

This happens very rarely but consistently. I would ballpark it at one out of a million workflows.

Is there an easy way I can clear these from visibility so I can just start the workflow from scratch?

Have you tried deleting this execution with:

tctl --namespace <ns_name> adm wf del -w <wfid> -r <runid>

This should try to delete from both stores. Otherwise you'd have to remove it manually from the executions_visibility table in the temporal_visibility db.
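
If several executions need the same cleanup, the command above can be assembled per execution from a script. A minimal Python sketch, assuming a hypothetical helper `build_delete_command`: it only builds and prints the argument list using the flags shown above, and does not run anything. The namespace `default` and the `<runid>` placeholder are illustrative values to replace with your own.

```python
import shlex
from typing import List


def build_delete_command(namespace: str, workflow_id: str, run_id: str) -> List[str]:
    """Build the argv for the tctl admin delete shown in this thread.

    All three values are required, so an empty value is rejected.
    """
    for name, value in (("namespace", namespace),
                        ("workflow_id", workflow_id),
                        ("run_id", run_id)):
        if not value:
            raise ValueError(f"{name} must not be empty")
    return [
        "tctl", "--namespace", namespace,
        "adm", "wf", "del",
        "-w", workflow_id,
        "-r", run_id,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the real namespace and run ID.
    cmd = build_delete_command("default", "n-xyb8tp8BdTqibWQdzNO9H", "<runid>")
    # Print a shell-safe rendering for review before running it by hand.
    print(shlex.join(cmd))
```

Printing the command rather than executing it leaves a review step before anything is deleted; passing the list to `subprocess.run` would be the natural next step once the values are confirmed.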

This solved the issue. Thank you so much :slight_smile: