Workflow stuck in limbo state

jwang97 · December 3, 2024, 5:18pm

Hi All,

We are using self-hosted open source Temporal 1.22.2 and are running into an issue where a workflow can neither be terminated nor replaced by a new workflow with the same workflowId.

The first symptom of this is that trying to load the workflow in the Temporal UI gives me errors.

If I try to describe the workflow programmatically using the client, I get a WorkflowNotFoundError: operation GetWorkflowExecution encountered not found

If I try to spin up a duplicate workflow with the same ID, I get: WorkflowExecutionAlreadyStartedError: Workflow execution already started

Is there a way I can force-terminate the old workflow so that I can start the new one? I assume this is caused by a de-sync in the persistence layer.

We use workflowIds in the application layer to keep track of what workflows to signal/cancel so it’s imperative that we spin up with the same workflowId.

jwang97 · December 3, 2024, 5:40pm

Also some weird behavior with the temporal CLI:

➜  task-pipeline git:(master) ✗ temporal workflow list --query "WorkflowId='n-6oktExPAl9X2nuh5vmo8i'"
  Status         WorkflowId                Type          StartTime
  Running  n-6oktExPAl9X2nuh5vmo8i  executePipelineNode  1 day ago
➜  task-pipeline git:(master) ✗ temporal workflow terminate --workflow-id='n-xyb8tp8BdTqibWQdzNO9H'
time=2024-12-03T12:38:11.567 level=ERROR msg="failed to terminate workflow: operation GetWorkflowExecution encountered not found"
➜  task-pipeline git:(master) ✗ temporal workflow list --query "WorkflowId='n-xyb8tp8BdTqibWQdzNO9H'"
  Status         WorkflowId                Type           StartTime
  Running  n-xyb8tp8BdTqibWQdzNO9H  executePipelineNode  4 weeks ago

In the first command I list an unproblematic workflow (to verify that I am correctly connected to the cluster).

In the second command I try to terminate the problematic workflow, which fails.

In the third command I list the problematic workflow, which succeeds and shows it as Running.

tihomir · December 4, 2024, 1:35pm

Also some weird behavior with the temporal CLI

This just means the workflow execution with workflow ID n-xyb8tp8BdTqibWQdzNO9H is still in the visibility store with status Running, but is no longer in your primary persistence (temporal db).
There was probably an issue with moving the visibility task to the visibility store. Which visibility store do you use (ES or SQL)?
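
The reasoning above can be sketched as a small check. This is only an illustration in plain Python with no Temporal client involved: `Probe` and `diagnose` are hypothetical names, and the two inputs stand for what `temporal workflow list` and a describe/terminate call reported for the same workflow ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    # Status shown by `temporal workflow list`, or None if the ID is not listed.
    visibility_status: Optional[str]
    # False when describe/terminate answers "GetWorkflowExecution encountered not found".
    describe_found: bool


def diagnose(probe: Probe) -> str:
    """Classify the combination of visibility and primary-persistence results."""
    if probe.visibility_status is not None and not probe.describe_found:
        # Listed in visibility but missing from primary persistence:
        # the situation described in this thread.
        return "stale visibility record"
    if probe.visibility_status is None and not probe.describe_found:
        return "not found in either store"
    if probe.visibility_status is None:
        # Present in primary persistence only; not the case discussed here.
        return "missing from visibility"
    return "consistent"


if __name__ == "__main__":
    # The problematic workflow: listed as Running, but describe says not found.
    print(diagnose(Probe(visibility_status="Running", describe_found=False)))
```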

If i try to spin up a duplicate workflow on the same Id i get: WorkflowExecutionAlreadyStartedError: Workflow execution already started

This I think is the problem. Can you try this using the CLI as well (try starting a workflow execution with workflow ID n-xyb8tp8BdTqibWQdzNO9H from the CLI) and show the output, please?

Does this issue (not being able to start a workflow with the ID after describing it, while the service responds to describe with not found) happen intermittently or consistently?

jwang97 · December 5, 2024, 8:13pm

We are using MySQL 8 for visibility and ScyllaDB for persistence.

This happens very rarely but consistently. One out of a million workflows, I would ballpark.

Is there an easy way I can clear these from visibility so I can just spin up the workflow from scratch?

tihomir · December 5, 2024, 8:56pm

Have you tried deleting this exec with

tctl --namespace <ns_name> adm wf del -w <wfid> -r <runid>

This should try to delete from both stores. Otherwise you’d have to remove it manually from the temporal_visibility db, executions_visibility table.
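
A rough sketch of both options, in case it helps with scripting the cleanup. The tctl arguments are copied from the command above. The SQL part is an assumption: the column names `namespace_id`, `workflow_id` and `run_id` are what I would expect in `executions_visibility`, and `namespace_id` is assumed to be the namespace UUID rather than its name, so check your schema before running anything. The helper names are hypothetical, and nothing here executes a command or touches a database.

```python
from typing import List, Tuple


def tctl_delete_argv(namespace: str, workflow_id: str, run_id: str) -> List[str]:
    """Build the argument list for the tctl delete command quoted above."""
    return [
        "tctl", "--namespace", namespace,
        "adm", "wf", "del",
        "-w", workflow_id,
        "-r", run_id,
    ]


def visibility_delete_statement(
    namespace_id: str, workflow_id: str, run_id: str
) -> Tuple[str, Tuple[str, str, str]]:
    """Build a parameterized DELETE for the manual fallback.

    Column names are assumptions; verify them against your visibility schema.
    %s is the placeholder style used by common MySQL drivers for Python.
    """
    sql = (
        "DELETE FROM executions_visibility "
        "WHERE namespace_id = %s AND workflow_id = %s AND run_id = %s"
    )
    return sql, (namespace_id, workflow_id, run_id)


if __name__ == "__main__":
    # Print the command with the same placeholders used in the post above.
    print(" ".join(tctl_delete_argv("<ns_name>", "<wfid>", "<runid>")))
```

The argument list can be handed to `subprocess.run` as-is, which avoids shell quoting issues with unusual workflow IDs. Keeping the SQL parameterized means the IDs are never pasted into the query string.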

jwang97 · December 6, 2024, 8:51pm

This solved the issue. Thank you so much

Phuong_Nguyen · March 6, 2025, 2:00am

Hi @jwang97, re. "this happens very rarely but consistently. One out of a million workflows i would ballpark."

Do you know which conditions lead to this?
