Hi All,
We are using self-hosted open-source Temporal 1.22.2 and are running into an issue where a workflow can neither be terminated nor replaced by a new workflow with the same workflowId.
The first symptom is that trying to load the workflow in the Temporal UI gives me errors.
If I try to describe the workflow programmatically using the client, I get a WorkflowNotFoundError: operation GetWorkflowExecution encountered not found
If I try to spin up a duplicate workflow with the same ID, I get: WorkflowExecutionAlreadyStartedError: Workflow execution already started
Is there a way I can force-terminate the old workflow so that I can start the new one? I assume this is caused by a de-sync in the persistence layer.
We use workflowIds in the application layer to keep track of what workflows to signal/cancel so it’s imperative that we spin up with the same workflowId.
Also some weird behavior with the temporal CLI:
➜ task-pipeline git:(master) ✗ temporal workflow list --query "WorkflowId='n-6oktExPAl9X2nuh5vmo8i'"
Status WorkflowId Type StartTime
Running n-6oktExPAl9X2nuh5vmo8i executePipelineNode 1 day ago
➜ task-pipeline git:(master) ✗ temporal workflow terminate --workflow-id='n-xyb8tp8BdTqibWQdzNO9H'
time=2024-12-03T12:38:11.567 level=ERROR msg="failed to terminate workflow: operation GetWorkflowExecution encountered not found"
➜ task-pipeline git:(master) ✗ temporal workflow list --query "WorkflowId='n-xyb8tp8BdTqibWQdzNO9H'"
Status WorkflowId Type StartTime
Running n-xyb8tp8BdTqibWQdzNO9H executePipelineNode 4 weeks ago
In the first command, I list an unproblematic workflow (to verify that I am correctly connected to the cluster).
In the second command, I try to terminate the problematic workflow, which fails.
In the third command, I list the problematic workflow, which succeeds and shows it as Running.
> Also some weird behavior with the temporal CLI

This just means the workflow execution with workflow ID n-xyb8tp8BdTqibWQdzNO9H is still in the visibility store with status Running, but is no longer in your primary persistence (the temporal DB).
There was probably an issue moving the visibility task to the visibility store. Which visibility store are you using (ES or SQL)?
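The reasoning in this reply can be sketched as a small decision function: compare what the visibility store reports (the `list` query) with what primary persistence reports (describe or terminate). This is only an illustrative sketch; `Probe`, `Diagnosis`, and `diagnose` are hypothetical names, not part of any Temporal SDK, and the two booleans are assumed to be filled in by hand from the CLI or client output.

```python
from dataclasses import dataclass
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "present in both stores"
    STALE_VISIBILITY = "visibility record without a matching execution in primary persistence"
    NOT_IN_VISIBILITY = "execution exists but is missing from visibility"
    UNKNOWN_WORKFLOW = "not present in either store"


@dataclass(frozen=True)
class Probe:
    # True if `temporal workflow list --query "WorkflowId='...'"` returned a row.
    listed_in_visibility: bool
    # True if describe/terminate reached the execution, i.e. did NOT fail with
    # "operation GetWorkflowExecution encountered not found".
    found_in_persistence: bool


def diagnose(probe: Probe) -> Diagnosis:
    """Classify a workflow ID from the two observations above."""
    if probe.listed_in_visibility and probe.found_in_persistence:
        return Diagnosis.HEALTHY
    if probe.listed_in_visibility:
        # The case in this thread: listed as Running, but describe says not found.
        return Diagnosis.STALE_VISIBILITY
    if probe.found_in_persistence:
        return Diagnosis.NOT_IN_VISIBILITY
    return Diagnosis.UNKNOWN_WORKFLOW
```

For the problematic workflow above, the list query returns a Running row while terminate fails with not found, so `diagnose(Probe(True, False))` yields `Diagnosis.STALE_VISIBILITY`.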
> If i try to spin up a duplicate workflow on the same Id i get: WorkflowExecutionAlreadyStartedError: Workflow execution already started

This, I think, is the problem. Can you try this using the CLI as well (start a workflow execution with workflow ID n-xyb8tp8BdTqibWQdzNO9H from the CLI) and show the output, please?
Does this issue (being unable to start a workflow with this ID while the service responds to describe with not found) happen intermittently or consistently?
We are using MySQL 8 for visibility and ScyllaDB for persistence.
This happens very rarely but consistently: roughly one in a million workflows, as a ballpark.
Is there an easy way I can clear these from visibility so I can just spin up the workflow from scratch?
Have you tried deleting this execution with
tctl --namespace <ns_name> adm wf del -w <wfid> -r <runid>
This should try to delete it from both stores. Otherwise you'd have to remove it manually from the executions_visibility table in the temporal_visibility DB.
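If several executions are stuck this way, the command above can be assembled per workflow from a script. The following is a minimal sketch under stated assumptions: `build_delete_command` is a hypothetical helper, it only builds the argument list for the exact `tctl` invocation quoted above, and the namespace, workflow ID, and run ID have to be supplied by you (the run ID is not shown anywhere in this thread).

```python
import shlex
from typing import List


def build_delete_command(namespace: str, workflow_id: str, run_id: str) -> List[str]:
    """Build the argv for: tctl --namespace <ns_name> adm wf del -w <wfid> -r <runid>."""
    for name, value in (("namespace", namespace),
                        ("workflow_id", workflow_id),
                        ("run_id", run_id)):
        if not value or not value.strip():
            # Refuse to build a delete command with a missing identifier.
            raise ValueError(f"{name} must be a non-empty string")
    return [
        "tctl", "--namespace", namespace,
        "adm", "wf", "del",
        "-w", workflow_id,
        "-r", run_id,
    ]


def render(command: List[str]) -> str:
    """Render the argv as a shell-quoted string for review before running it."""
    return shlex.join(command)
```

Printing `render(build_delete_command(...))` first and reviewing it before executing anything is a deliberate choice here, since the delete is destructive; the list form can then be passed to `subprocess.run` without going through a shell.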
This solved the issue. Thank you so much 🙂