Killing ghost workflows

Hello,

I have “ghost” workflows running: I can see them logging (and failing), but they don’t show up in the UI or CLI.

I can also see them in the current_executions table in the database with state=2 (WORKFLOW_EXECUTION_STATE_RUNNING, I think).
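
For reference, this is roughly how I’m spotting them (a sketch assuming our PostgreSQL persistence store; the database name, table, and column details may differ by schema version):

psql -d temporal -c "SELECT namespace_id, workflow_id, run_id, state FROM current_executions WHERE state = 2;"  # names here are guesses from our setup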

What’s the best way to kill these workflows?

A few ideas I have:

  • rebuild the Elasticsearch visibility store - this should make these invisible workflows show up in the UI so we can terminate them there
  • update the current_executions record to one of these states (rough sketch of this below the list):
WORKFLOW_EXECUTION_STATE_COMPLETED WorkflowExecutionState = 3
WORKFLOW_EXECUTION_STATE_ZOMBIE    WorkflowExecutionState = 4
WORKFLOW_EXECUTION_STATE_VOID      WorkflowExecutionState = 5
WORKFLOW_EXECUTION_STATE_CORRUPTED WorkflowExecutionState = 6
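
If we went with the second option, I imagine it would look something like the statement below. This is only a sketch (again assuming PostgreSQL), and I’m not sure that editing current_executions directly is safe or supported, which is part of why I’m asking:

psql -d temporal -c "UPDATE current_executions SET state = 4 WHERE workflow_id = '<wfid>';"  # 4 = WORKFLOW_EXECUTION_STATE_ZOMBIE; untested, names are guesses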

Can you get the workflow and run IDs of these executions? If so, can you try to describe them via tctl or the CLI, for example:

tctl wf desc -w <wfid> -r <runid>

With that you could try to delete those executions with tctl/CLI as well, for example:

tctl adm wf delete -w <wfid> -r <runid>

or terminate them:

tctl wf term -w <wfid> -r <runid>

I think we need to look into why those executions were not moved to your visibility store. For that, can you look at service errors:

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)

and check service_type for frontend/history/matching around the time these executions started.
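
To check all three services at once, you could broaden the same query to something like:

sum(rate(service_error_with_type{service_type=~"frontend|history|matching"}[5m])) by (service_type, error_type)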

Thanks @tihomir. I was able to terminate the workflows.

I think initially we weren’t looking in the proper namespace (i.e. we hadn’t set TEMPORAL_CLI_NAMESPACE), and that, combined with our visibility issues, led us down the wrong path.
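
In case it helps anyone else, pointing tctl at the correct namespace was the missing piece for us, e.g. something like:

export TEMPORAL_CLI_NAMESPACE=<our-namespace>
tctl wf desc -w <wfid> -r <runid>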

As for the visibility store, these workflows were kicked off about 4 months ago. I know there were issues with an upgrade between then and now, which may have led to the ES corruption. So far, recent workflows have been showing up in the UI as expected.