Mismatch between workflow run_id reported on the UI and current_executions table

Hello all,

We ran into a case where the workflow run_id shown in the UI did not match the run_id in the current_executions table, which caused every command we issued from the UI to fail.

Summary of the problem

The image above depicts a workflow execution with a run_id of 4fb406f8-0594-465c-b88c-ba721d7d6335, but when we click reset, the command is issued against run_id fe5a8691-08a4-41b5-b669-2fa28661aeb4 and fails.

The run_id 4fb406f8-0594-465c-b88c-ba721d7d6335 matches that of the executions table, whereas the run_id fe5a8691-08a4-41b5-b669-2fa28661aeb4 matches that of the current_executions table.
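
For anyone wanting to check for the same condition, here is a minimal sketch of the lookup: it finds current_executions rows whose run_id has no matching row in executions. It runs against an in-memory SQLite database so it is self-contained. The column names (shard_id, namespace_id, workflow_id, run_id) are my assumptions about the schema and may not match your deployment, and the workflow IDs and the run_id "aaaa-healthy-run" are made up for the demo. Only the two run_ids are taken from the case above.

```python
import sqlite3


def find_dangling_current_executions(conn):
    """Return (workflow_id, run_id) for current_executions rows that have
    no matching row in executions. Column names are assumed, not verified."""
    query = """
        SELECT ce.workflow_id, ce.run_id
        FROM current_executions AS ce
        LEFT JOIN executions AS e
          ON e.shard_id = ce.shard_id
         AND e.namespace_id = ce.namespace_id
         AND e.workflow_id = ce.workflow_id
         AND e.run_id = ce.run_id
        WHERE e.run_id IS NULL
        ORDER BY ce.workflow_id
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Build a tiny in-memory database that reproduces the mismatch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE executions (
            shard_id INTEGER, namespace_id TEXT, workflow_id TEXT, run_id TEXT);
        CREATE TABLE current_executions (
            shard_id INTEGER, namespace_id TEXT, workflow_id TEXT, run_id TEXT);
    """)
    # Hypothetical healthy workflow: both tables agree on the run_id.
    healthy = (1, "ns", "wf-healthy", "aaaa-healthy-run")
    conn.execute("INSERT INTO executions VALUES (?, ?, ?, ?)", healthy)
    conn.execute("INSERT INTO current_executions VALUES (?, ?, ?, ?)", healthy)
    # Mismatched workflow: executions holds one run_id, while
    # current_executions points at a different one.
    conn.execute(
        "INSERT INTO executions VALUES (?, ?, ?, ?)",
        (1, "ns", "wf-mismatch", "4fb406f8-0594-465c-b88c-ba721d7d6335"))
    conn.execute(
        "INSERT INTO current_executions VALUES (?, ?, ?, ?)",
        (1, "ns", "wf-mismatch", "fe5a8691-08a4-41b5-b669-2fa28661aeb4"))
    return conn


if __name__ == "__main__":
    for workflow_id, run_id in find_dangling_current_executions(build_demo_db()):
        print(workflow_id, run_id)
```

Against a real deployment, the same LEFT JOIN would need to be adapted to the actual schema, and on a sharded setup it would presumably have to run per shard.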

I’ve also tried to terminate the workflow via tctl. The result was the same:

Error: Terminate workflow failed.
Error Details: Workflow executionsRow not found.  RunId: fe5a8691-08a4-41b5-b669-2fa28661aeb4

Because run_id 4fb406f8-0594-465c-b88c-ba721d7d6335 is not tracked in current_executions, that execution is effectively orphaned and is not making progress. What could cause this to happen?

Expected behavior

  • The workflow continues to make progress
  • Commands to cancel, terminate, or reset the workflow execution succeed

Thank you!

Adding a bit more context: We use sharded MySQL (Vitess) as our backing persistence layer.
Is it possible that a brief outage of one of the instances could cause inconsistencies in workflow data? Are the writes that create or update workflow executions transactional?
I’m curious what else could cause this inconsistency.