Recently we noticed a strange occurrence in the Temporal cluster in our test environment. One of our custom metrics-reporting applications failed while trying to fetch the workflow execution history for all running workflows: when it went through the temporal-system namespace, fetching the history returned a NOT_FOUND exception. To verify the error, I listed the running workflows using tctl and noticed two workflows with the same workflow ID:
bash-4.2$ tctl --namespace temporal-system workflow list --query 'ExecutionStatus="Running"'
WORKFLOW TYPE | WORKFLOW ID | RUN ID | TASK QUEUE | START TIME | EXECUTION TIME | END TIME
...temporal-sys-tq-scanner-workflow | temporal-sys-tq-scanner | 78bc4589-777d-429a-b176-0aaa86ae3dc9 | temporal-sys-tq-scanner-taskqueue-0 | 00:01:10 | 12:00:00 | 00:00:00
...ral-sys-history-scanner-workflow | temporal-sys-history-scanner | 064d3119-23b1-4e1f-a53c-e4d69f09f84a | temporal-sys-history-scanner-taskqueue-0 | 00:00:27 | 12:00:00 | 00:00:00
...temporal-sys-tq-scanner-workflow | temporal-sys-tq-scanner | 8fef3b86-c282-413f-b95a-e068c5c5b21f | temporal-sys-tq-scanner-taskqueue-0 | 12:00:00 | 00:00:00 | 00:00:00
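For anyone who wants to check for the same symptom, here is a minimal sketch of how the duplicate can be spotted from the listing. It assumes the pipe-delimited layout shown in the tctl output, with the workflow ID in the second column and the run ID in the third. The function name `find_duplicate_workflow_ids` is my own and hypothetical; it is not part of tctl or any Temporal SDK.

```python
from collections import defaultdict

# Rows copied from the tctl listing (workflow type | workflow ID | run ID).
# The header line is left out on purpose so it is not parsed as a workflow.
ROWS = """\
...temporal-sys-tq-scanner-workflow | temporal-sys-tq-scanner | 78bc4589-777d-429a-b176-0aaa86ae3dc9
...ral-sys-history-scanner-workflow | temporal-sys-history-scanner | 064d3119-23b1-4e1f-a53c-e4d69f09f84a
...temporal-sys-tq-scanner-workflow | temporal-sys-tq-scanner | 8fef3b86-c282-413f-b95a-e068c5c5b21f
"""


def find_duplicate_workflow_ids(text):
    """Group run IDs by workflow ID and keep only IDs with more than one run."""
    runs_by_id = defaultdict(list)
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        workflow_id, run_id = fields[1], fields[2]
        runs_by_id[workflow_id].append(run_id)
    return {wid: runs for wid, runs in runs_by_id.items() if len(runs) > 1}


duplicates = find_duplicate_workflow_ids(ROWS)
for workflow_id, run_ids in duplicates.items():
    print(workflow_id, run_ids)
```

With the rows above, only temporal-sys-tq-scanner is reported, with both of its run IDs.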
When I tried to describe the workflow with RunId 8fef3b86-c282-413f-b95a-e068c5c5b21f, I did indeed get the NOT_FOUND exception. The other task queue scanner workflow seemed to be fine, as I was able to get more details about it. The server logs also mention the faulty workflow:
"ts": "2023-09-25T00:00:00.425Z", "msg": "Workflow task not found", "service": "matching", "error": "Workflow executionsRow not found. WorkflowId: temporal-sys-tq-scanner, RunId: 8fef3b86-c282-413f-b95a-e068c5c5b21f"
Similar logs can be found in the history service as well. I also see that the metric reporting workflow timeouts increased at the time the workflow was supposed to be scheduled (at midnight).
Shouldn’t it be disallowed to execute two workflows with the same ID? Or are there different rules for schedules/workflows that run in the temporal-system namespace? The workflow (or at least the reference to it) seems to be stuck, while other executions with the same type and workflow ID continue to run properly (I can trace the reschedules for the other scanner workflow).
The workflow’s run timeout is set to 5 days. I listed the workflows again after that period and saw that 2 new temporal-sys-tq-scanner workflows had been scheduled and, again, one of them doesn’t have any execution.
What do you suggest could be the cause of this behaviour? What can we do to fix it?
Thanks in advance for your help.