Running workflows without execution in temporal-system namespace

wrnk_kaluza · October 4, 2023, 10:39am

Recently we noticed something strange in the Temporal cluster in our test environment. One of our custom metrics-reporting applications failed while fetching the workflow execution history for all running workflows: when it went through the temporal-system namespace, the history request returned a NOT_FOUND exception. I listed the running workflows using tctl to verify the error and noticed two running workflows with the same workflow ID:

bash-4.2$ tctl --namespace temporal-system workflow list --query 'ExecutionStatus="Running"'
             WORKFLOW TYPE            |         WORKFLOW ID          |                RUN ID                |                TASK QUEUE                | START TIME | EXECUTION TIME | END TIME  
  ...temporal-sys-tq-scanner-workflow | temporal-sys-tq-scanner      | 78bc4589-777d-429a-b176-0aaa86ae3dc9 | temporal-sys-tq-scanner-taskqueue-0      | 00:01:10   | 12:00:00       | 00:00:00  
  ...ral-sys-history-scanner-workflow | temporal-sys-history-scanner | 064d3119-23b1-4e1f-a53c-e4d69f09f84a | temporal-sys-history-scanner-taskqueue-0 | 00:00:27   | 12:00:00       | 00:00:00  
  ...temporal-sys-tq-scanner-workflow | temporal-sys-tq-scanner      | 8fef3b86-c282-413f-b95a-e068c5c5b21f | temporal-sys-tq-scanner-taskqueue-0      | 12:00:00   | 00:00:00       | 00:00:00
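
The duplicate check on the listing above can be sketched as follows. This is only a minimal illustration, assuming the listing has already been parsed into (workflow ID, run ID) pairs; `find_duplicate_running_ids` is a hypothetical helper, not part of tctl or any Temporal SDK.

```python
from collections import defaultdict


def find_duplicate_running_ids(rows):
    """Group running executions by workflow ID and keep IDs with more than one run.

    `rows` is an iterable of (workflow_id, run_id) pairs (an assumed input shape).
    """
    runs_by_id = defaultdict(list)
    for workflow_id, run_id in rows:
        runs_by_id[workflow_id].append(run_id)
    return {wid: runs for wid, runs in runs_by_id.items() if len(runs) > 1}


# Pairs copied from the tctl listing above.
running = [
    ("temporal-sys-tq-scanner", "78bc4589-777d-429a-b176-0aaa86ae3dc9"),
    ("temporal-sys-history-scanner", "064d3119-23b1-4e1f-a53c-e4d69f09f84a"),
    ("temporal-sys-tq-scanner", "8fef3b86-c282-413f-b95a-e068c5c5b21f"),
]

duplicates = find_duplicate_running_ids(running)
for workflow_id, run_ids in duplicates.items():
    print(workflow_id, run_ids)
```

With the three rows above, only temporal-sys-tq-scanner is flagged, with both of its run IDs.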

When I tried to describe the workflow with RunId 8fef3b86-c282-413f-b95a-e068c5c5b21f, I did indeed get the NOT_FOUND exception. The other task queue scanner workflow seemed fine, as I was able to get more details about it. There are also server logs mentioning the faulty workflow:

"ts": "2023-09-25T00:00:00.425Z",
"msg": "Workflow task not found", 
"service": "matching",
"error": "Workflow executionsRow not found. WorkflowId: temporal-sys-tq-scanner, RunId: 8fef3b86-c282-413f-b95a-e068c5c5b21f"

Similar logs can be found in the history service as well. I also see that the metric reporting workflow timeouts increased at the time when the workflow was supposed to be scheduled (at midnight).

Shouldn’t it be disallowed to run two workflows with the same ID at the same time? Or are there different rules for schedules/workflows that run in the temporal-system namespace? The workflow (or at least the reference to it) seems to be stuck, while other executions of the same type and workflow ID continue to run properly (I can trace the reschedules for the other scanner workflow).

The workflow’s run timeout is set to 5 days. I listed the workflows after that period and saw that two new temporal-sys-tq-scanner workflows had been scheduled and, again, one of them doesn’t have any execution.

What do you suggest could be the cause of this behaviour? What can we do to fix it?

Thanks in advance for your help.
