Workflow tasks stuck after db outage

Zigmas_Satkevicius · December 5, 2023, 3:40pm

Hello!

We had a short outage of our persistence DB. During that time some workflow tasks got stuck. It seems that after we got everything up and running again, they are simply not being put back into the task queues. Why do these never time out? I managed to reset some of them, but for others the reset returned the error below; all of the failing ones have only two events in their history. How do I reset workflows like that? How do I prevent this from happening? Is it expected that these zombie workflows are created when the DB starts lagging?

Temporal: 1.22.2
DB: Aurora Postgres 15.4 (AWS)

Workflow task finish ID must be > 1 && <= workflow last event ID.
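
To make the condition in that message concrete, here is a minimal Python sketch of the range check exactly as the error text states it. The function name is hypothetical, and this mirrors only the message wording; the server may well apply further checks that I have not reproduced here.

```python
def is_valid_reset_point(workflow_task_finish_id: int, last_event_id: int) -> bool:
    """Return True when the finish ID satisfies the condition quoted in the error.

    Condition from the message: finish ID must be > 1 and <= last event ID.
    Hypothetical helper for illustration, not Temporal's own code.
    """
    return 1 < workflow_task_finish_id <= last_event_id


if __name__ == "__main__":
    # Try every candidate ID against a two-event history, then a three-event one.
    for last_event_id in (2, 3):
        for candidate in range(0, last_event_id + 2):
            ok = is_valid_reset_point(candidate, last_event_id)
            print(f"last_event_id={last_event_id} finish_id={candidate} valid={ok}")
```

By this condition alone, a history with only two events leaves very little room for a valid finish ID, and each event appended afterwards (a signal, for example) raises the upper bound by one.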

Neither our metrics nor anything else reports these tasks/activities as pending.

Data:

Workflow stuck:

/etc/temporal$ tctl adm tq desc --taskqueue saga_command_bus --tqt workflow
READ LEVEL | ACK LEVEL | BACKLOG | LEASE START TASKID | LEASE END TASKID  
    11601434 |  11601434 |       0 |           11600001 |         11700000  

                WORKFLOW POLLER IDENTITY                |   LAST ACCESS TIME    
  saga_command_bus:8c75b334-03a3-4264-80dd-fc3f082400f3 | 2023-12-05T13:48:41Z  
  saga_command_bus:873c92dd-6248-4fc9-9fbd-5e700dca8388 | 2023-12-05T13:48:41Z  
  saga_command_bus:2bd056e5-59a8-4c3d-b684-8e28d4de18a4 | 2023-12-05T13:48:41Z  
  saga_command_bus:0d839679-4ec0-43ad-928c-6463ee3130f6 | 2023-12-05T13:48:41Z  
  saga_command_bus:efa0a885-6f07-4021-bdb1-fb5d39d485fe | 2023-12-05T13:48:41Z  
  saga_command_bus:4c20f491-0a34-43cc-a9c4-955a78a32c77 | 2023-12-05T13:48:41Z  
  saga_command_bus:b9d46eab-49d4-40d0-a3d1-fe09f193b4c9 | 2023-12-05T13:48:41Z  
  saga_command_bus:ae9dcae6-fed8-4a25-b748-54f24d408f84 | 2023-12-05T13:48:41Z  
  saga_command_bus:a4e9cb05-b23b-4995-8ede-f7013c6bd8fc | 2023-12-05T13:48:41Z  
  saga_command_bus:1799ba7e-0995-405e-92e4-99539d7eb6a6 | 2023-12-05T13:48:40Z  
  saga_command_bus:8a7339b3-d3a1-4804-be6d-4c23a58f3e69 | 2023-12-05T13:48:40Z  
  saga_command_bus:819bd814-8bd3-45b7-bfb8-f03595460121 | 2023-12-05T13:48:40Z  
  saga_command_bus:0d568eed-8039-48c0-b230-729753cdd8f5 | 2023-12-05T13:48:40Z  
  saga_command_bus:99c8b076-276a-4ae7-80e6-bf0d634c0dd2 | 2023-12-05T13:48:40Z  
  saga_command_bus:fbef68b7-c08a-4695-b67f-ea36bc702f22 | 2023-12-05T13:48:40Z  
  saga_command_bus:30f6c2a2-0de2-41b9-bea5-f8ca69f7752c | 2023-12-05T13:48:40Z  
  saga_command_bus:ec10623e-8dfc-48e7-9105-07092fa9774e | 2023-12-05T13:48:40Z  
  saga_command_bus:8489eaba-d68d-44f5-8957-37fd87945bad | 2023-12-05T13:48:40Z  
  saga_command_bus:29f5139c-62c8-4f0c-a9cf-85be2e738c62 | 2023-12-05T13:48:39Z  
  saga_command_bus:ecaa5372-504a-4091-9570-8177be338a20 | 2023-12-05T13:48:39Z  
  saga_command_bus:ea8ed384-b065-465c-9661-8dc80c5607bd | 2023-12-05T13:48:39Z  
  saga_command_bus:181c7d03-6a84-498e-aabe-8474387c5cf0 | 2023-12-05T13:48:39Z  
  saga_command_bus:93034fe6-3694-400f-8a61-997c5582f213 | 2023-12-05T13:48:38Z

Zigmas_Satkevicius · December 5, 2023, 3:57pm

I managed to reset them by first sending a signal and then resetting from that signal event. The question remains: how do we mitigate this when the DB fails? Are there any logs for this? Is this expected, or is something wrong with our configuration?
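
For anyone wanting to script the same workaround, below is a rough Python sketch that only builds the two `tctl` command lines (signal first, then reset) and prints them for review. The workflow ID, signal name, event ID and reason are placeholders, and the helper name is hypothetical. The flags (`--workflow_id`, `--name`, `--event_id`, `--reason`) are written from memory of the tctl CLI, so treat them as an assumption and check them against `tctl workflow signal --help` and `tctl workflow reset --help` for your version before running anything.

```python
import shlex
from typing import List, Tuple


def build_signal_then_reset_commands(
    workflow_id: str,
    signal_name: str,
    reset_event_id: int,
    reason: str,
) -> Tuple[List[str], List[str]]:
    """Build argv lists for the signal-then-reset workaround.

    Nothing is executed here; the caller decides whether to run the commands.
    Flag names are assumptions to be verified against the installed tctl.
    """
    signal_cmd = [
        "tctl", "workflow", "signal",
        "--workflow_id", workflow_id,
        "--name", signal_name,
    ]
    reset_cmd = [
        "tctl", "workflow", "reset",
        "--workflow_id", workflow_id,
        "--event_id", str(reset_event_id),
        "--reason", reason,
    ]
    return signal_cmd, reset_cmd


if __name__ == "__main__":
    # Placeholder values; look up the real event ID in the workflow history
    # after the signal has been recorded.
    commands = build_signal_then_reset_commands(
        workflow_id="example-workflow-id",
        signal_name="unstick",
        reset_event_id=3,
        reason="stuck after persistence DB outage",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Printing instead of executing is deliberate: the event ID to reset from has to be read from the workflow history after the signal lands, so it seems safer to review each command before running it.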
