Workflow tasks stuck after DB outage

Hello!

We had a short outage of our persistence DB, and during that time some workflow tasks got stuck. After we got everything up and running again, those tasks never seem to get re-inserted into the task queues. Why do they never time out? I managed to reset some of the affected workflows, but others returned an error; all of those have only two events in their history. How do I reset workflows like that, and how do I prevent this from happening? Is it expected that these zombie workflows are created when the DB starts lagging?

Temporal: 1.22.2
DB: Aurora Postgres 15.4 (AWS)

The error on the failed resets is:

Workflow task finish ID must be > 1 && <= workflow last event ID.
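
For reference, this is roughly the reset command that fails with that error (the workflow ID is a placeholder and the flags are from memory):

/etc/temporal$ tctl workflow reset --workflow_id <stuck-workflow-id> \
    --reset_type FirstWorkflowTask --reason "unstick after db outage"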

Neither our metrics nor anything else reports these tasks/activities as pending.
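
For example, describing one of the stuck executions doesn't report anything pending either (workflow ID is a placeholder):

/etc/temporal$ tctl workflow describe --workflow_id <stuck-workflow-id>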

Data:

Task queue of the stuck workflows:

/etc/temporal$ tctl adm tq desc --taskqueue saga_command_bus --tqt workflow
READ LEVEL | ACK LEVEL | BACKLOG | LEASE START TASKID | LEASE END TASKID  
    11601434 |  11601434 |       0 |           11600001 |         11700000  

                WORKFLOW POLLER IDENTITY                |   LAST ACCESS TIME    
  saga_command_bus:8c75b334-03a3-4264-80dd-fc3f082400f3 | 2023-12-05T13:48:41Z  
  saga_command_bus:873c92dd-6248-4fc9-9fbd-5e700dca8388 | 2023-12-05T13:48:41Z  
  saga_command_bus:2bd056e5-59a8-4c3d-b684-8e28d4de18a4 | 2023-12-05T13:48:41Z  
  saga_command_bus:0d839679-4ec0-43ad-928c-6463ee3130f6 | 2023-12-05T13:48:41Z  
  saga_command_bus:efa0a885-6f07-4021-bdb1-fb5d39d485fe | 2023-12-05T13:48:41Z  
  saga_command_bus:4c20f491-0a34-43cc-a9c4-955a78a32c77 | 2023-12-05T13:48:41Z  
  saga_command_bus:b9d46eab-49d4-40d0-a3d1-fe09f193b4c9 | 2023-12-05T13:48:41Z  
  saga_command_bus:ae9dcae6-fed8-4a25-b748-54f24d408f84 | 2023-12-05T13:48:41Z  
  saga_command_bus:a4e9cb05-b23b-4995-8ede-f7013c6bd8fc | 2023-12-05T13:48:41Z  
  saga_command_bus:1799ba7e-0995-405e-92e4-99539d7eb6a6 | 2023-12-05T13:48:40Z  
  saga_command_bus:8a7339b3-d3a1-4804-be6d-4c23a58f3e69 | 2023-12-05T13:48:40Z  
  saga_command_bus:819bd814-8bd3-45b7-bfb8-f03595460121 | 2023-12-05T13:48:40Z  
  saga_command_bus:0d568eed-8039-48c0-b230-729753cdd8f5 | 2023-12-05T13:48:40Z  
  saga_command_bus:99c8b076-276a-4ae7-80e6-bf0d634c0dd2 | 2023-12-05T13:48:40Z  
  saga_command_bus:fbef68b7-c08a-4695-b67f-ea36bc702f22 | 2023-12-05T13:48:40Z  
  saga_command_bus:30f6c2a2-0de2-41b9-bea5-f8ca69f7752c | 2023-12-05T13:48:40Z  
  saga_command_bus:ec10623e-8dfc-48e7-9105-07092fa9774e | 2023-12-05T13:48:40Z  
  saga_command_bus:8489eaba-d68d-44f5-8957-37fd87945bad | 2023-12-05T13:48:40Z  
  saga_command_bus:29f5139c-62c8-4f0c-a9cf-85be2e738c62 | 2023-12-05T13:48:39Z  
  saga_command_bus:ecaa5372-504a-4091-9570-8177be338a20 | 2023-12-05T13:48:39Z  
  saga_command_bus:ea8ed384-b065-465c-9661-8dc80c5607bd | 2023-12-05T13:48:39Z  
  saga_command_bus:181c7d03-6a84-498e-aabe-8474387c5cf0 | 2023-12-05T13:48:39Z  
  saga_command_bus:93034fe6-3694-400f-8a61-997c5582f213 | 2023-12-05T13:48:38Z  

I managed to reset the remaining ones by first sending a signal and then resetting them from the resulting event (rough commands below). The question remains: how do we mitigate this when the DB fails? Are there any logs for this? Is this expected, or is something wrong with our configuration?
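For anyone who hits the same thing, the workaround was roughly the following (workflow ID, run ID, event ID and signal name are placeholders, and the flags are from memory):

/etc/temporal$ tctl workflow signal --workflow_id <stuck-workflow-id> --name wake-up
/etc/temporal$ tctl workflow reset --workflow_id <stuck-workflow-id> --run_id <run-id> \
    --event_id <new-workflow-task-completed-event-id> --reason "recover after db outage"

I assume this works because the signal pushes new workflow task events into the history, so the reset finally has a valid event to point at.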
