Hello!
I’m new to Temporal and looking for guidance on a situation I encountered today. What should I do differently next time?
For irrelevant reasons, I found myself in the position of needing to get rid of ~20K workflows. I used a batch terminate operation from the UI to do this.
When the operation was done (and a subsequent operation to handle late arriving workflows), remaining workflows were ignored for 45m-60m. While trying to figure out what was going on, I noticed that the matching service was issuing a lot of logs to the effect of “Workflow task not found: workflow execution already completed”
I used the Temporal CLI and found that the task queue in question seemed to be working through an invisible backlog of some sort. The CLI output looked like so:
$ temporal task-queue describe --task-queue MY_TASK_QUEUE
Task Queue Statistics:
BuildID TaskQueueType ApproximateBacklogCount ApproximateBacklogAge BacklogIncreaseRate TasksAddRate TasksDispatchRate
UNVERSIONED workflow 17237 1h 57m 40.626115768s -3.4689913 0 3.4689913
UNVERSIONED activity 0 0s 0 0 0
Pollers:
BuildID TaskQueueType Identity LastAccessTime RatePerSecond
UNVERSIONED workflow 1@temporal-workflow-worker-deployment-5cbbc4997d-jkl7z 1 minute ago 100000
UNVERSIONED activity 1@temporal-workflow-worker-deployment-5cbbc4997d-jkl7z 32 seconds ago 100000
I couldn’t figure out a way around this. I eventually increased replicas of the workflow worker and matching engine deployments to try to work through the backlog faster.
What is going on? What should I have done differently? Perhaps requesting cancellation instead of terminating would have helped, but in some ways that strikes me as deciding between waiting now and waiting later.
Thanks in advance for any suggestions!