Invisible task queue backlog after mass termination

Hello!

I’m new to Temporal and looking for guidance on a situation I encountered today. What should I do differently next time?

For reasons that aren’t relevant here, I found myself needing to get rid of ~20K workflows. I used a batch terminate operation from the UI to do this.
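
(For what it’s worth, I did this through the UI, but I believe the CLI equivalent is roughly the following; the query here is made up for illustration:)

# --query turns this into a batch operation over all matching workflows
$ temporal workflow terminate \
    --query 'WorkflowType="MyWorkflow" AND ExecutionStatus="Running"' \
    --reason "bulk cleanup"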

When the operation was done (along with a follow-up operation to handle late-arriving workflows), the remaining workflows were ignored for 45-60 minutes. While trying to figure out what was going on, I noticed that the matching service was emitting a lot of logs to the effect of “Workflow task not found: workflow execution already completed”.

I used the Temporal CLI and found that the task queue in question seemed to be working through an invisible backlog of some sort. The CLI output looked like so:

$ temporal task-queue describe --task-queue MY_TASK_QUEUE
Task Queue Statistics:
  BuildID      TaskQueueType  ApproximateBacklogCount  ApproximateBacklogAge  BacklogIncreaseRate  TasksAddRate  TasksDispatchRate
  UNVERSIONED  workflow       17237                    1h 57m 40.626115768s   -3.4689913           0             3.4689913
  UNVERSIONED  activity       0                        0s                     0                    0             0
Pollers:
  BuildID      TaskQueueType  Identity                                                LastAccessTime  RatePerSecond
  UNVERSIONED  workflow       1@temporal-workflow-worker-deployment-5cbbc4997d-jkl7z  1 minute ago    100000
  UNVERSIONED  activity       1@temporal-workflow-worker-deployment-5cbbc4997d-jkl7z  32 seconds ago  100000

I couldn’t figure out a way around this. I eventually increased the replica counts of the workflow worker and matching service deployments to try to work through the backlog faster.
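
The scaling itself was just plain kubectl, roughly the following (the deployment names are from my install and will differ per Helm release):

# scale the worker deployment that polls MY_TASK_QUEUE
$ kubectl scale deployment/temporal-workflow-worker-deployment --replicas=4
# scale the Temporal matching service installed by the Helm chart
$ kubectl scale deployment/temporal-matching --replicas=3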

What is going on? What should I have done differently? Perhaps requesting cancellation instead of terminating would have helped, but in some ways that strikes me as deciding between waiting now and waiting later.

Thanks in advance for any suggestions!

It sounds like you had a backlog of tasks for the terminated workflows. These tasks had to be dropped, which required a DB call for each one. 20K DB calls shouldn’t take an hour unless your system is severely underprovisioned or doesn’t have enough history shards.

Thanks for the reply @maxim!

It seems like you are right: each of the log messages referenced a different task ID.

Any suggestions for how I might reduce the blast radius of something like this in the future? Perhaps I could have done something so that fewer workflows had pending tasks, but I’m not sure how I would control that.
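
For example, would it have been gentler to terminate in smaller, query-scoped slices and let the matching backlog drain between slices? A rough sketch of what I mean (the query and slice boundary are made up):

# terminate one slice at a time instead of all ~20K at once
$ temporal workflow terminate \
    --query 'WorkflowType="MyWorkflow" AND StartTime < "2024-01-01T00:00:00Z"' \
    --reason "bulk cleanup, slice 1"
# check that the backlog has drained before starting the next slice
$ temporal task-queue describe --task-queue MY_TASK_QUEUE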

What is responsible for handling the pending tasks? I’d guess the workers assigned to the task queue must handle them, so scaling up the workers polling that task queue should help work through the backlog faster, modulo the number of available task queue partitions.
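
If partitions do become the limiting factor, am I right that the knob is matching’s dynamic config? Something like this in the Helm values is my best guess from the docs (untested; key names and placement may be off):

server:
  dynamicConfig:
    # more read/write partitions allow more parallel dispatch for a single task queue
    matching.numTaskqueueReadPartitions:
      - value: 8
        constraints:
          taskQueueName: "MY_TASK_QUEUE"
    matching.numTaskqueueWritePartitions:
      - value: 8
        constraints:
          taskQueueName: "MY_TASK_QUEUE"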

You are right that my current setup is somewhat underprovisioned in that it uses PostgreSQL for the backend DB. That said, PostgreSQL didn’t seem to be under any significant load while the pending tasks were being worked through, so I’m not sure how much to blame it.

I’d rather be taught to fish than given a fish, so if there are any reference materials anyone would recommend to help me get better at diagnosing this sort of thing myself, I’d be happy to take a look at them.

Thanks for the help!

How many shards does the cluster use?

I’ve deployed to k8s using the official Helm chart. I didn’t make any adjustments related to sharding, so it looks like I’m on the default of 512.
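
For completeness, I believe this is the relevant value in the chart, and my understanding is that it can’t be changed after the cluster’s schema is first initialized:

server:
  config:
    # history shard count; fixed at cluster creation time
    numHistoryShards: 512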

You probably want to look at the DB latencies and other metrics to further investigate this.
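
For example, if you scrape the server’s Prometheus metrics, something like this gives a rough per-operation view of persistence latency (exact metric names can vary by version and metrics setup):

# p95 persistence (DB) latency by operation over the last 5 minutes
histogram_quantile(
  0.95,
  sum(rate(persistence_latency_bucket[5m])) by (operation, le)
)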