We have one task queue in our Temporal cluster that appears to be stuck in an odd state – activities scheduled on it are sometimes picked up and completed instantly, but other times sit in the Scheduled state and aren’t picked up by a worker until hours later. The schedule-to-start latency on this queue keeps climbing, but we don’t appear to be bottlenecked on any other metric:
Our workers are underutilized (low CPU/memory usage and plenty of available task slots)
We’re processing far fewer than 400 tasks/second on this queue (closer to 50/second), even though we’ve sustained upwards of 800 tasks/second in the past
We have no resource exhausted errors
Our underlying persistence store is healthy
We’re on the latest versions of both Temporal and the TypeScript SDK for our workers
This state appeared after a period of extremely high load, during which we tried adding more partitions to the task queue (going from 4 → 8) and then reverted the change since it seemed to decrease task throughput.
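For context, per-task-queue partition counts are controlled through dynamic config; the change we tried was along the lines of the sketch below, using the standard matching.numTaskqueueReadPartitions / matching.numTaskqueueWritePartitions knobs (both default to 4) – shown here unconstrained for brevity, though in practice you’d scope them to the affected task queue.
"matching.numTaskqueueReadPartitions" = [{
# Default 4
value = 8
}]
"matching.numTaskqueueWritePartitions" = [{
# Default 4
value = 8
}]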
Resetting some of the stuck workflows in an attempt to reschedule the activities works for some executions, but not others.
What might be going wrong, and how can we fix it?
UPDATE: this seemed to resolve itself after several more hours of slowly processing old tasks. We tried tuning history.[timer|transfer]ProcessorMaxPollHostRPS to 6000, but this didn’t seem to have any effect. Our current hypothesis is that the activity tasks that failed to schedule during the high-load period (because they ran into matching RPS limits) entered some sort of exponential scheduling backoff that lasted multiple hours – does that seem like a reasonable conclusion? If so, is this behavior tunable?
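For reference, the host-level RPS override we tried (expanded from the shorthand above) looked roughly like this, in the same dynamic-config format as the values we list further down:
"history.transferProcessorMaxPollHostRPS" = [{
# Value we tried during the incident; no observable effect
value = 6000
}]
"history.timerProcessorMaxPollHostRPS" = [{
# Value we tried during the incident; no observable effect
value = 6000
}]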
After doing a deep dive into the Temporal source & meeting with some folks from Temporal, we concluded the following:
When the history service hits ResourceExhausted errors while trying to schedule activity or workflow tasks to the matching service, it retries in memory with a backoff capped at 5 minutes per retry.
However, if the history service process owning a shard dies or runs out of resources (potentially evicting in-memory workflow state? ← unconfirmed), then the new process that picks up that shard starts slowly scanning all of the active executions for that shard from persistence.
By default this scan proceeds at roughly 100 tasks or workflows (we’re unsure which) per minute, which explains the incredibly slow rate at which our activity task backlog was being scheduled & processed – at that rate, even a modest backlog takes hours to drain, which lines up with what we saw.
We’ve made a few changes to mitigate these issues:
Increased the matching RPS limit – our matching pods appear able to handle more requests without causing issues or running into the persistence QPS limit, so raising this limit helped with task throughput.
Moved our Temporal server pods from Spot to on-demand EC2 instances – this should reduce how often history shards move between processes and we have to fall back to scanning from persistence.
Tuned our Temporal server pod autoscaling to scale down at most 1 pod per 2 minutes – same rationale as above.
Tuned the persistence scanning process – so that when we do hit limits under spiky load, we can still churn through tasks quickly instead of ending up with a large backlog and underutilized workers.
These are the values we’ve currently settled on:
"history.transferTaskBatchSize" = [{
# Default 100
value = 400
}]
"history.transferProcessorMaxPollRPS" = [{
# Default 20
value = 500
}]
"history.transferProcessorMaxPollInterval" = [{
# Default 1 min
value = "30s"
}]
"history.timerTaskBatchSize" = [{
# Default 100
value = 400
}]
"history.timerProcessorMaxPollRPS" = [{
# Default 20
value = 500
}]
"history.timerProcessorMaxPollInterval" = [{
# Default 5 min
value = "30s"
}]
"matching.rps" = [{
# Default 1200
value = 3600
}]
We haven’t seen load like the one that prompted this post since tuning these parameters, but the cluster does appear to handle small spikes in workload better than before (we no longer see a long tail of workflows that take a while to finish after a spike).