Scaling Strategy For Workers With Rate-Limited Task Queues

My team is required to rate-limit requests to our dependencies to RPS values agreed upon with the dependency owners. We thought the easiest way to implement this rate-limiting would be to create a dedicated task queue for each activity type and use setMaxTaskQueueActivitiesPerSecond.

We are using the default of 200 for the max concurrent workflow/activity execution sizes.
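For context, the setup described above (one worker per dedicated task queue, with a queue-wide dispatch limit) looks roughly like this in the Java SDK. This is a configuration sketch, not a drop-in implementation: the queue name, rate value, and the local service stubs are placeholders, and it needs a running Temporal service plus registered activity implementations to do anything.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class RateLimitedWorker {
    public static void main(String[] args) {
        WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
        WorkflowClient client = WorkflowClient.newInstance(service);
        WorkerFactory factory = WorkerFactory.newInstance(client);

        WorkerOptions options = WorkerOptions.newBuilder()
            // Queue-wide limit enforced by the Temporal service: at most
            // this many activity tasks per second are dispatched across ALL
            // workers polling this queue. The 5.0 here is a placeholder for
            // whatever RPS you agreed with the dependency owner.
            .setMaxTaskQueueActivitiesPerSecond(5.0)
            // The defaults mentioned above, shown explicitly for clarity.
            .setMaxConcurrentActivityExecutionSize(200)
            .setMaxConcurrentWorkflowTaskExecutionSize(200)
            .build();

        // One dedicated queue per activity type, per the strategy above;
        // "payment-api-activities" is a made-up example name.
        Worker worker = factory.newWorker("payment-api-activities", options);
        // worker.registerActivitiesImplementations(...);
        factory.start();
    }
}
```

Because setMaxTaskQueueActivitiesPerSecond is enforced server-side across the whole task queue, adding worker replicas does not raise the effective dependency RPS, which is what makes the scaling question below tricky.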

I’m seeking advice on how best to scale our workers. We are scaling based on Prometheus metrics.

  • The activity schedule-to-start latency metric would normally be a good gauge of whether workers are becoming backlogged with tasks, but it is no longer meaningful for us once the task queues are rate-limited to values lower than the max concurrent workflow/activity execution sizes.
  • If we scale based on memory/CPU only, then the workers won’t necessarily scale out when all of their slots are filled with workflow/activity executions, if those executions are not very resource-intensive.
  • It looks like task queues have an ApproximateBacklogSize, but we would need to expose it for every task queue and scrape it to use it as a Prometheus metric.

The third option seems like the “most correct” solution, but I’m not sure we would have time to implement and test it. So please let me know if I’m missing anything; any advice is appreciated.

  • The activity schedule-to-start latency metric would normally be a good gauge of whether workers are becoming backlogged with tasks, but it is no longer meaningful for us once the task queues are rate-limited to values lower than the max concurrent workflow/activity execution sizes.

Correct: by setting task queue dispatch limits you are creating a natural backlog of activity tasks, so schedule-to-start latency will grow even when your workers are perfectly healthy.

  • If we scale based on memory/CPU only, then the workers won’t necessarily scale out when all of their slots are filled with workflow/activity executions, if those executions are not very resource-intensive.

I believe that’s also correct. The question is finding the maximum number of activity task slots a single worker can process before hitting your CPU limits. It’s something you could load-test while monitoring your temporal_worker_task_slots_available metric for worker_type=ActivityWorker. If all 200 slots are in use and you are only at around 10% CPU, I think you should consider increasing your worker’s activity task slots.
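As a rough illustration of that tuning step, here is a self-contained sketch (the method name, numbers, and the linear assumption are all made up for illustration) of estimating a slot count from an observed CPU utilization:

```java
public class SlotSizing {
    // Naive linear estimate: if currentSlots busy slots produce observedCpu
    // utilization, assume CPU scales roughly linearly with slot count up to
    // targetCpu. Real activities are rarely perfectly linear, so treat this
    // as a starting point for a load test, not a formula.
    static int estimateMaxSlots(int currentSlots, double observedCpu, double targetCpu) {
        if (observedCpu <= 0) {
            return currentSlots; // no signal; leave the setting alone
        }
        return (int) Math.floor(currentSlots * (targetCpu / observedCpu));
    }

    public static void main(String[] args) {
        // 200 busy slots at ~10% CPU, targeting ~70% CPU:
        System.out.println(estimateMaxSlots(200, 0.10, 0.70)); // prints 1400
    }
}
```

You would then load-test around that estimate and pick the largest slot count that keeps CPU under your limit, rather than trusting the linear extrapolation directly.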

  • It looks like task queues have an ApproximateBacklogSize, but we would need to expose it for every task queue and scrape it to use it as a Prometheus metric.

Yeah, there is a sample here if it helps.

You could also consider scaling on temporal_worker_task_slots_available for worker_type=ActivityWorker, e.g. scaling out when it stays at 0 for a chosen duration of time.
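A Prometheus-style rule for that last idea might look like the following sketch. The rule name, threshold, duration, and severity label are placeholders, and the exact metric labels depend on how your SDK's metrics tags are configured:

```yaml
# Hypothetical alerting/scaling rule: fire when the activity worker has had
# zero free task slots for 2 minutes, i.e. it is saturated.
groups:
  - name: temporal-worker-scaling
    rules:
      - alert: ActivityWorkerSlotsExhausted
        expr: temporal_worker_task_slots_available{worker_type="ActivityWorker"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Activity worker out of task slots; consider scaling out"
```

Note the caveat from your first bullet still applies in reverse here: with a queue-wide rate limit, exhausted slots mean the workers are saturated, but free slots do not necessarily mean the backlog is drained.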