Need help understanding which parameters to tune for scaling


We have a bursty workload where we occasionally kick off a batch of workflows, each of which kicks off a batch of activities on a specific task queue (approximate numbers in the graphs below). Each of these activities has a fairly short runtime (<5 ms on average).

We’re observing behavior where we scale our activity workers up to 100 pods (each running one worker process) based on activity schedule-to-start latency. However, each worker only seems to be processing 2-3 tasks concurrently (based on the available activity task slots metric) and is underutilized in both CPU and memory. Yet average schedule-to-start latency on this task queue continues to increase.
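For context, a quick back-of-the-envelope on why the workers themselves don't look like the bottleneck (numbers taken from the description above; this is just arithmetic, not measured throughput):

```python
# Rough ceiling on fleet throughput at the observed concurrency.
workers = 100              # pods, one worker process each
busy_slots_per_worker = 3  # ~2-3 activities in flight per worker
avg_activity_sec = 0.005   # <5 ms average activity runtime

# Even at this low per-worker concurrency, the fleet could complete on the
# order of workers * busy_slots_per_worker / avg_activity_sec activities/s.
ceiling = workers * busy_slots_per_worker / avg_activity_sec
print(ceiling)  # ~60,000 activities/s
```

So at these numbers the fleet's theoretical completion rate is far above anything in our graphs, which is why I suspect task delivery rather than task execution.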

Graphs (in 1-minute increments) of activities received, schedule_to_start_latency, and resource-exhausted errors. Our activities-processed graph tracks the activities-received graph almost exactly.

From my understanding, there are a few parameters we can tune here:

  • # of partitions for the task queue
  • Number of pollers on our activity workers
  • Matching RPS dynamic config
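For reference, here's a sketch of what I think the dynamic config overrides for the first and third knobs look like (key names based on Temporal's dynamicconfig samples; the task queue name and values are placeholders I made up, and the semantics of `matching.rps` should be verified against your server version):

```yaml
# Hypothetical overrides - verify key names against your server version.
matching.numTaskqueueReadPartitions:
  - value: 8
    constraints:
      taskQueueName: "my-bursty-task-queue"
matching.numTaskqueueWritePartitions:
  - value: 8
    constraints:
      taskQueueName: "my-bursty-task-queue"
matching.rps:
  - value: 2000   # per-matching-host RPS limit (assumed semantics)
```

The second knob (poller count) lives on the worker side instead, e.g. `MaxConcurrentActivityTaskPollers` in the Go SDK's `worker.Options`.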

What behavior would we expect to change when tuning each of these parameters? And why does the schedule-to-start latency metric keep climbing even when barely any new activities are coming in?

It seems like there may be an exponential backoff on scheduling activities once we hit the RPS limit for AddActivityTask on the matching service. Is this what's causing the observed behavior, where we have ample resources provisioned after the initial spike but still can't process activities at the expected rate?
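To convince myself a rate-capped backlog alone can explain the rising metric, I put together a toy FIFO model (pure Python; the arrival rate, dispatch cap, and queue behavior here are made-up assumptions for illustration, not Temporal internals):

```python
from collections import deque

def simulate(arrivals, cap_rps, seconds):
    """Mean schedule-to-start latency (in seconds) reported each second,
    assuming FIFO dispatch capped at cap_rps tasks/second."""
    queue = deque()   # schedule timestamps of pending tasks
    reported = []
    for t in range(seconds):
        if t < len(arrivals):
            queue.extend([t] * arrivals[t])   # tasks scheduled at time t
        # Dispatch up to cap_rps of the oldest tasks and record their waits.
        waits = [t - queue.popleft() for _ in range(min(cap_rps, len(queue)))]
        reported.append(sum(waits) / len(waits) if waits else 0.0)
    return reported

# A 10 s burst of 5,000 tasks/s, then silence, drained at only 1,000/s:
lat = simulate([5000] * 10, 1000, 60)
print(lat[5], lat[20], lat[40])  # → 4.0 16.0 32.0
```

If something like this is happening, the reported schedule-to-start latency rising long after arrivals stop is just the FIFO backlog draining: tasks dispatched later sat in the queue longer, so each minute's dispatched tasks report a larger latency even though almost nothing new is arriving.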