Detailed Context and Problem Statement
We have close to 2000 workers per namespace in a cluster (2 namespaces, to be specific), with each worker associated with a unique task queue. These workers are long-lived, and we intentionally do not stop them after their workflows complete.
However, this setup is causing some serious issues:
- We’re frequently encountering `ResourceExhausted` errors for the namespaces that have the larger number of workers and task queues.
- Our database writer instance is under heavy load, with CPU utilization spiking to 98–99%. (DB specs: `db.r6g.large`)
It seems that the combination of a large number of long-running workers, each with its own unique task queue, might be contributing to this resource exhaustion and database strain.
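To put rough numbers on the scale, here is a back-of-the-envelope sketch. The worker and namespace counts come from our setup above (rounded to 2000); the poller counts per worker are hypothetical placeholders, not values we have confirmed in our worker configuration. It only counts task queues and open long-poll requests, and I'm not claiming this is what actually drives the DB load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterShape:
    namespaces: int
    workers_per_namespace: int
    # Hypothetical placeholders: substitute the poller settings your workers actually use.
    workflow_pollers_per_worker: int
    activity_pollers_per_worker: int

    def task_queues(self) -> int:
        # One unique task queue per worker, as in our setup.
        return self.namespaces * self.workers_per_namespace

    def open_long_polls(self) -> int:
        # Each worker keeps its workflow and activity pollers open even when idle.
        pollers = self.workflow_pollers_per_worker + self.activity_pollers_per_worker
        return self.task_queues() * pollers


shape = ClusterShape(
    namespaces=2,
    workers_per_namespace=2000,
    workflow_pollers_per_worker=2,  # assumed value for illustration
    activity_pollers_per_worker=2,  # assumed value for illustration
)
print("task queues:", shape.task_queues())
print("open long polls:", shape.open_long_polls())
```

With these assumed poller counts, that is thousands of task queues and several times as many concurrent poll requests kept open by workers that are mostly idle.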
Has anyone else faced similar challenges? Would love to hear suggestions on:
- How to better manage a large number of long-lived workers/task queues.
- Best practices for reducing pressure on the DB and avoiding `ResourceExhausted` errors.
- Whether our DB instance type may be insufficient for this scale (and if it is, what the recommendation for scaling would be).
Any advice or architectural guidance would be greatly appreciated!