We have nearly 20K workflows set up as cron jobs, all scheduled to run at once at 12 AM every day.
This caused a memory spike for an extended period of time, and we concluded it's because the cached_workflows count defaults to 1000.
Bringing this down to 0 did solve the memory hogging, but it spiked CPU utilization for an extended period of time.
We then tried restricting the task slots for workflow tasks and activity tasks to a very low value (10) and setting the cached_workflows count to 10, hoping to strike a balance. But this spiked both CPU and memory.
The sticky cache (cached_workflows) hit rate is very low, and I think it's because workflow tasks and activity tasks are being fetched randomly from the entire pool of 20K triggered workflows, rather than being limited to the ones belonging to cached workflows.
Is there any recommended solution to this?
Is there any option to let the worker prioritize cached workflows before polling for more tasks? Or to throttle the number of triggered workflows?
If nothing works, I guess we should write a "master" workflow that does the throttling in code: it starts N workflows at a time, controls when each workflow is started (the start happens in an activity), and does not start a new one until a running workflow has completed.
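To make the "master" workflow idea concrete, here is a rough sketch of just the throttling logic in plain Python asyncio. It is not Temporal SDK code: run_child is a hypothetical stand-in for "start one workflow from an activity and wait for it to finish", and run_throttled and max_in_flight are names made up for illustration.

```python
import asyncio


async def run_child(workflow_id: str) -> str:
    # Hypothetical placeholder: in the real setup this would start one
    # workflow (e.g. from an activity) and wait until it has completed.
    await asyncio.sleep(0.01)
    return workflow_id


async def run_throttled(workflow_ids, max_in_flight, start_one=run_child):
    # The semaphore holds max_in_flight permits, so at most that many
    # children are running at any moment.
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(workflow_id):
        # Wait for a free permit; it is released when the child finishes,
        # which lets exactly one waiting child start.
        async with semaphore:
            return await start_one(workflow_id)

    # gather returns the results in the same order as workflow_ids.
    return await asyncio.gather(*(guarded(w) for w in workflow_ids))
```

Something like `asyncio.run(run_throttled(ids, 100))` would then keep at most 100 children in flight and start the next one only when a running one completes. The value 100 is only an example and would need tuning against the worker's memory and CPU limits.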
What do you guys think?