We have an application that uses Temporal to manage many parallel activities called by a single workflow. For example, we have one workflow that needs to kick off 5k activities simultaneously.
After many hours of changing settings, we cannot seem to reduce the latency before these activities start.
We’ve run a simple experiment: 1 workflow with 1k activities executed in parallel, each of which simply waits for 1 second. On average it's taking 1 minute to finish.
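As a rough sanity check on those numbers, here is a small back-of-envelope sketch in plain Python (not Temporal code; the function name `effective_parallelism` is made up for illustration). It assumes the 1 second wait dominates each activity's runtime and estimates how many activities were actually running at once on average.

```python
def effective_parallelism(num_activities: int,
                          activity_seconds: float,
                          wall_clock_seconds: float) -> float:
    """Average number of activities in flight, given total work and elapsed time.

    Hypothetical helper for illustration only; it is not part of any Temporal SDK.
    """
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    # Total activity-seconds of work divided by the observed wall-clock time.
    return num_activities * activity_seconds / wall_clock_seconds


if __name__ == "__main__":
    # Numbers from the experiment: 1k activities, 1 s each, about 60 s in total.
    observed = effective_parallelism(1000, 1.0, 60.0)
    print(f"effective parallelism: {observed:.1f}")
```

With the experiment's numbers this works out to roughly 17 activities in flight on average, far below the 1k we expected, which seems consistent with something serializing the activity starts rather than the activities themselves being slow.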
Things we’ve tried without success:
- increasing the number of workers
- increasing the number of pollers
- increasing the number of history shards
- increasing the concurrency limit on activities
No matter what we change, after setting up Grafana monitoring and looking at what’s happening, we see the same pattern: the 1k activities get picked up almost immediately, but then wait a very long time to start while some bottleneck gets worked through.
Any help would be greatly appreciated. I’ve now spent 2 days trying to track down this bottleneck without success.