I have a dedicated worker service for a queue. There are two workflows:
StartAllPromotionWorkflows is responsible for getting the active promotions (e.g., weekly discounts) and starting their workflows via signal-with-start, which also signals workflows that are already running. It runs every 15 minutes.
PromotionWorkflowV1 actually runs the promotion, which is outlined below.
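For context, the scheduling side looks roughly like this. This is a minimal sketch assuming the TypeScript SDK and that the signal-with-start call happens in an activity (or other code) that has a Client; the workflowId scheme, task queue name, signal name, and loadActivePromotions helper are all assumptions, not our real code:

```typescript
import { Client } from '@temporalio/client';
import { promotionWorkflowV1 } from './workflows';
import { loadActivePromotions } from './db';

export async function startAllPromotionWorkflows(client: Client): Promise<void> {
  // Hypothetical helper that returns the active promotions (weekly discounts, etc.).
  const promotions = await loadActivePromotions();

  for (const promo of promotions) {
    // signalWithStart starts the workflow if it isn't running yet and only
    // delivers the signal if it already is, so re-running this every
    // 15 minutes is idempotent per workflowId.
    await client.workflow.signalWithStart(promotionWorkflowV1, {
      workflowId: `promotion-${promo.id}`,
      taskQueue: 'promotions',
      args: [promo.id],
      signal: 'promotionUpdated', // hypothetical signal name
      signalArgs: [promo.startsAt, promo.endsAt],
    });
  }
}
```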
PromotionWorkflowV1
1. Load the data from the database.
2. Sleep until the promotion starts.
3. Update the database to start the promotion (e.g., change the status field).
4. Sleep anywhere from a few days to a couple of years until the promotion ends.
5. Update the row in the database.
6. End.
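In code, that outline is roughly the following. This is a minimal sketch in the TypeScript SDK using plain sleep(); the signal handling / UpdatableTimer piece is left out, and the activity names, field names, and timeout value are assumptions:

```typescript
import { proxyActivities, sleep } from '@temporalio/workflow';
import type * as activities from './activities';

const { loadPromotion, markPromotionStarted, markPromotionEnded } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '1 minute',
  });

export async function promotionWorkflowV1(promotionId: string): Promise<void> {
  // 1. Load the data from the database.
  const promo = await loadPromotion(promotionId);

  // 2. Sleep until the promotion starts. Date.now() is deterministic inside
  //    workflow code, so it is safe to compute the delay here.
  const msUntilStart = new Date(promo.startsAt).getTime() - Date.now();
  if (msUntilStart > 0) {
    await sleep(msUntilStart);
  }

  // 3. Update the database to start the promotion (e.g., change the status field).
  await markPromotionStarted(promotionId);

  // 4. Sleep (days to a couple of years) until the promotion ends.
  const msUntilEnd = new Date(promo.endsAt).getTime() - Date.now();
  if (msUntilEnd > 0) {
    await sleep(msUntilEnd);
  }

  // 5. Update the row in the database.
  await markPromotionEnded(promotionId);
}
```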
We have about 475 of these workflows running and are seeing activities that don’t start promptly after being scheduled (sometimes many hours later, if at all). We are currently using UpdatableTimer, but we saw the same issue with a standard sleep.
CPU utilization is rarely above 1%. We’ve increased max concurrent activities from 100 to 200, and now to 1,000, yet we still reach zero available activity slots within a couple of hours.
How can we resolve this? Continuing to increase maxConcurrentActivityTaskExecutions seems wrong, and it’s a great way to end up with another incident the next time we forget to raise it.
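For reference, this is the knob we keep raising. A sketch of the worker setup, assuming the TypeScript SDK; the task queue name and module paths are assumptions, and 1000 is just the value we are currently at, not a recommendation:

```typescript
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

async function run(): Promise<void> {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'promotions',
    // Raised from the default 100 to 200, then 1000; the slots still fill up
    // within a couple of hours because stuck activities never release them.
    maxConcurrentActivityTaskExecutions: 1000,
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```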
Timeouts are in place so we have something to alert on. Without a timeout, we would never know an activity is stuck until it finally starts and emits a schedule-to-start latency metric.
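The timeout setup looks roughly like this (a sketch, assuming the TypeScript SDK; the durations and activity names are illustrative, not our real values):

```typescript
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { markPromotionStarted, markPromotionEnded } = proxyActivities<typeof activities>({
  // Fires if no worker picks the task up in time, so a stuck-in-queue activity
  // surfaces as a timeout we can alert on instead of waiting for it to
  // eventually start and report its schedule-to-start latency.
  scheduleToStartTimeout: '10 minutes',
  // Bounds the attempt itself once a worker does pick it up.
  startToCloseTimeout: '1 minute',
});
```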
When pods are restarted, every worker starts with all of its activity slots available. Over time, activities that never return consume more and more slots. A worker stops polling for activity tasks once it has no free slots, so it effectively “disappears” from the service.