I’m currently experiencing an issue with my activity workers not recovering their task slots, despite having zero workflows in progress.
Here’s a brief overview of my setup: I have 8 activity workers deployed, each configured with a MaxConcurrentActivityExecutionSize of 30. Over time, I’ve observed that the task slots are being utilized but they never seem to recover back to 30, even when there is no apparent workload.
More critically, if I don’t manually switch off workflow submission, this situation progressively worsens to the point where all workers end up with zero available task slots. This leaves them incapable of picking up any new work, which completely halts my operation until I manually redeploy the workers.
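For context, the workers are created roughly like this. This is a sketch rather than my exact code: MaxConcurrentActivityExecutionSize is the real option name in go.temporal.io/sdk/worker, but the task queue name and the helper function are placeholders.

```go
import (
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// startWorker is illustrative; each of the 8 deployed workers is set up this way.
func startWorker(c client.Client) error {
	w := worker.New(c, "my-task-queue", worker.Options{
		// Each worker gets 30 concurrent activity task slots.
		MaxConcurrentActivityExecutionSize: 30,
	})
	w.RegisterActivity(MyActivity) // activities are registered here
	return w.Run(worker.InterruptCh())
}
```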
I’ve attached two screenshots to illustrate the issue more clearly. The first shows the current state of my system with no workflows in progress (achieved by manually turning off workflow submission). The second shows that, even so, the workers’ available task slots remain reduced and never recover to the maximum of 30.
I’ve been wracking my brain trying to figure out why this is happening and how to rectify it. Could this be a configuration issue? Or maybe something to do with how tasks are being managed internally?
Any suggestions or insights would be greatly appreciated. Thanks in advance for your help!
This is consistent with my suspicions. I believe I’ve ruled that out, though, as I’ve designed a wrapper for my activity code that guarantees it will always return even if there’s blocking code in the business logic.
While this solution comes with the trade-off of potentially orphaning a goroutine that is still executing the actual work, it seemed necessary in order to rule out the possibility that the activity never completes.
Here’s a simplified activity code that I’ve wrapped in this manner:
func MyActivity(input MyInput) (*MyOutput, error) {
	deadline := time.Now().Add(activityStartToCloseTimeout * 9 / 10) // 90% of the configured start-to-close timeout
	resultChan := make(chan *MyOutput, 1)
	errChan := make(chan error, 1)
	go func() {
		result, err := run(input) // run is the actual activity implementation
		if err != nil {
			errChan <- err
			return
		}
		resultChan <- result
		// Note: the channels are buffered and deliberately never closed.
		// Closing them here would race with the select below: a receive from
		// a closed channel is always ready, so a zero value could win over
		// the real result or error.
	}()
	// Wait for the activity to complete (success or error), or for the deadline to expire.
	select {
	case result := <-resultChan:
		return result, nil
	case err := <-errChan:
		return nil, err
	case <-time.After(time.Until(deadline)):
		return nil, temporal.NewNonRetryableApplicationError("activity non retriable timeout", "ErrTimeout", errors.New("canceled due to timeout"))
	}
}
I hope this gives a clearer picture of my problem. If you could provide further insights or suggestions, I’d appreciate it. Thanks again for your help!
Side Question: Can a workflow run be closed when its activity is never completed?
I have profiling setup with Datadog, and I’ve been closely monitoring it, but so far, I haven’t noticed anything particularly unusual.
Considering the ongoing issues, I decided to start afresh with a new deployment and schema for the server. I was concerned that perhaps there was a misstep in the initial setup that could be causing this problem. The issue has not yet recurred 24 hours into this new setup. However, I understand it might still be too early to conclude anything definitive.
While I continue monitoring the situation, I would like to clarify a key point: assuming my activity functions return as designed, is it theoretically possible for a task slot to remain occupied? Or can we confidently say that a function’s return always frees its associated task slot? Could the SDK be waiting for all goroutines to finish before it frees the slot?
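On the goroutine question, one thing is certain at the language level: a Go function’s return does not wait for goroutines it spawned; they simply keep running. Whether the SDK additionally tracks anything before freeing a slot is an internal question I can’t answer, but here is a quick stdlib demonstration (illustrative names) of the orphaning behavior my wrapper relies on:

```go
package main

import (
	"fmt"
	"time"
)

// spawnAndReturn starts a background goroutine and returns immediately;
// the goroutine outlives the call and signals done later.
func spawnAndReturn(done chan struct{}) string {
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(done)
	}()
	return "returned" // does not wait for the goroutine
}

func main() {
	done := make(chan struct{})
	msg := spawnAndReturn(done)
	select {
	case <-done:
		fmt.Println("goroutine finished first (unexpected)")
	default:
		fmt.Println(msg, "while the goroutine is still running")
	}
	<-done // the orphaned goroutine completes on its own
	fmt.Println("goroutine finished after the function returned")
}
```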
I managed to identify the culprit. It turned out that the contention was happening in a completely different activity from where I initially suspected. I was pretty much looking in the wrong place.
This rogue activity was indeed blocking, just as you suggested. I’ve applied the same workaround I posted above and the task slots are now freeing up as expected after the activities return.