Activity Workers Not Recovering Task Slots, Eventually Leading to Zero Availability

Hello everyone,

I’m currently experiencing an issue with my activity workers not recovering their task slots, despite having zero workflows in progress.

Here’s a brief overview of my setup: I have 8 activity workers deployed, each configured with a MaxConcurrentActivityExecutionSize of 30. Over time, I’ve observed that task slots get used but the number of available slots never recovers back to 30, even when there is no apparent workload.
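For reference, each worker is created roughly like this (a simplified sketch; the client setup and the task queue name "my-task-queue" are placeholders, only the MaxConcurrentActivityExecutionSize value matters here):

import (
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func newActivityWorker(c client.Client) worker.Worker {
	// Each of the 8 deployed workers is capped at 30 concurrent activity executions.
	return worker.New(c, "my-task-queue", worker.Options{
		MaxConcurrentActivityExecutionSize: 30,
	})
}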

More critically, if I don’t manually switch off workflow submission, this situation progressively worsens to the point where all workers end up with zero available task slots. This leaves them incapable of picking up any new work, which completely halts my operation until I manually redeploy the workers.

I’ve attached two screenshots to illustrate the issue. The first shows the current state of my system, with no workflows in progress (achieved by manually turning off workflow submission). The second shows that the workers’ available task slots are still reduced and never recover to the maximum of 30.

I’ve been wracking my brain trying to figure out why this is happening and how to rectify it. Could this be a configuration issue? Or maybe something to do with how tasks are being managed internally?

Any suggestions or insights would be greatly appreciated. Thanks in advance for your help!


SDK Version: v1.23.0
Temporal Server Version: v1.20.2

It usually happens when an activity implementation never completes and thus never releases a task slot to the worker.
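As a purely illustrative sketch (not taken from your code), an activity shaped like this would hold its slot indefinitely, because the slot is only released when the activity function returns:

func StuckActivity() error {
	block := make(chan struct{}) // nothing ever sends on or closes this channel
	<-block                      // blocks forever, so the function never returns
	return nil                   // never reached; the task slot is never released
}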

I would recommend getting a goroutine dump to see why the activities are stuck.
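If the worker binary does not already expose pprof, one option is to import net/http/pprof and read the goroutine endpoint (a sketch; the listen address is arbitrary):

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func startPprofServer() {
	go func() {
		// Full goroutine stacks are then available at
		// http://localhost:6060/debug/pprof/goroutine?debug=2
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

Alternatively, sending SIGQUIT to the process makes the Go runtime print all goroutine stacks to stderr before exiting.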

Thank you for your response.

This is consistent with my suspicions. I believe I’ve ruled that out, though, as I’ve designed a wrapper for my activity code that guarantees it will always return even if there’s blocking code in the business logic.

While this approach comes with the trade-off of potentially orphaning the goroutine executing the actual work, it seemed necessary in order to rule out the possibility that the activity is not completing.

Here’s a simplified version of an activity wrapped in this manner:

import (
	"errors"
	"time"

	"go.temporal.io/sdk/temporal"
)

func MyActivity(input MyInput) (*MyOutput, error) {
	// Give the business logic 90% of the configured start-to-close timeout.
	deadline := time.Now().Add(activityStartToCloseTimeout * 9 / 10)

	// Both channels are buffered so the goroutine can deliver its outcome and exit
	// even if this function has already returned on timeout. They are deliberately
	// never closed: closing the unused channel would make its case in the select
	// below ready with a zero value, which could mask the real result or error.
	resultChan := make(chan *MyOutput, 1)
	errChan := make(chan error, 1)

	go func() {
		result, err := run(input) // run is the actual activity implementation
		if err != nil {
			errChan <- err
			return
		}
		resultChan <- result
	}()

	// Wait for the activity to complete (success or error), or for the deadline to expire.
	select {
	case result := <-resultChan:
		return result, nil
	case err := <-errChan:
		return nil, err
	case <-time.After(time.Until(deadline)):
		return nil, temporal.NewNonRetryableApplicationError("activity non retriable timeout", "ErrTimeout", errors.New("canceled due to timeout"))
	}
}

I hope this gives a clearer picture of my problem. If you could provide further insights or suggestions, I’d appreciate it. Thanks again for your help!

  • Side Question: Can a workflow run be closed when its activity is never completed?

The code looks reasonable, but I would still look at the goroutine dump to see if there is an issue.

Thank you for your feedback, Maxim.

I have profiling set up with Datadog, and I’ve been monitoring it closely, but so far I haven’t noticed anything particularly unusual.

Considering the ongoing issues, I decided to start afresh with a new deployment and schema for the server. I was concerned that perhaps there was a misstep in the initial setup that could be causing this problem. The issue has not yet recurred 24 hours into this new setup. However, I understand it might still be too early to conclude anything definitive.

While I continue monitoring the situation, I would like to clarify a key point: Assuming my activity functions return as designed, is it theoretically possible for a task slot to remain occupied? Or, can we confidently assert that the function’s return should always correspond to the freeing of its associated task slot? Maybe the SDK is waiting for all goroutines to finish before it frees the slot?

> Assuming my activity functions return as designed, is it theoretically possible for a task slot to remain occupied?

That would be a bug that I’m not aware of.

I would still recommend performing a goroutine dump of the process to troubleshoot.

I managed to identify the culprit. It turned out that the contention was happening in a completely different activity from the one I initially suspected; I was simply looking in the wrong place.

This rogue activity was indeed blocking, just as you suggested. I’ve applied the same workaround I posted above, and the task slots are now freeing up as expected once the activities return.