I have multiple different workers each of which is set to execute different sets of activities.
One of these workers has maxConcurrency set to 6, and I run 10 copies of this worker.
I am tracking the available-activity-slots gauge metric for this worker; it starts at 6*10 = 60 and fluctuates during execution as expected.
But over time there's a gradual decline in the count, and after a while the total number of slots never recovers to 60. After a few days the number drops to zero and this worker doesn't pick up any tasks to execute…
This doesn't happen to the other worker types that I have. What could be the issue with this particular worker, and are there any flags I can enable to better debug this?
sdk version: go.temporal.io/sdk v1.12.0
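For context, a sketch of the setup described above, assuming the Go SDK's `worker.Options` (the task queue name and client variable here are placeholders, not from the original post). `MaxConcurrentActivityExecutionSize` is the option that caps concurrent activity slots per worker process:

```go
// Assumed setup: 10 processes each running a worker like this,
// giving 6 * 10 = 60 total activity slots across the fleet.
w := worker.New(c, "my-task-queue", worker.Options{
	MaxConcurrentActivityExecutionSize: 6,
})
```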
This is a new metric and it is quite possible there is a bug in the calculation. I will investigate. Can you confirm all of your activity functions are properly terminating?
It’s not a problem with the metric. The workers just stop doing things after a couple of days.
I suspect it’s an issue with the activity. But I don’t see any PanicError or panic. I’ve also checked my logs for the error strings in the processTask method.
I see; glad the new metric is helping surface this.
Is it possible that your activity is hanging/deadlocked in any way? Is it possible you have activity functions that never return?
The context passed into the activity will be cancelled when the activity is cancelled provided you are heartbeating regularly. Heartbeating inside your activity is strongly recommended (coupled with a heartbeat timeout in the activity options) to be notified of cancellations and otherwise inform the server that the activity is still executing.
No heartbeat, but I do have a 15-minute start-to-close timeout on the activities in question.
Wouldn’t the activity slot be restored after 15 minutes even if there’s a deadlock/the activity doesn’t return?