We are having some issues with workers. The environment is used for tests, so load is very limited. We need some guidance on how to check what happened.
On Friday we noticed that some workflows were “stuck” in activities that shouldn’t fail (the activities do not depend on external APIs yet, so they shouldn’t fail or time out). Over the weekend, the task was finally processed, but it took 13h to complete.
The number of attempts makes sense given the 1h scheduleToStartTimeout and the 1h startToCloseTimeout, but the reason for being “stuck” is a mystery.
This morning we continued looking for an answer and noticed that new workflows are not being processed by the worker.
It seems that the worker is not handling workflow tasks.
I don’t think we can help with the activity timeout. You have to understand what causes it. I would recommend looking into the activity worker logs to see why the activity never completes. Apparently, its threads are stuck on something, so a thread dump might help to troubleshoot.
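In case it helps, here is a minimal sketch of capturing a thread dump from inside the worker JVM using only the standard `java.lang.management` API. The class and method names (`ThreadDumpSketch`, `dumpAllThreads`) are made up for illustration; how you trigger the dump (a debug endpoint, a scheduled log line, etc.) is up to you.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {

    /** Returns the name, state and full stack trace of every live thread in this JVM. */
    public static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked-monitor and locked-synchronizer details,
        // which keeps the call cheap and supported on every JVM.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append('"')
               .append(" state=").append(info.getThreadState()).append('\n');
            // Iterate the frames manually; ThreadInfo.toString() truncates the stack.
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

Running `jstack <pid>` against the worker process from a shell should give you equivalent information without any code change. Either way, look for activity threads that sit in the same frame across several dumps taken a few seconds apart.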
As for the second issue:
How do you initialize your worker? Do you have a single worker for both activities and workflows?
I would also be careful with MaxConcurrentActivityExecutionSize, which might choke the activity execution pool.
I would recommend omitting properties that you don’t plan to change.
Apparently the worker wasn’t doing anything, but the issue disappeared for new workflows after changing the activity timeout parameters: the heartbeat, which was missing, was added, and ScheduleToStart was removed and ScheduleToClose set instead (StartToClose was already set).
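For anyone landing here later, the change looks roughly like the sketch below. This assumes the Temporal Java SDK (`io.temporal.activity.ActivityOptions`); the thread does not say which SDK is in use, and the class name is hypothetical. Only the 1h StartToClose value comes from this thread; the other durations are placeholders.

```java
import io.temporal.activity.ActivityOptions;
import java.time.Duration;

public class ActivityTimeoutsSketch {

    /** Activity options with ScheduleToClose instead of ScheduleToStart, plus a heartbeat timeout. */
    public static ActivityOptions options() {
        return ActivityOptions.newBuilder()
            // Upper bound for the whole activity, including queue time and retries (placeholder value).
            .setScheduleToCloseTimeout(Duration.ofHours(2))
            // Upper bound for a single attempt once a worker has picked it up (1h, as before).
            .setStartToCloseTimeout(Duration.ofHours(1))
            // Maximum gap allowed between heartbeats reported by the activity (placeholder value).
            .setHeartbeatTimeout(Duration.ofSeconds(30))
            // ScheduleToStart is deliberately left unset.
            .build();
    }
}
```

Note that a heartbeat timeout only helps if the activity implementation actually reports heartbeats while it runs; otherwise the attempt is timed out once the interval elapses.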