@nithin I was able to reproduce your problem locally. It looks like this issue can occur when activity handlers cannot keep up with pollers; in your case you use the default number of pollers (5) and reduce the number of handlers from the default (200) to 1. What happens is that activity tasks get polled and the server starts counting time toward the activity heartbeat, but instead of being processed right away the tasks wait for handler capacity to become available, and by the time capacity finally frees up it can already be too late because the heartbeat timeout may have passed. We consider this a bug and it will be fixed in the coming release.
Meanwhile, for temporary relief, you may try setting the setActivityPollThreadCount(1) property in your worker options, which should reduce the frequency of the exception or even make it go away. A sketch follows.
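A minimal sketch of that workaround, assuming the Temporal Java SDK; the task queue name and the commented-out activity registration are placeholders, not part of the original suggestion:

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class WorkerSetup {
    public static void main(String[] args) {
        WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
        WorkflowClient client = WorkflowClient.newInstance(service);
        WorkerFactory factory = WorkerFactory.newInstance(client);

        // Use a single activity poller so fewer tasks sit in the local queue
        // while their heartbeat timers are already running on the server.
        WorkerOptions options = WorkerOptions.newBuilder()
                .setActivityPollThreadCount(1)
                .build();

        Worker worker = factory.newWorker("YOUR_TASK_QUEUE", options); // task queue name is a placeholder
        // worker.registerActivitiesImplementations(new YourActivitiesImpl());
        factory.start();
    }
}
```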
@Vitaly Thanks for looking more into this. I tried setting setActivityPollThreadCount(1) a couple of days back. This reduced the number of errors, but we are still seeing them.
That’s to be expected; the number of errors in this case is proportional to the number of pollers, so by changing it from 5 to 1 you reduce them roughly by a factor of 5. I’m working on a proper fix.
My expectation is that you should no longer be blocked by this issue, as heartbeating early on would simply fail the activity execution that was queued for too long, which should result in a retry with no actual work being done in the failed attempt.
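To illustrate the idea, a rough sketch of heartbeating at the start of the activity so a stale attempt fails fast. This assumes the Temporal Java SDK; the MyActivities interface is a hypothetical example, and whether the very first heartbeat call surfaces the timeout synchronously depends on the SDK's heartbeat batching:

```java
import io.temporal.activity.Activity;
import io.temporal.activity.ActivityInterface;
import io.temporal.client.ActivityCompletionException;

@ActivityInterface
interface MyActivities { // hypothetical activity interface, for illustration only
    void process(String input);
}

public class MyActivitiesImpl implements MyActivities {
    @Override
    public void process(String input) {
        // Heartbeat before doing any real work. If this task waited in the local
        // queue past its heartbeat timeout, the server no longer recognizes the
        // attempt, the SDK throws ActivityCompletionException, and the attempt
        // fails fast so the retry policy can schedule a fresh one.
        try {
            Activity.getExecutionContext().heartbeat(null);
        } catch (ActivityCompletionException e) {
            throw e; // do not swallow: let the failed attempt be retried
        }

        // ... actual work, heartbeating periodically ...
    }
}
```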
Unfortunately we heartbeat in a separate wrapper thread, so right now we don't kill the activity when the heartbeat fails.
So to summarize, there are 2 issues here:
When activity_poll_thread_count is greater than concurrent_executions, the poller fetches more tasks than the handler can process, and some of them time out. We can fix this on our side by setting activity_poll_thread_count <= concurrent_executions (see the sketch after this summary).
The poller keeps polling tasks even though the handler has not completed activity execution. This is the bug that you'll be fixing in the next release.
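A short sketch of the first point, keeping the poller count at or below the handler capacity. Again this assumes the Temporal Java SDK worker options; the values mirror the single-handler setup discussed in this thread:

```java
import io.temporal.worker.WorkerOptions;

public class PollerVsHandlerConfig {
    // Keep activity pollers <= concurrent activity executions so polled tasks
    // don't queue up behind busy handlers while their heartbeat timers run.
    static WorkerOptions options() {
        return WorkerOptions.newBuilder()
                .setMaxConcurrentActivityExecutionSize(1) // the reduced handler count from this thread
                .setActivityPollThreadCount(1)            // pollers must not exceed the value above
                .build();
    }
}
```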