We are deploying a worker that hosts an activity, with MaxConcurrentWorkflowTaskPollers set to 1. We do this because this activity is memory intensive and we do not want another instance of it to start while one is already running for another workflow. The activity sends a heartbeat every few seconds and is cancelable.
The problem is that when the workflow gets cancelled, the activity is properly cancelled, but the worker is never assigned to any of the other waiting workflows that need the same activity. The odd part is that the Pollers page under “task-queues” in the UI shows ACTIVITY HANDLER checked and reports the activity worker as available, yet no new work is assigned to the worker whose activity was just cancelled. This continues until the Temporal server stops listing the activity worker under Pollers altogether (after maybe 10-15 min), even though the worker process is still alive and looks healthy.
We see a similar situation when the activity takes a long time to complete (about an hour). The activity finishes and the workflow completes, but no more work is assigned to that worker afterwards.
While the activity is running, the last event is ActivityTaskScheduled and the state is PENDING_ACTIVITY_STATE_STARTED. Only when the activity finishes does ActivityTaskStarted show up in the UI, together with the events that follow it.
1> What could be the cause of this?
2> Is there any way for the worker to check its status with the Temporal server and re-register itself, or something similar?
No, this is the only worker polling this queue. (We want to make sure a single worker behaves as expected before scaling up.)
Yes, there are many other workflows that need the same activity on this worker. The worker is supposed to pick them up after its activity is cancelled, but it does not.
You mentioned that “the activity gets properly cancelled”. How do you ensure this? The only way for an activity to be properly canceled is for it to heartbeat and, from its body, rethrow the ActivityCanceledException thrown by the heartbeat method.
I think we should try to reproduce this issue with a minimal sample.
A slightly orthogonal but important question: if you want to limit the number of activities that run at once, why don’t you use setMaxConcurrentActivityExecutionSize instead? Limiting the number of workflow task pollers doesn’t sound like the right approach.
My apologies for pasting the wrong parameter into the question. We do set MaxConcurrentActivityExecutionSize to 1 and do not touch MaxConcurrentWorkflowTaskPollers.
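For reference, a minimal sketch of how that limit is typically set in the Go SDK. The task queue name "memory-heavy-queue" is a placeholder and the activity registration is left out, so this is not the poster's actual worker setup. It also assumes an SDK version that provides client.Dial; older versions use client.NewClient instead.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect with default options (local server, default namespace).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// "memory-heavy-queue" is a hypothetical task queue name.
	w := worker.New(c, "memory-heavy-queue", worker.Options{
		// Run at most one activity at a time on this worker.
		MaxConcurrentActivityExecutionSize: 1,
	})

	// The memory-intensive activity would be registered here, e.g.
	// w.RegisterActivity(MemoryHeavyActivity) (hypothetical name).

	// Block until the process receives an interrupt signal.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```

This limit is set per worker process, so it only keeps the activity to a single concurrent execution overall while one worker polls the queue.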
Yes, I have been sending an incrementing counter using activity.RecordHeartbeat(ctx, myCounter), and the UI shows the Temporal server receiving the heartbeat counter (its value keeps increasing).
I feel there might be some misunderstanding here. Should we follow up on Slack/Zoom and debug your issue together? Please ping me directly on our company Slack and we can go from there.