Worker Configuration

We are deploying a worker that hosts an activity whose MaxConcurrentWorkflowTaskPollers is set to 1. We do this because this specific activity is memory-intensive and we do not want any other activity to be executed while one is already running for another workflow. The activity sends a heartbeat every few seconds and it is cancelable.

Now the problem is that if the workflow gets cancelled, the activity gets properly cancelled, but the worker is never assigned to any of the other waiting workflows (which need the same activity). The weird part is that when we look at the Pollers page for the task queue in the UI, ACTIVITY HANDLER is checked and the activity worker is reported as available, yet no new work is assigned to the worker whose activity just got cancelled. The situation stays the same until the Temporal server completely stops reporting the activity worker in the Pollers (after maybe 10-15 min), even though the worker process is alive and looks healthy.

Also, we see a similar situation if the activity takes a long time to complete (like an hour). In that case the activity finishes and the workflow completes, but no more work is assigned to that worker afterwards.

While the activity is running, the last event is ActivityTaskScheduled and the state is PENDING_ACTIVITY_STATE_STARTED. Only when the activity finishes does ActivityTaskStarted show up in the UI, together with the events that follow it.

1> What could be the cause of this?

2> Is there any way for the worker to check its status with the Temporal server and re-register itself with the server or something?

We are running on the latest Master branch.

^ This is expected behavior.
PENDING_ACTIVITY_STATE_STARTED means the activity is still pending and its current state is "started".

Are there other workers polling the same queue? Are there any new activities waiting to be executed by the worker?

Please try not to use the master branch, ever.
The latest release version is 1.6.3.

No, this is the only worker polling this queue. (We want to make sure a single worker works as expected before scaling this up.)

Yes, there are a whole bunch of other workflows that need the same activity on this worker. The worker is supposed to pick them up after its activity is cancelled, but it does not.

You mentioned that “the activity gets properly cancelled”. How do you ensure this? The only way for an activity to get properly canceled is for it to heartbeat and, from its body, rethrow the ActivityCanceledException thrown by the heartbeat method.

Here is the relevant sample.

Our code base is Go. We followed this example.

We do call activity.RecordHeartbeat(ctx, …) from the activity but stop sending the heartbeat once the activity is cancelled.

Would you mind using our public Slack channel so we can quickly debug this?

I think we should try to reproduce this issue with a minimal sample.
A slightly orthogonal but important question: if you want to limit the number of activities that you run, why don’t you use setMaxConcurrentActivityExecutionSize instead? Limiting the number of workflow task pollers doesn’t sound like the right approach.

+1 to @Vitaly's point. MaxConcurrentWorkflowTaskPollers doesn’t limit the number of parallel workflows and activities running on a worker at all.

My apologies for pasting the wrong parameter into the question. We do indeed set MaxConcurrentActivityExecutionSize to 1 and do not touch MaxConcurrentWorkflowTaskPollers.

Have you confirmed that the activity is handling heartbeating correctly?

Yes, I have been sending an incrementing counter using activity.RecordHeartbeat(ctx, myCounter), and the UI shows that the Temporal server is receiving it (the heartbeat counter keeps increasing).

I feel that there might be some misunderstanding here. Should we follow up on Slack/Zoom and try to debug your issue together? Please ping me directly on our company Slack and we can go from there.