Worker not picking scheduled WorkflowTask

We are running Temporal Java SDK 1.7.1 against server 1.12.0 and seeing intermittent issues with workflows not getting picked up by the worker. We can see the worker listening to the task queue on the task-queues page in the web UI, but the worker still skips some workflows and picks up others, seemingly at random. A stuck workflow is not picked up even after restarting the worker pod. This happens with no real load on the system; we process no more than 5 requests per minute.

The poll threads are configured as:

setWorkflowPollThreadCount(5)
setActivityPollThreadCount(10)
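For context, the two setters above live on `WorkerOptions.Builder` in the Java SDK. A minimal sketch of a worker set up with them might look like the following (the task queue name and the commented-out workflow implementation class are placeholders, not from the original post):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class WorkerStarter {
  public static void main(String[] args) {
    // Connect to the Temporal frontend (defaults to 127.0.0.1:7233).
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // Poll thread counts as configured above.
    WorkerOptions options =
        WorkerOptions.newBuilder()
            .setWorkflowPollThreadCount(5)
            .setActivityPollThreadCount(10)
            .build();

    // "MY_TASK_QUEUE" is a placeholder.
    Worker worker = factory.newWorker("MY_TASK_QUEUE", options);
    // worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
    factory.start();
  }
}
```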

Can you show the workflow history of these stuck workflows?
Are your workers reporting any errors?
Do you have SDK metrics set up?
Here is a worker tuning guide that might help as well.

It’s hard to tell what could be going on without getting some more info (wf history, metrics, errors if any, …)

The workflow history just shows 2 events:

WorkflowExecutionStarted

WorkflowTaskScheduled
  startToCloseTimeout: 10s
  attempt: 1

The task is scheduled on the task queue where the worker is polling. I see no errors in the worker logs or the history service logs.
We haven’t set up SDK metrics yet; is there any documentation you can point us to for this?

Is your worker running? Is it polling the task queue that’s set in your WorkflowOptions when your client requests workflow execution? Is the WorkflowClient you pass to WorkerFactory using the same namespace as the WorkflowClient you use to start the execution (WorkflowClientOptions.Builder#setNamespace)?
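One way to rule out the namespace/task-queue mismatches listed above is to build a single `WorkflowClientOptions` and a single `WorkflowClient`, and use it for both the WorkerFactory and the starter. A rough sketch (the namespace and task queue names are placeholders):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.WorkerFactory;

public class NamespaceCheck {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();

    // One options object, one client: worker and starter share the namespace.
    WorkflowClientOptions clientOptions =
        WorkflowClientOptions.newBuilder().setNamespace("my-namespace").build();
    WorkflowClient client = WorkflowClient.newInstance(service, clientOptions);

    // Worker side: polls MY_TASK_QUEUE in "my-namespace".
    WorkerFactory factory = WorkerFactory.newInstance(client);
    factory.newWorker("MY_TASK_QUEUE");
    factory.start();

    // Starter side: must target the exact task queue the worker polls.
    WorkflowOptions wfOptions =
        WorkflowOptions.newBuilder().setTaskQueue("MY_TASK_QUEUE").build();
    // MyWorkflow stub = client.newWorkflowStub(MyWorkflow.class, wfOptions);
  }
}
```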

Here is a java metrics sample you can use to get started.
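For readers landing here: a minimal SDK-metrics setup, roughly following the pattern in the Java samples, wires a Tally scope backed by a Micrometer Prometheus registry into the service stubs (dependency versions and the HTTP scrape endpoint are left out of this sketch):

```java
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class MetricsSetup {
  public static void main(String[] args) {
    // Micrometer registry that exposes metrics in Prometheus format.
    PrometheusMeterRegistry registry =
        new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    // Tally scope reporting into the Micrometer registry every 10s.
    Scope scope =
        new RootScopeBuilder()
            .reporter(new MicrometerClientStatsReporter(registry))
            .reportEvery(com.uber.m3.util.Duration.ofSeconds(10));

    // Attach the scope to the service stubs; SDK metrics flow through it.
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newInstance(
            WorkflowServiceStubsOptions.newBuilder().setMetricsScope(scope).build());

    // registry.scrape() returns the Prometheus exposition text; serve it on
    // an HTTP endpoint of your choice for Prometheus to scrape.
  }
}
```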

Yes, the worker is running: it does process other requests and we can see it on the task-queues page in Temporal Web. Yes, the client and worker both use the same namespace. For example, if the client starts 10 workflows, 1 or 2 of them get stuck like this; the others are picked up by the worker and processed just fine.
The frequency of these calls is no more than 1 call per second.

We enabled the Java metrics but can’t find any data that would help us with this issue.
Is there anything else you think we should tune? This seems to be failing a very basic use case.

There are some workflow task processing metrics that could be helpful:

  1. temporal_workflow_task_queue_poll_succeed
     here is a Grafana query that you could use:
     sum by (temporal_namespace, workflow_type) (rate(temporal_workflow_task_queue_poll_succeed[5m]))

  2. temporal_workflow_task_schedule_to_start_latency
     sample query:
     histogram_quantile(0.95, sum by (le) (rate(temporal_workflow_task_schedule_to_start_latency_bucket[5m])))

  3. temporal_workflow_task_execution_failed
     sample query:
     rate(temporal_workflow_task_execution_failed[5m])

  4. temporal_workflow_task_execution_latency
     sample query:
     histogram_quantile(0.95, sum by (le) (rate(temporal_workflow_task_execution_latency_bucket[5m])))

  5. temporal_workflow_task_replay_latency
     sample query:
     histogram_quantile(0.95, sum by (le) (rate(temporal_workflow_task_replay_latency_bucket[5m])))

Might also look into temporal_sticky_cache_total_forced_eviction
sample query:
rate(temporal_sticky_cache_total_forced_eviction[5m])

There are similar metrics for activity tasks as well. Hope this helps.


From the workflow history you described, it looks as if you didn’t register the workflow type with the workers, but you are saying that’s not the case.

worker skips some workflows and picks others randomly

of same or different workflow type?

could you share the workflow history as well?

@tihomir - Thanks. After looking at the metrics you suggested and tweaking the worker options, we no longer see this issue.

setWorkflowCacheSize(1200)
setMaxWorkflowThreadCount(2400)
setMaxConcurrentLocalActivityExecutionSize(1000)
setMaxConcurrentActivityExecutionSize(1000)
setMaxConcurrentWorkflowTaskExecutionSize(1000)
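A note for readers applying the same tuning: these setters are split across two builders in the Java SDK. The cache size and workflow thread count are on `WorkerFactoryOptions.Builder`, while the concurrent execution sizes are on `WorkerOptions.Builder`. A sketch of how they fit together (the task queue name is a placeholder):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class TunedWorker {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Factory-level tuning: workflow cache and workflow thread pool.
    WorkerFactoryOptions factoryOptions =
        WorkerFactoryOptions.newBuilder()
            .setWorkflowCacheSize(1200)
            .setMaxWorkflowThreadCount(2400)
            .build();
    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

    // Worker-level tuning: concurrent task execution limits.
    WorkerOptions workerOptions =
        WorkerOptions.newBuilder()
            .setMaxConcurrentLocalActivityExecutionSize(1000)
            .setMaxConcurrentActivityExecutionSize(1000)
            .setMaxConcurrentWorkflowTaskExecutionSize(1000)
            .build();
    factory.newWorker("MY_TASK_QUEUE", workerOptions);
    factory.start();
  }
}
```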


Hello folks,
I am running into the same issue. We are using the temporalio/server image 1.21.4, and a worker running Java SDK 1.23.0.
We triggered the same workflow twice: the first time it executed instantly, but the second one was effectively ignored and nothing happened on the worker side after more than 1 minute (we ended up killing it). The worker logs show nothing after the first workflow execution completed.
Does it ring a bell?

It turns out that the worker was probably just lagging in polling for events and we killed the workflow too soon. We tried a second time, running a workflow right after its first execution finished, and it took ~3 minutes for the worker to pick it up (the first run is almost instantaneous).
Also, while waiting, at some point the Temporal UI warned that there was no worker available for the workflow, while in reality it was up the whole time.
Do you have any thoughts on why the polling latency could differ so much between 2 consecutive runs, given that both the Temporal server and the worker are pretty idle? Also, why does Temporal see no worker at all at some point?
