Worker not picking scheduled WorkflowTask

We are running Temporal Java SDK 1.7.1 against server 1.12.0 and seeing intermittent issues with workflows not getting picked up by the worker. We can see the worker listening to the task queue on the task-queues page in the web UI, but the worker still skips some workflows and picks up others, seemingly at random. A stuck workflow is not picked up even after restarting the worker pod. This happens with no real load on the system; we process no more than 5 requests per minute.

The poll threads are configured as:

setWorkflowPollThreadCount(5)
setActivityPollThreadCount(10)
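For context, the two setters above live on `WorkerOptions.Builder` in the Java SDK. A minimal sketch of a worker set up with them might look like the following (the task queue name and the commented-out workflow implementation class are placeholders, not from the original post):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class WorkerStarter {
  public static void main(String[] args) {
    // Connect to the Temporal frontend (defaults to 127.0.0.1:7233).
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // Poll thread counts as configured above.
    WorkerOptions options =
        WorkerOptions.newBuilder()
            .setWorkflowPollThreadCount(5)
            .setActivityPollThreadCount(10)
            .build();

    // "MY_TASK_QUEUE" is a placeholder.
    Worker worker = factory.newWorker("MY_TASK_QUEUE", options);
    // worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
    factory.start();
  }
}
```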

Can you show the workflow history of these stuck workflows?
Are your workers reporting any errors?
Do you have SDK metrics set up?
Here is a worker tuning guide that might help as well.

It’s hard to tell what could be going on without getting some more info (wf history, metrics, errors if any, …)

The workflow history just shows 2 events:

WorkflowExecutionStarted

WorkflowTaskScheduled
  startToCloseTimeout: 10s
  attempt: 1

The task is scheduled on the task queue where the worker is polling. I see no errors in the worker logs or the history service logs.
We haven’t set up SDK metrics yet; is there any documentation you can point us to for this?

Is your worker running? Is it polling the task queue that’s set in your WorkflowOptions when your client requests workflow execution? Is the WorkflowClient you pass to WorkerFactory using the same namespace as the WorkflowClient you use to start the execution (WorkflowClientOptions.Builder#setNamespace)?
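One way to rule out the namespace/task-queue mismatches listed above is to build a single `WorkflowClientOptions` and a single `WorkflowClient`, and use it for both the WorkerFactory and the starter. A rough sketch (the namespace and task queue names are placeholders):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.WorkerFactory;

public class NamespaceCheck {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();

    // One options object, one client: worker and starter share the namespace.
    WorkflowClientOptions clientOptions =
        WorkflowClientOptions.newBuilder().setNamespace("my-namespace").build();
    WorkflowClient client = WorkflowClient.newInstance(service, clientOptions);

    // Worker side: polls MY_TASK_QUEUE in "my-namespace".
    WorkerFactory factory = WorkerFactory.newInstance(client);
    factory.newWorker("MY_TASK_QUEUE");
    factory.start();

    // Starter side: must target the exact task queue the worker polls.
    WorkflowOptions wfOptions =
        WorkflowOptions.newBuilder().setTaskQueue("MY_TASK_QUEUE").build();
    // MyWorkflow stub = client.newWorkflowStub(MyWorkflow.class, wfOptions);
  }
}
```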

Here is a java metrics sample you can use to get started.
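For readers landing here: a minimal SDK-metrics setup, roughly following the pattern in the Java samples, wires a Tally scope backed by a Micrometer Prometheus registry into the service stubs (dependency versions and the HTTP scrape endpoint are left out of this sketch):

```java
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class MetricsSetup {
  public static void main(String[] args) {
    // Micrometer registry that exposes metrics in Prometheus format.
    PrometheusMeterRegistry registry =
        new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    // Tally scope reporting into the Micrometer registry every 10s.
    Scope scope =
        new RootScopeBuilder()
            .reporter(new MicrometerClientStatsReporter(registry))
            .reportEvery(com.uber.m3.util.Duration.ofSeconds(10));

    // Attach the scope to the service stubs; SDK metrics flow through it.
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newInstance(
            WorkflowServiceStubsOptions.newBuilder().setMetricsScope(scope).build());

    // registry.scrape() returns the Prometheus exposition text; serve it on
    // an HTTP endpoint of your choice for Prometheus to scrape.
  }
}
```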

Yes, the worker is running: it does process other requests and we can see it on the task-queues page in Temporal Web. Yes, the client and worker both use the same namespace. For example, if the client starts 10 workflows, 1 or 2 of them get stuck like this; the others are picked up by the worker and processed just fine.
The frequency of these calls is no more than 1 call per second.

We enabled the Java metrics but can’t find any data that would help us with this issue.
Is there anything else you think we should tune? This seems to be failing a very basic use case.

There are some workflow task processing metrics that could be helpful:

  1. temporal_workflow_task_queue_poll_succeed
     here is a Grafana query that you could use:
     sum by (temporal_namespace, workflow_type) (rate(temporal_workflow_task_queue_poll_succeed[5m]))

  2. temporal_workflow_task_schedule_to_start_latency
     sample query:
     histogram_quantile(0.95, sum by (le) (rate(temporal_workflow_task_schedule_to_start_latency_bucket[5m])))

  3. temporal_workflow_task_execution_failed
     sample query:
     rate(temporal_workflow_task_execution_failed[5m])

  4. temporal_workflow_task_execution_latency
     sample query:
     histogram_quantile(0.95, sum by (le) (rate(temporal_workflow_task_execution_latency_bucket[5m])))

  5. temporal_workflow_task_replay_latency
     sample query:
     histogram_quantile(0.95, sum by (le) (rate(temporal_workflow_task_replay_latency_bucket[5m])))

Might also look into temporal_sticky_cache_total_forced_eviction
sample query:
rate(temporal_sticky_cache_total_forced_eviction[5m])

There are similar metrics for activity tasks as well. Hope this helps.


From the workflow history you described, it looks as if you didn’t register the workflow type with the workers, but you are saying that’s not the case.

worker skips some workflows and picks others randomly

of same or different workflow type?

could you share the workflow history as well?

@tihomir - Thanks. After looking at the metrics you suggested and tweaking the worker options, we no longer see this issue.

setWorkflowCacheSize(1200)
setMaxWorkflowThreadCount(2400)
setMaxConcurrentLocalActivityExecutionSize(1000)
setMaxConcurrentActivityExecutionSize(1000)
setMaxConcurrentWorkflowTaskExecutionSize(1000)
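A note for readers applying the same tuning: these setters are split across two builders in the Java SDK. The cache size and workflow thread count are on `WorkerFactoryOptions.Builder`, while the concurrent execution sizes are on `WorkerOptions.Builder`. A sketch of how they fit together (the task queue name is a placeholder):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class TunedWorker {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Factory-level tuning: workflow cache and workflow thread pool.
    WorkerFactoryOptions factoryOptions =
        WorkerFactoryOptions.newBuilder()
            .setWorkflowCacheSize(1200)
            .setMaxWorkflowThreadCount(2400)
            .build();
    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

    // Worker-level tuning: concurrent task execution limits.
    WorkerOptions workerOptions =
        WorkerOptions.newBuilder()
            .setMaxConcurrentLocalActivityExecutionSize(1000)
            .setMaxConcurrentActivityExecutionSize(1000)
            .setMaxConcurrentWorkflowTaskExecutionSize(1000)
            .build();
    factory.newWorker("MY_TASK_QUEUE", workerOptions);
    factory.start();
  }
}
```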


Hello folks,
I am running into the same issue. We are using the temporalio/server image 1.21.4, and a worker running Java SDK 1.23.0.
We triggered the same workflow twice: the first time it executed instantly, but the second one was effectively ignored and nothing happened on the worker side after more than 1 minute (we ended up killing it). The worker logs show nothing after the first workflow execution completed.
Does it ring a bell?

It turns out that the worker was probably just lagging in polling for events and we killed the workflow too soon. We tried a second time, running a workflow right after its first execution finished, and it took ~3 minutes for the worker to pick it up (the first run is almost instantaneous).
Also, while waiting, at some point the Temporal UI warned that there was no worker available for the workflow, while in reality it was up the whole time.
Do you have any thoughts on why the polling latency could differ so much between 2 consecutive runs, given that both the Temporal server and the worker are pretty idle? Also, why does Temporal see no worker at all at some point?
