We are running temporal-java SDK 1.7.1 against service 1.12.0 and are seeing intermittent issues with workflows not getting picked up by the worker. We can see on the task-queues page in the web UI that the worker is listening on the task queue, but it still skips some workflows and picks up others, seemingly at random. A stuck workflow is not picked up even after restarting the worker pod. This happens when there is no real load on the system: we are processing no more than 5 requests per minute.
The poll threads are configured as -
Can you show the workflow history of these stuck workflows?
Are your workers reporting any errors?
Do you have SDK metrics set up?
Here is a worker tuning guide that might help as well.
It’s hard to tell what could be going on without getting some more info (wf history, metrics, errors if any, …)
The workflow history shows just 2 events -
on the taskQueue where the worker is polling. I see no errors in the worker logs or the history service logs.
We haven’t set up the SDK metrics yet. Is there any documentation you can point us to for this?
Is your worker running? Is it polling the task queue that's set in your WorkflowOptions when your client requests workflow execution? Is the WorkflowClient you pass to WorkerFactory using the same namespace as the WorkflowClient you use to start the execution (WorkflowClientOptions->setNamespace)?
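To make the check above concrete, here is a minimal sketch of a worker and a starter that share one namespace and one task queue. The names `my-namespace`, `my-task-queue`, `GreetingWorkflow` and `GreetingWorkflowImpl` are hypothetical and not taken from this thread. It assumes a Temporal service reachable on the default local address; `WorkflowServiceStubs.newInstance()` is the 1.7.x way to connect and may be deprecated in newer SDK versions.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

public class NamespaceAndQueueCheck {
  // Hypothetical values: both the worker and the starter must use the same ones.
  static final String NAMESPACE = "my-namespace";
  static final String TASK_QUEUE = "my-task-queue";

  @WorkflowInterface
  public interface GreetingWorkflow {
    @WorkflowMethod
    String greet(String name);
  }

  public static class GreetingWorkflowImpl implements GreetingWorkflow {
    @Override
    public String greet(String name) {
      return "Hello " + name;
    }
  }

  public static void main(String[] args) {
    // Connects to the default local service address (assumption).
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();

    // One client, one namespace, used for both the worker and the starter.
    WorkflowClient client =
        WorkflowClient.newInstance(
            service, WorkflowClientOptions.newBuilder().setNamespace(NAMESPACE).build());

    // Worker side: poll TASK_QUEUE and register the workflow type.
    WorkerFactory factory = WorkerFactory.newInstance(client);
    Worker worker = factory.newWorker(TASK_QUEUE);
    worker.registerWorkflowImplementationTypes(GreetingWorkflowImpl.class);
    factory.start();

    // Starter side: the task queue in WorkflowOptions must match the worker's.
    GreetingWorkflow workflow =
        client.newWorkflowStub(
            GreetingWorkflow.class,
            WorkflowOptions.newBuilder().setTaskQueue(TASK_QUEUE).build());
    WorkflowClient.start(workflow::greet, "World");
  }
}
```

If the starter and the worker live in separate services, the same two constants still have to agree on both sides.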
Here is a Java metrics sample you can use to get started.
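For orientation, a minimal sketch of wiring SDK metrics into the service stubs might look like the following. It assumes the `micrometer-registry-prometheus` dependency is on the classpath; the port 8077 and the `/metrics` path are arbitrary choices for illustration, and the linked sample remains the reference.

```java
import com.sun.net.httpserver.HttpServer;
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import com.uber.m3.tally.StatsReporter;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MetricsSetupSketch {
  public static void main(String[] args) throws IOException {
    // Micrometer registry that renders metrics in the Prometheus text format.
    PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    // Bridge from the SDK's tally scope to Micrometer.
    StatsReporter reporter = new MicrometerClientStatsReporter(registry);
    Scope scope =
        new RootScopeBuilder()
            .reporter(reporter)
            .reportEvery(com.uber.m3.util.Duration.ofSeconds(1));

    // Pass the scope to the service stubs; clients and workers built on these
    // stubs then emit SDK metrics through it.
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newInstance(
            WorkflowServiceStubsOptions.newBuilder().setMetricsScope(scope).build());

    // Expose a scrape endpoint for Prometheus (port and path are assumptions).
    HttpServer server = HttpServer.create(new InetSocketAddress(8077), 0);
    server.createContext(
        "/metrics",
        exchange -> {
          byte[] body = registry.scrape().getBytes(StandardCharsets.UTF_8);
          exchange.sendResponseHeaders(200, body.length);
          try (OutputStream os = exchange.getResponseBody()) {
            os.write(body);
          }
        });
    server.start();

    // Build WorkflowClient and WorkerFactory from `service` as usual from here.
  }
}
```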
Yes, correct. The worker is running: it does process other requests, and we can see it on the task-queues page in Temporal Web. Yes, the client and worker both use the same namespace. For example, if the client starts 10 workflows, 1 or 2 of them get stuck like this, while the others are picked up by the worker and processed just fine.
The frequency of these calls is no more than 1 call per second.
We enabled the Java metrics but can’t find any data that would help us with this issue.
Is there anything else you think we should tune? This seems to be failing on a very basic use case.
There are some workflow task processing metrics that could be helpful:
Here are two Grafana queries that you could use:
sum by (temporal_namespace, workflow_type) (rate(temporal_workflow_task_queue_poll_succeed[5m]))
histogram_quantile(0.95, sum by (le) (rate(temporal_workflow_task_replay_latency_bucket[5m])))
Might also look into
There are similar metrics for activity tasks as well. Hope this helps.
From the workflow history you described, it looks as if you didn’t register the workflow type with the workers, but you are saying that’s not the case.
worker skips some workflows and picks others randomly
Of the same or a different workflow type?
Could you share the workflow history as well?
@tihomir - Thanks. After looking at the metrics you suggested and tweaking the workerOptions, we no longer see this issue.
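For anyone landing here later: the thread does not record which options were changed or to what values. As a rough sketch only, these are the poller and executor settings that `WorkerOptions` exposes in the 1.x Java SDK; every number below is a hypothetical placeholder, not a recommendation. Newer SDK versions may rename the poll thread setters.

```java
import io.temporal.worker.WorkerOptions;

public class WorkerOptionsSketch {
  // Builds worker options with explicit poller and executor sizes.
  // All values are hypothetical placeholders; tune them against your own metrics.
  static WorkerOptions tunedOptions() {
    return WorkerOptions.newBuilder()
        // Threads that long-poll the task queue for workflow and activity tasks.
        .setWorkflowPollThreadCount(4)
        .setActivityPollThreadCount(8)
        // Upper bounds on tasks this worker executes concurrently.
        .setMaxConcurrentWorkflowTaskExecutionSize(100)
        .setMaxConcurrentActivityExecutionSize(100)
        .build();
  }
}
```

The options are applied per worker, for example `factory.newWorker(taskQueue, WorkerOptionsSketch.tunedOptions())`.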