Workers stop listening to the queue intermittently

We have observed that workers sometimes go missing from the task queue. When that happens, running query methods from the UI fails with a ‘No Active workers in the queue’ error.

Questions:

  1. Do workers sleep after, say, 2 hours with no tasks in the task queue? Could that be causing the problem?
    or
  2. If the workers stop listening to the queue, could it be a networking issue between the worker and the Temporal server?

Worker polls can time out if no task arrives on the task queue they are polling within the poll timeout (default 60s).
WorkflowServiceStubsOptions->DEFAULT_GRPC_RECONNECT_FREQUENCY has a default value of 1 minute, so your polls should reconnect after that time.
I would look for possible issues with this reconnect on your side. Is there anything in the logs?

I feel that workers polling every 60s works fine for executing workflow tasks.

However, since query methods are synchronous, is there a chance that all the workers polled and went to sleep for 60s, and right then a query came in, found no worker to serve it, and timed out?

What are the side effects of reducing this poll duration to ensure query methods are always fulfilled?

  1. Do workers sleep after, say, 2 hours with no tasks in the task queue? Could that be causing the problem?

No, workers never sleep. They keep long-polling the task queue until they are shut down.

I feel that workers polling every 60s works fine for executing workflow tasks.

However, since query methods are synchronous, is there a chance that all the workers polled and went to sleep for 60s, and right then a query came in, found no worker to serve it, and timed out?

What are the side effects of reducing this poll duration to ensure query methods are always fulfilled?

A worker uses long polling. It makes a poll call that blocks for up to 60 seconds. If no task arrives during that time, the call returns an empty result and a new call is made immediately. If a task arrives while a poll is waiting, it is delivered immediately. So long polling gives very good latency without putting much load on the SDK or the service, since an idle poller makes only one call every 60 seconds.
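To make the long-poll behavior concrete, here is a minimal sketch in plain Java. It does not use the Temporal SDK: the `TaskQueue` class, the `measureDeliveryLatencyMillis` helper, and the shortened timeouts are all hypothetical stand-ins chosen for illustration. It only models the idea that a poller blocks until a task arrives or the timeout expires, and re-polls right away after an empty result.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class LongPollSketch {

    /** Hypothetical stand-in for a server-side task queue (not a Temporal API). */
    static final class TaskQueue {
        private final BlockingQueue<String> tasks = new LinkedBlockingQueue<>();

        /** Blocks up to timeoutMillis; returns null (an "empty poll") if no task arrived. */
        String poll(long timeoutMillis) throws InterruptedException {
            return tasks.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        }

        void add(String task) {
            tasks.add(task);
        }
    }

    /**
     * Starts one poller on an empty queue, adds a task after delayMillis,
     * and returns how many milliseconds passed before the poller received it.
     */
    static long measureDeliveryLatencyMillis(long pollTimeoutMillis, long delayMillis) {
        TaskQueue queue = new TaskQueue();
        AtomicLong receivedAt = new AtomicLong();
        CountDownLatch done = new CountDownLatch(1);

        Thread poller = new Thread(() -> {
            try {
                while (true) {
                    String task = queue.poll(pollTimeoutMillis);
                    if (task == null) {
                        continue; // empty poll: start the next poll immediately, no sleeping
                    }
                    receivedAt.set(System.nanoTime());
                    done.countDown();
                    return;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        poller.setDaemon(true);
        poller.start();

        try {
            Thread.sleep(delayMillis); // the poller goes through some empty polls meanwhile
            long sentAt = System.nanoTime();
            queue.add("query");
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("task was never delivered");
            }
            return TimeUnit.NANOSECONDS.toMillis(receivedAt.get() - sentAt);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        // 300 ms poll timeout instead of 60 s, purely to keep the demo short.
        long latency = measureDeliveryLatencyMillis(300, 700);
        System.out.println("delivery latency (ms): " + latency);
    }
}
```

In this model the task is added while the poller is in the middle of a poll, and the measured latency should be far below the poll timeout. That is the point made above: the timeout bounds how long an idle call stays open, not how long a task or query waits, so shortening it would mostly add more poll calls.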