Checking whether a worker is running

Hi all,
I want to notify myself if a worker is not running. What is the best way to check the health of my workers?

In Python, if a worker has not returned from run(), it is running. Internally there can be issues with server communication and such, but we retry internally unless/until they reach a fatal state, at which point the run() call fails. You can use metrics to monitor performance: Developer's guide - Worker performance | Temporal Documentation.
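As a rough sketch of that pattern (the server address, task queue, trivial activity, and send_alert helper below are placeholders for illustration, not an official recipe):

    import asyncio

    from temporalio import activity
    from temporalio.client import Client
    from temporalio.worker import Worker


    @activity.defn
    async def noop() -> None:
        # Placeholder activity so the worker has something registered.
        pass


    async def send_alert(message: str) -> None:
        # Placeholder: wire this up to email, Slack, PagerDuty, etc.
        print(f"ALERT: {message}")


    async def main() -> None:
        client = await Client.connect("localhost:7233")
        worker = Worker(client, task_queue="my-task-queue", activities=[noop])
        try:
            # run() keeps running until shutdown() is called or a fatal error occurs
            await worker.run()
            await send_alert("Worker stopped (run() returned)")
        except Exception as err:
            await send_alert(f"Worker failed fatally: {err}")
            raise


    if __name__ == "__main__":
        asyncio.run(main())

The alert fires exactly when run() returns or raises, which is the signal the SDK gives you that the worker is no longer running.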

So there is no way to check whether workers are available? I have multiple of them running on bare metal. Maybe there is some kind of method that returns the available workers or something like that?
I was thinking about implementing a watchdog for my workers, so it can reach out to me and I can fix things ASAP without looking at metrics, etc.

Can you clarify what "available" means here? It is your code/process that invokes worker.run(), so you should be able to know in your code whether that has returned (or you can expose that information remotely to your other systems). Technically there is an advanced API for describing a task queue that can list workers, but it is eventually consistent (i.e. not always accurate).

In a recent example, one of my bare metal workers lost its network connection; the other one kept running stably, but it was still a reliability issue. I was digging into the client because I was thinking that Temporal knows the list of workers it can send workflows to. In my head, I pictured this:
Temporal knows the number of workers when it sends workflows; if one worker is down for whatever reason, Temporal marks it in a list of workers with a label like "unreachable".

My thought was that I could run a separate workflow to monitor worker availability.
If you know a better solution for that, please share.

Temporal does not always know the workers you have running, but you do :slight_smile: Temporal only knows when a worker is asking for work. A worker may be busy processing (or may have reached max concurrent tasks, or whatever) and not ask for work again until it wants to. Having said that, Temporal does have an eventually consistent API to tell whether workers have asked for work recently, but that is not the best way to know whether the workers you started are still running.

Yes, you should monitor your worker process like any other Python process. If the process just runs await worker.run(), then the process will fail when the worker fatals and will be running while the worker is running. Do the same type of daemon monitoring you might do for any other long-running Python process. Don't rely on the Temporal server to relay what you have running.
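For example, one simple way to make liveness visible to an external watchdog is a heartbeat file maintained next to await worker.run(). This is just a sketch; the file path, interval, and helper names are made up, and something like a cron job or systemd timer would do the actual alerting when the file goes stale:

    import asyncio
    import time
    from pathlib import Path

    HEARTBEAT_FILE = Path("/tmp/temporal-worker.heartbeat")  # arbitrary path


    async def heartbeat(interval_seconds: float = 15.0) -> None:
        # Touch the heartbeat file periodically while this process is alive.
        while True:
            HEARTBEAT_FILE.write_text(str(time.time()))
            await asyncio.sleep(interval_seconds)


    async def run_worker_with_heartbeat(worker) -> None:
        # `worker` is a temporalio.worker.Worker you have already constructed.
        hb = asyncio.create_task(heartbeat())
        try:
            await worker.run()
        finally:
            # Stop heartbeating as soon as run() returns or raises so the
            # external watchdog notices the stale file and alerts.
            hb.cancel()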

In the Temporal admin panel, we have "workers" indicating the number of workers available.
Can I somehow reach that data?

Yes, but it is the eventually consistent (i.e. sometimes stale) API method I spoke of: the advanced DescribeTaskQueue. You can invoke it in Python with something like:

    await my_client.workflow_service.describe_task_queue(
        temporalio.api.workflowservice.v1.DescribeTaskQueueRequest(
            namespace="...",
            task_queue=temporalio.api.taskqueue.v1.TaskQueue(name="..."),
        )
    )

This is an advanced API, so you are invoking gRPC directly because we do not have a high-level Python wrapper for it.


A couple of useful metrics from the service side, if it helps:

Per-namespace poller counts (workflow/activity pollers). If this goes to 0 (or drops and you don't have auto-scaling), it can be a pretty good indication that your workers stopped polling because they are down, or that they are at capacity and no longer polling.

    sum(service_pending_requests{service_name="frontend"}) by (namespace)

Task queue (matching service) backlog (poll latencies):

    histogram_quantile(0.95, sum(rate(service_latency_bucket{operation=~"PollActivityTaskQueue|PollWorkflowTaskQueue"}[5m])) by (operation, le))

If you see these latencies go up, it can be an indication that workers have stopped/are not polling, or that you do not have enough workers for the generated workload.

Sync match rate:

    sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

If it drops to 0, it can mean workers are down/no longer polling. It can be filtered by type (workflow tasks/activity tasks).

Also related, on the persistence side:

    sum(rate(persistence_requests{operation="CreateTask"}[1m]))

This should go up (tasks written to the db) only when there are no pollers available to pick up tasks from matching task queues.
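If these are scraped into Prometheus, a watchdog can also check them programmatically. A rough sketch against the standard Prometheus instant-query HTTP API (the Prometheus URL and the zero threshold are placeholder choices):

    import json
    import urllib.parse
    import urllib.request

    PROMETHEUS_URL = "http://localhost:9090"  # placeholder


    def query_prometheus(promql: str) -> list:
        # Prometheus instant-query API: GET /api/v1/query?query=<promql>
        url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["data"]["result"]


    def sync_match_rate() -> float:
        # The sync match rate query from above; 0 can mean workers are down or not polling.
        promql = "sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))"
        result = query_prometheus(promql)
        return float(result[0]["value"][1]) if result else 0.0


    if __name__ == "__main__":
        if sync_match_rate() == 0.0:
            print("ALERT: sync match rate is 0 - workers may be down or not polling")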


If I am getting an empty response for the describe_task_queue call, what does it mean?
My code looks like this:

    from temporalio import api
    from temporalio.api.workflowservice import v1
    import temporalio.api.taskqueue.v1  # needed so api.taskqueue.v1 resolves

    result = await client.workflow_service.describe_task_queue(
        v1.DescribeTaskQueueRequest(
            namespace="namespace",
            task_queue=api.taskqueue.v1.TaskQueue(name="task_queue"),
        )
    )

result would be the protobuf form of DescribeTaskQueueResponse. The API is a bit advanced, but you can use the pollers collection on it to see which workers have polled "recently".

The recommendation is still to keep track of the workers you start if you want to know which ones are running, as this only shows workers that have polled recently.
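For example, with the result from a call like yours above, you can inspect each poller's identity and last access time (the two-minute freshness threshold here is an arbitrary choice for illustration):

    from datetime import datetime, timedelta, timezone

    # `result` is the DescribeTaskQueueResponse from the call above
    for poller in result.pollers:
        last_seen = poller.last_access_time.ToDatetime().replace(tzinfo=timezone.utc)
        age = datetime.now(timezone.utc) - last_seen
        recent = age < timedelta(minutes=2)  # arbitrary freshness threshold
        print(f"identity={poller.identity} last polled {age} ago (recent={recent})")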

I am also looking for the same mechanism. I tried the DescribeTaskQueue but it is returning empty pollers.

@Chad_Retz The following code worked for me:

    from temporalio.api.enums.v1 import TaskQueueKind, TaskQueueType
    from temporalio.api.taskqueue.v1 import TaskQueue
    from temporalio.api.workflowservice import v1

    result = await client.workflow_service.describe_task_queue(
        v1.DescribeTaskQueueRequest(
            namespace=TEMPORAL_CONSTANTS.NAMESPACE,
            include_task_queue_status=True,
            task_queue_type=TaskQueueType.TASK_QUEUE_TYPE_WORKFLOW,
            task_queue=TaskQueue(
                name=TEMPORAL_CONSTANTS.TASK_QUEUE,
                kind=TaskQueueKind.TASK_QUEUE_KIND_NORMAL,
            ),
        )
    )

    print(result.pollers)

I have a couple of questions:

  • What is the default worker polling interval?
  • How long will a worker show up in the pollers list? I am asking in case of a worker shutdown.

The pollers set on that call is eventually consistent, and just because a worker isn't polling doesn't mean it's not running; it may just be "full".

Worker polling times out at around a minute.

I am not exactly sure; I would have to check. But you should not rely on this eventually consistent method. I would recommend you notify yourself when you call shutdown or when the worker returns from run(). That is a better way to know about your worker than relying on this value from the server. Worker metrics can also be used to determine the amount of work moving through.

@Chad_Retz we’re implementing this to work around [Bug] Worker.finalize_shutdown seems to hang when poll never succeeded due to server permission failure · Issue #667 · temporalio/sdk-core · GitHub, so in our case we’re not actually calling shutdown, nor is run() returning.

You may have to wait on bug completion, but in the meantime, you may need to look at logs/metrics to determine whether you are having poll failures.

Which metrics can we use to determine polling failures? Is there any API available?

The SDK metrics are listed here. In this case, temporal_long_request_failure should increment on each failure. Here is a sample showing how to configure Prometheus metrics, and here is one for OpenTelemetry (though that one also includes the tracing part, which you may not need).
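For reference, a minimal sketch of enabling the Prometheus endpoint in the Python SDK (the bind address and server target are arbitrary):

    from temporalio.client import Client
    from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig


    async def connect_with_metrics() -> Client:
        # SDK metrics (including temporal_long_request_failure) get exposed at
        # http://0.0.0.0:9464/metrics for Prometheus to scrape.
        runtime = Runtime(
            telemetry=TelemetryConfig(metrics=PrometheusConfig(bind_address="0.0.0.0:9464"))
        )
        return await Client.connect("localhost:7233", runtime=runtime)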

Thanks, @Chad_Retz. Do you have an idea of when that Worker shutdown bug will be fixed? It’ll help us decide if it’s worth the effort of enabling metrics in the SDK, writing a liveness probe that queries Prometheus, etc. vs just waiting for the fix.

I am afraid I do not have a timeline, sorry. We will look into prioritizing it. But you will want metrics available all the time anyway. Permission denial (as a result of invalid client options for a worker) is but one of many reasons a poll call could fail.