Checking whether a worker is running

Hi all,
I want to notify myself if a worker is not running. What is the best way to check the health of my workers?

In Python, if a worker has not returned from run(), it is running. Internally there can be issues with server communication and such, but we retry internally unless/until they reach a fatal state, at which point the run() call fails. You can use metrics to monitor performance: Developer's guide - Worker performance | Temporal Documentation.
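As a rough sketch of that pattern (the server address, task queue, trivial activity, and send_alert helper below are placeholders for illustration, not an official recipe):

    import asyncio

    from temporalio import activity
    from temporalio.client import Client
    from temporalio.worker import Worker


    @activity.defn
    async def noop() -> None:
        # Placeholder activity so the worker has something registered.
        pass


    async def send_alert(message: str) -> None:
        # Placeholder: wire this up to email, Slack, PagerDuty, etc.
        print(f"ALERT: {message}")


    async def main() -> None:
        client = await Client.connect("localhost:7233")
        worker = Worker(client, task_queue="my-task-queue", activities=[noop])
        try:
            # run() keeps running until shutdown() is called or a fatal error occurs
            await worker.run()
            await send_alert("Worker stopped (run() returned)")
        except Exception as err:
            await send_alert(f"Worker failed fatally: {err}")
            raise


    if __name__ == "__main__":
        asyncio.run(main())

The alert fires exactly when run() returns or raises, which is the signal the SDK gives you that the worker is no longer running.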

So there is no way to check whether workers are available? I have multiple of them running on bare metal. Maybe there is some kind of method that returns the available workers or something like that?
I was thinking about implementing a watchdog for my workers, so it can reach out to me and I can fix things ASAP without looking at metrics, etc.

Can you clarify what "available" means here? It is your code/process that invokes worker.run(), so you should be able to know in your code whether that has returned (or you can expose that information remotely to your other systems). Technically there is an advanced API for describing a task queue that can list workers, but it is eventually consistent (i.e. not always accurate).

In a recent example, one of my bare metal workers lost its network connection; the other one kept running stably, but it was still a reliability issue. I was digging into the client because I was thinking that Temporal knows the list of workers it can send workflows to. In my head, I pictured this:
Temporal knows the number of workers when it sends workflows; if one worker is down for whatever reason, Temporal marks it in a list of workers with a label like "unreachable".

My thought was that I could run a separate workflow to monitor worker availability.
If you know a better solution for that, please share.

Temporal does not always know the workers you have running, but you do :slight_smile: Temporal only knows when a worker is asking for work. A worker may be busy processing (or may have reached max concurrent tasks, or whatever) and not ask for work again until it wants to. Having said that, Temporal does have an eventually consistent API to tell whether workers have asked for work recently, but that is not the best way to know whether the workers you started are still running.

Yes, you should monitor your worker process like any other Python process. If the process just runs await worker.run(), then the process will fail when the worker fatals and will be running while the worker is running. Do the same type of daemon monitoring you might do for any other long-running Python process. Don't rely on the Temporal server to relay what you have running.
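For example, one simple way to make liveness visible to an external watchdog is a heartbeat file maintained next to await worker.run(). This is just a sketch; the file path, interval, and helper names are made up, and something like a cron job or systemd timer would do the actual alerting when the file goes stale:

    import asyncio
    import time
    from pathlib import Path

    HEARTBEAT_FILE = Path("/tmp/temporal-worker.heartbeat")  # arbitrary path


    async def heartbeat(interval_seconds: float = 15.0) -> None:
        # Touch the heartbeat file periodically while this process is alive.
        while True:
            HEARTBEAT_FILE.write_text(str(time.time()))
            await asyncio.sleep(interval_seconds)


    async def run_worker_with_heartbeat(worker) -> None:
        # `worker` is a temporalio.worker.Worker you have already constructed.
        hb = asyncio.create_task(heartbeat())
        try:
            await worker.run()
        finally:
            # Stop heartbeating as soon as run() returns or raises so the
            # external watchdog notices the stale file and alerts.
            hb.cancel()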

In the Temporal admin panel, we have "workers" indicating the number of workers available.
Can I somehow reach that data?

Yes, but it is the eventually consistent (i.e. sometimes stale) API method I spoke of: the advanced DescribeTaskQueue. You can invoke it in Python with something like:

    await my_client.workflow_service.describe_task_queue(
        temporalio.api.workflowservice.v1.DescribeTaskQueueRequest(
            namespace="...",
            task_queue=temporalio.api.taskqueue.v1.TaskQueue(name="..."),
        )
    )

This is an advanced API, so you are invoking gRPC directly because we do not have a high-level Python wrapper for it.


A couple of useful metrics from the service side, if it helps:

Per-namespace poller counts (workflow/activity pollers). If this goes to 0 (or drops and you don't have auto-scaling), it can be a pretty good indication that your workers stopped polling because they are down, or that they are at capacity and no longer polling.

    sum(service_pending_requests{service_name="frontend"}) by (namespace)

Task queue (matching service) backlog (poll latencies):

    histogram_quantile(0.95, sum(rate(service_latency_bucket{operation=~"PollActivityTaskQueue|PollWorkflowTaskQueue"}[5m])) by (operation, le))

If you see these latencies go up, it can be an indication that workers have stopped/are not polling, or that you do not have enough workers for the generated workload.

Sync match rate:

    sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

If it drops to 0, it can mean workers are down/no longer polling. It can be filtered by type (workflow tasks/activity tasks).

Also related, on the persistence side:

    sum(rate(persistence_requests{operation="CreateTask"}[1m]))

This should go up (tasks written to the db) only when there are no pollers available to pick up tasks from matching task queues.
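If these are scraped into Prometheus, a watchdog can also check them programmatically. A rough sketch against the standard Prometheus instant-query HTTP API (the Prometheus URL and the zero threshold are placeholder choices):

    import json
    import urllib.parse
    import urllib.request

    PROMETHEUS_URL = "http://localhost:9090"  # placeholder


    def query_prometheus(promql: str) -> list:
        # Prometheus instant-query API: GET /api/v1/query?query=<promql>
        url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["data"]["result"]


    def sync_match_rate() -> float:
        # The sync match rate query from above; 0 can mean workers are down or not polling.
        promql = "sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))"
        result = query_prometheus(promql)
        return float(result[0]["value"][1]) if result else 0.0


    if __name__ == "__main__":
        if sync_match_rate() == 0.0:
            print("ALERT: sync match rate is 0 - workers may be down or not polling")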


If I am getting an empty response for the describe_task_queue call, what does it mean?
My code looks like this:

    from temporalio import api
    from temporalio.api.workflowservice import v1
    import temporalio.api.taskqueue.v1  # needed so api.taskqueue.v1 resolves

    result = await client.workflow_service.describe_task_queue(
        v1.DescribeTaskQueueRequest(
            namespace="namespace",
            task_queue=api.taskqueue.v1.TaskQueue(name="task_queue"),
        )
    )

result would be the protobuf form of DescribeTaskQueueResponse. The API is a bit advanced, but you can use the pollers collection on it to see which workers have polled "recently".

The recommendation is still to keep track of the workers you start if you want to know which ones are running, as this only shows workers that have polled recently.
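For example, with the result from a call like yours above, you can inspect each poller's identity and last access time (the two-minute freshness threshold here is an arbitrary choice for illustration):

    from datetime import datetime, timedelta, timezone

    # `result` is the DescribeTaskQueueResponse from the call above
    for poller in result.pollers:
        last_seen = poller.last_access_time.ToDatetime().replace(tzinfo=timezone.utc)
        age = datetime.now(timezone.utc) - last_seen
        recent = age < timedelta(minutes=2)  # arbitrary freshness threshold
        print(f"identity={poller.identity} last polled {age} ago (recent={recent})")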

I am also looking for the same mechanism. I tried the DescribeTaskQueue but it is returning empty pollers.

@Chad_Retz The following code worked for me:

    from temporalio.api.enums.v1 import TaskQueueKind, TaskQueueType
    from temporalio.api.taskqueue.v1 import TaskQueue
    from temporalio.api.workflowservice import v1

    result = await client.workflow_service.describe_task_queue(
        v1.DescribeTaskQueueRequest(
            namespace=TEMPORAL_CONSTANTS.NAMESPACE,
            include_task_queue_status=True,
            task_queue_type=TaskQueueType.TASK_QUEUE_TYPE_WORKFLOW,
            task_queue=TaskQueue(
                name=TEMPORAL_CONSTANTS.TASK_QUEUE,
                kind=TaskQueueKind.TASK_QUEUE_KIND_NORMAL,
            ),
        )
    )

    print(result.pollers)

I have a couple of questions:

  • What is the default worker polling interval?
  • How long will a worker show up in the pollers list? I am asking in case of a worker shutdown.

The pollers set on that call is eventually consistent, and just because a worker isn't polling doesn't mean it's not running; it may just be "full".

Worker polling times out at around a minute.

I am not exactly sure; I would have to check. But you should not rely on this eventually consistent method. I would recommend you notify yourself when you call shutdown or when the worker returns from run(). That is a better way to know about your worker than relying on this value from the server. Worker metrics can also be used to determine the amount of work moving through.

@Chad_Retz we’re implementing this to work around [Bug] Worker.finalize_shutdown seems to hang when poll never succeeded due to server permission failure · Issue #667 · temporalio/sdk-core · GitHub, so in our case we’re not actually calling shutdown, nor is run() returning.

You may have to wait on bug completion, but in the meantime, you may need to look at logs/metrics to determine whether you are having poll failures.

Which metrics can we use to determine polling failures? Is there any API available?

The SDK metrics are listed here. In this case, temporal_long_request_failure should increment on each failure. Here is a sample showing how to configure Prometheus metrics, and here is one for OpenTelemetry (though that one also includes the tracing part, which you may not need).
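For reference, a minimal sketch of enabling the Prometheus endpoint in the Python SDK (the bind address and server target are arbitrary):

    from temporalio.client import Client
    from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig


    async def connect_with_metrics() -> Client:
        # SDK metrics (including temporal_long_request_failure) get exposed at
        # http://0.0.0.0:9464/metrics for Prometheus to scrape.
        runtime = Runtime(
            telemetry=TelemetryConfig(metrics=PrometheusConfig(bind_address="0.0.0.0:9464"))
        )
        return await Client.connect("localhost:7233", runtime=runtime)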

Thanks, @Chad_Retz. Do you have an idea of when that Worker shutdown bug will be fixed? It’ll help us decide if it’s worth the effort of enabling metrics in the SDK, writing a liveness probe that queries Prometheus, etc. vs just waiting for the fix.

I am afraid I do not have a timeline, sorry. We will look into prioritizing it. But you will want metrics available all the time anyway. Permission denial (as a result of invalid client options for a worker) is but one of many reasons a poll call could fail.