Differentiating single workflow failures vs exhausted retry attempts in Temporal metrics

Hello everyone,

I have workflows that are set up to retry according to a particular retry policy. I'm finding it challenging to distinguish a single workflow failure from the case where all retry attempts have been exhausted and the workflow ultimately fails.

The SDK metric temporal_workflow_failed has been instrumental in tracking workflow failures, but it seems to conflate these two distinct scenarios. I would like to be able to differentiate between a single attempt failure and a scenario where all retry attempts have been exhausted, yet the workflow still fails.

Does anyone know if there’s a way to make this distinction, either by maybe adding a custom tag to the temporal_workflow_failed metric or via some other means? Any insights or alternative approaches to handle this problem would be greatly appreciated.

Thank you in advance for your help.

Each workflow retry is a new execution. Retries are handled by the service, and the associated "workflow_failed" server metric also records each individual execution.

One thing you could do is emit a custom metric. You can get the workflow retry attempt in code with workflow.GetInfo(ctx).Attempt, and get a handle to the metrics scope with workflow.GetMetricsHandler(ctx), to emit a custom counter when the attempt is the last one, for example.

That being said we typically do not recommend setting a retry policy on workflow level. Can you describe your use case to see why it would be needed?

Thanks, @tihomir. I ended up doing exactly what you described.
Having said that, it would have been awesome if it were possible to inject custom tags on the fly into the existing SDK metrics.

That being said we typically do not recommend setting a retry policy on workflow level. Can you describe your use case to see why it would be needed?

This is how we've implemented rate limiting on a per-custom-search-attribute basis.

The very first activity checks the rate limit, and the workflow fails if the limit is exceeded, which then gets retried. Hence the need to track rate-limit errors separately from other errors.