Differentiating single workflow failures vs exhausted retry attempts in Temporal metrics

Hello everyone,

I have workflows that are set up to retry according to a particular retry policy. I'm finding it challenging to distinguish a single workflow failure from the case where all retry attempts have been exhausted and the workflow ultimately fails.

The SDK metric temporal_workflow_failed has been instrumental in tracking workflow failures, but it seems to conflate these two distinct scenarios. I would like to be able to differentiate between a single attempt failure and a scenario where all retry attempts have been exhausted, yet the workflow still fails.

Does anyone know if there’s a way to make this distinction, either by maybe adding a custom tag to the temporal_workflow_failed metric or via some other means? Any insights or alternative approaches to handle this problem would be greatly appreciated.

Thank you in advance for your help.

Each workflow retry is a new execution. Retries are handled by the service, and the associated "workflow_failed" server metric also records each individual execution.

One thing you could do is emit a custom metric. You can get the workflow retry attempt in code with workflow.GetInfo(ctx).Attempt, and get a handle to the metrics scope with workflow.GetMetricsHandler(ctx), to emit a custom counter when the attempt is the last one, for example.

That being said we typically do not recommend setting a retry policy on workflow level. Can you describe your use case to see why it would be needed?

Thanks, @tihomir. I ended up doing exactly what you described.
Having said that, it would have been awesome if it were possible to inject custom tags on the fly into the existing SDK metrics.

That being said we typically do not recommend setting a retry policy on workflow level. Can you describe your use case to see why it would be needed?

This is how we've implemented rate limiting on a per-custom-search-attribute basis.

The very first activity checks the rate limit, and the workflow fails if the limit is exceeded, which then gets retried. Hence the need to track rate-limit errors separately from other errors.