Individual workflow metric

Hi experts,

We want to set up individual monitoring/alerting for a few business-critical Temporal workflows. What’s the best way to accomplish that?

We are already using the workflow_failed metric published by the server but we need a more fine-grained view into some of them.

Does it require a custom metric to be programmatically emitted by each workflow implementation, or is there another option?

Thanks in advance!

Both the Temporal Cluster and the Temporal SDKs emit metrics.

Temporal SDK Metrics are documented here: SDK metrics | Temporal Documentation
And some Cluster metrics are loosely documented here: Temporal Server self-hosted production deployment | Temporal Documentation

For SDK metrics, where they are emitted to is controlled by the metrics handler specified in the Temporal Client options:
Go: How to set ClientOptions in Go | Temporal Documentation
TypeScript: Logging and Sinks in TypeScript SDK | Temporal Documentation

Just to add, regarding:

We are already using the workflow_failed metric published by the server but we need a more fine-grained view into some of them.

Yes, the server metric workflow_failed does not include the workflow type. You could instead use the SDK metric temporal_workflow_failed, which does include a workflow_type label that you can filter on in Grafana queries to select the specific workflows you want to monitor.
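
As a rough illustration of such a query, the sketch below builds a PromQL expression that counts failures per workflow type over a time window. The helper name failedWorkflowsQuery, the workflow types OrderWorkflow and PaymentWorkflow, and the 5m window are all hypothetical. The _total suffix assumes the counter is exported through Prometheus the way the Java SDK does it, so adjust the metric name to whatever your metrics endpoint actually shows.

```java
import java.util.List;

final class WorkflowFailureQuery {

    // Builds a PromQL expression that sums failures per workflow_type over a window.
    // Assumption: the workflow type names contain no regex metacharacters,
    // because they are joined into a single =~ matcher without escaping.
    static String failedWorkflowsQuery(String metric, List<String> workflowTypes, String window) {
        String matcher = String.join("|", workflowTypes);
        return "sum by (workflow_type) (increase(" + metric
                + "{workflow_type=~\"" + matcher + "\"}[" + window + "]))";
    }

    public static void main(String[] args) {
        // Hypothetical workflow types and window, for illustration only.
        System.out.println(failedWorkflowsQuery(
                "temporal_workflow_failed_total",
                List.of("OrderWorkflow", "PaymentWorkflow"),
                "5m"));
    }
}
```

The printed expression can be pasted into a Grafana panel or used as the basis of an alert rule, for example by comparing it against a threshold of 0.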

Thanks @tihomir, that’s exactly what I was looking for.

The only issue remaining is that our client code does not seem to be emitting the temporal_workflow_failed metric. We’re using the Java SDK. Is there any config I might be missing?

Alternatively, can I use the workflow_failed metric with the taskqueue param?

The only issue remaining is that our client code does not seem to be emitting the temporal_workflow_failed metric. We’re using the Java SDK. Is there any config I might be missing?

Your workers should emit this metric whenever a workflow execution fails, provided you have configured SDK metrics for them.
With the Java SDK it is a counter, so you should see a line like
temporal_workflow_failed_total{ .... } 1.0
on your Prometheus metrics endpoint, for example.
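
One way to check whether a worker emits the metric at all is to scan the text served by its Prometheus endpoint for that counter. The sketch below is a minimal, stdlib-only parser and not part of the Temporal SDK. The class name, the sample text, and the label values in it (default, orders, payments, OrderWorkflow, PaymentWorkflow) are made up for illustration, and it assumes each sample line carries a workflow_type label.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FailedWorkflowScan {

    private static final String METRIC_PREFIX = "temporal_workflow_failed_total{";
    private static final Pattern WORKFLOW_TYPE = Pattern.compile("workflow_type=\"([^\"]*)\"");

    // Returns failure counts keyed by workflow_type, summed across all other labels.
    static Map<String, Double> failuresByType(String metricsText) {
        Map<String, Double> counts = new LinkedHashMap<>();
        for (String raw : metricsText.split("\n")) {
            String line = raw.trim();
            // Skips "# HELP" / "# TYPE" comment lines and every other metric.
            if (!line.startsWith(METRIC_PREFIX)) {
                continue;
            }
            Matcher m = WORKFLOW_TYPE.matcher(line);
            if (!m.find()) {
                continue;
            }
            // The sample value is the first token after the closing brace;
            // an optional timestamp may follow it and is ignored.
            String rest = line.substring(line.lastIndexOf('}') + 1).trim();
            double value = Double.parseDouble(rest.split("\\s+")[0]);
            counts.merge(m.group(1), value, Double::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical endpoint output, for illustration only.
        String sample =
                "# TYPE temporal_workflow_failed_total counter\n"
              + "temporal_workflow_failed_total{namespace=\"default\",task_queue=\"orders\",workflow_type=\"OrderWorkflow\",} 1.0\n"
              + "temporal_workflow_failed_total{namespace=\"default\",task_queue=\"payments\",workflow_type=\"PaymentWorkflow\",} 2.0\n";
        System.out.println(failuresByType(sample));
    }
}
```

An empty result after a workflow has failed suggests the worker process is not reporting SDK metrics to that endpoint, which points back at the metrics configuration rather than at the query.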

Alternatively, can I use the workflow_failed metric with the taskqueue param?

Yes, it has a task_queue label that you can use in your Grafana queries.
