Hi there,
My team is working with Temporal via the Python SDK and is piping SDK metrics into Datadog. We’re trying to set up monitoring for workflow failures via the temporal_workflow_failed metric, but are seeing strange/unexpected behavior. For ~70 failed workflows in the past 12 hours, we are seeing the metric register a count in the neighborhood of 7000.
The workflow itself is fairly simple - 1 activity that retries once, no retries of the workflow itself. The metric is being filtered to this exact workflow_type and operating environment. The SDK documentation reads to me like it should only be registering 1 increment for a workflow reaching failed state. Any ideas that might help us chase down the source of this issue or similar experiences are appreciated!