Inconsistent/Elevated SDK Metrics (Python SDK)

Hi there,

My team is working with Temporal via the Python SDK and is piping SDK metrics into Datadog. We're trying to set up monitoring for workflow failures via the temporal_workflow_failed metric, but are seeing unexpected behavior: for roughly 70 failed workflows in the past 12 hours, the metric registers a count in the neighborhood of 7,000, about 100x what we'd expect.

The workflow itself is fairly simple: a single activity that retries once, with no retries of the workflow itself. The metric is filtered to this exact workflow_type and environment. My reading of the SDK documentation is that the counter should increment once per workflow that reaches the failed state. Any ideas that might help us chase down the source of this issue, or similar experiences, would be appreciated!

Hi,

Not sure if this is the issue, but with Datadog you need to configure the OpenTelemetry exporter to use DELTA temporality instead of CUMULATIVE. With cumulative temporality, every export reports the running total since worker start, so the same failures get reported over and over and the backend ends up counting them repeatedly.

from temporalio.runtime import OpenTelemetryConfig, OpenTelemetryMetricTemporality

metrics=OpenTelemetryConfig(
    ...,  # your existing exporter settings (URL, headers, etc.)
    metric_temporality=OpenTelemetryMetricTemporality.DELTA,
),
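To see why this produces roughly a 100x inflation, here is a toy sketch in plain Python (the export count of 100 is a hypothetical number, not something from the SDK; the assumption is that the backend naively sums the reported points):

```python
# 70 failures happen early in the window, then the SDK exports 100 times.
failures = 70
n_exports = 100  # hypothetical: ~one export every 7 minutes over 12 hours

# CUMULATIVE: every export reports the running total since process start.
cumulative = [failures for _ in range(n_exports)]

# DELTA: each export reports only the change since the previous export.
delta = [failures] + [0] * (n_exports - 1)

print(sum(cumulative))  # 7000 -- the inflated count seen in the dashboard
print(sum(delta))       # 70   -- the true number of failures
```

This matches the symptom in the original post: ~70 failures showing up as ~7,000 when the cumulative points are summed.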
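For context, here is a sketch of where that config plugs in with the Python SDK: the OpenTelemetryConfig goes into a TelemetryConfig on a custom Runtime, which the client is then built from. The endpoint URL and server address below are placeholders for your own setup:

```python
import asyncio

from temporalio.client import Client
from temporalio.runtime import (
    OpenTelemetryConfig,
    OpenTelemetryMetricTemporality,
    Runtime,
    TelemetryConfig,
)


async def main() -> None:
    # Runtime that exports SDK metrics over OTLP with delta temporality.
    runtime = Runtime(
        telemetry=TelemetryConfig(
            metrics=OpenTelemetryConfig(
                url="http://localhost:4317",  # placeholder: your OTLP collector/Agent endpoint
                metric_temporality=OpenTelemetryMetricTemporality.DELTA,
            )
        )
    )

    # Clients and workers created from this runtime emit delta metrics.
    client = await Client.connect("localhost:7233", runtime=runtime)


if __name__ == "__main__":
    asyncio.run(main())
```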

Thanks for this! I’ll try this out and see if it helps!

Just following up to confirm that switching to DELTA temporality cleared up our metrics issue! Thanks so much for the assistance!