Hi everyone!
We’re currently running our .NET workers in Kubernetes and using the OpenTelemetry Collector to handle all our traces, metrics, and logs. These are then exported to Elastic via APM. Most of the data shows up correctly in Elastic, and we’re able to follow most of our workflows as expected.
However, we’re running into an issue whenever an activity fails. We can see the ApplicationFailureException, which includes all the details we need to debug the actual error. The problem is that we can’t see which workflow ID or run ID the error is associated with.
The only lead we have is the Elastic label error.id, which looks like a UUID. I’m not sure whether this ID is generated by Elastic or by Temporal, or whether it can somehow be used to trace back to the corresponding workflow.
To make things trickier, we have hundreds of thousands of active workflows, so it’s not practical to search for them manually in the Temporal dashboard, unless there’s a way to filter for workflows with currently failing activities.
Has anyone worked with OpenTelemetry + Temporal (and maybe Elastic APM) and run into this issue before?
Any advice or configuration tips would be greatly appreciated!
Disclaimer: We’re still relatively new to OpenTelemetry, so it’s possible our setup isn’t fully correct.