Hi, we are currently looking for a way to get alerted when non-determinism or panic errors happen in our workflows. Does the SDK emit any metrics for these errors that we could use, e.g. with Datadog monitors? I've looked at the Temporal SDK metrics reference in the Temporal documentation, but there doesn't seem to be one.
May I also chime in and suggest setting up replay tests that run against the histories of your production workflows? I've found the best approach is to catch these errors early in CI, so you can plan around them before they hit production.
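To make the replay-testing idea concrete, here is a toy sketch of what such a test checks: the workflow code is re-run and every command it emits is compared with the recorded history. This is not the Temporal SDK API. `replay`, `NonDeterminismError`, the `schedule` callback, the command strings, and both `order_workflow_*` functions are made up for illustration. In a real CI job you would use the replayer that ships with your SDK (in Go, the `WorkflowReplayer` in the `worker` package) against histories exported from production.

```python
from typing import Callable, List

# Hypothetical stand-in for the error an SDK reports on a history mismatch.
class NonDeterminismError(Exception):
    pass


def replay(workflow: Callable[[Callable[[str], None]], None],
           recorded: List[str]) -> None:
    """Re-run `workflow` and compare each emitted command with `recorded`."""
    emitted: List[str] = []

    def schedule(command: str) -> None:
        index = len(emitted)
        expected = recorded[index] if index < len(recorded) else "<end of history>"
        if expected != command:
            raise NonDeterminismError(
                f"step {index}: history has {expected!r}, code produced {command!r}"
            )
        emitted.append(command)

    workflow(schedule)
    # The code must also account for every recorded command.
    if len(emitted) != len(recorded):
        raise NonDeterminismError(
            f"history has {len(recorded)} commands, code produced {len(emitted)}"
        )


# Hypothetical history captured from a run of the original workflow code.
RECORDED_HISTORY = ["activity:charge_card", "timer:30s", "activity:send_receipt"]


def order_workflow_v1(schedule: Callable[[str], None]) -> None:
    schedule("activity:charge_card")
    schedule("timer:30s")
    schedule("activity:send_receipt")


def order_workflow_v2(schedule: Callable[[str], None]) -> None:
    # The timer was removed without versioning, so old histories no longer match.
    schedule("activity:charge_card")
    schedule("activity:send_receipt")


if __name__ == "__main__":
    replay(order_workflow_v1, RECORDED_HISTORY)  # passes silently
    try:
        replay(order_workflow_v2, RECORDED_HISTORY)
    except NonDeterminismError as err:
        print(f"replay failed: {err}")
```

Run in CI against a sample of real histories, a check like this fails the build when a code change is incompatible with workflows that are still running, which is the early warning described above.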