How can we get alerted about non-deterministic or panic errors?

Hi, we are currently looking for a way to get alerted when non-deterministic or panic errors happen in our workflows. Does the SDK emit any metrics for these errors that we could then use, e.g. with Datadog monitors? I'm looking at the Temporal SDK metrics reference | Temporal Documentation, but there doesn't seem to be one.


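You could monitor workflow task failures. Non-deterministic errors and panics in workflow code surface as failed workflow tasks, and the SDK counts them in the workflow_task_execution_failed counter metric: Temporal SDK metrics reference | Temporal Documentation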

There is a feature request here that you can follow. It aims to add more detail to this metric, which will be helpful.

May I also chime in and suggest that you set up replay testing against the histories of your production workflows? I have found that the best approach is to catch these errors early in CI, so you can plan around them before they hit production.

replay testing in Go
replay testing in TS