Hi everyone,
I’m currently integrating Temporal with OpenTelemetry and Datadog for distributed tracing. I’ve implemented a custom sampler in Go that is intended to always sample and log error spans. The sampler checks for error indicators (specifically, an "error" attribute or an "otel.status_code" attribute set to Error) and logs the span details when an error is detected.
My goal is to ensure that all error traces are properly logged and visible in Datadog’s temporal log view. However, despite the sampler correctly sampling these error spans, I’m not seeing the expected error logs in Datadog.
Has anyone encountered this issue or could share insights on best practices for logging error spans in the context of Temporal and OpenTelemetry? Are there specific configurations or recommended approaches—either within Temporal or Datadog’s exporter settings—that might be necessary to capture these error details?
Any guidance or pointers to relevant documentation would be greatly appreciated!
Thanks in advance for your help.
package sampling

import (
	"fmt"
	"log"

	"go.opentelemetry.io/otel/codes"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// CustomSampler always samples spans that carry an error indicator in their
// start attributes and defers to defaultSampler for everything else.
type CustomSampler struct {
	defaultSampler sdktrace.Sampler
}

// NewCustomSampler wraps the error-aware sampler in ParentBased, using the
// same sampler for every parent case.
func NewCustomSampler(defaultRatio float64) sdktrace.Sampler {
	errorAwareSampler := &CustomSampler{
		defaultSampler: sdktrace.TraceIDRatioBased(defaultRatio),
	}
	return sdktrace.ParentBased(
		errorAwareSampler,
		sdktrace.WithRemoteParentSampled(errorAwareSampler),
		sdktrace.WithRemoteParentNotSampled(errorAwareSampler),
		sdktrace.WithLocalParentSampled(errorAwareSampler),
		sdktrace.WithLocalParentNotSampled(errorAwareSampler),
	)
}

// ShouldSample is called when a span starts, so p.Attributes only contains
// the attributes supplied at span creation.
func (s *CustomSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	for _, attr := range p.Attributes {
		if attr.Key == "error" || (attr.Key == "otel.status_code" && attr.Value.AsString() == codes.Error.String()) {
			// Log the span details (this is what the "log" import is for).
			log.Printf("sampling error span: name=%s trace_id=%s", p.Name, p.TraceID)
			return sdktrace.SamplingResult{
				Decision:   sdktrace.RecordAndSample,
				Attributes: p.Attributes,
			}
		}
	}
	return s.defaultSampler.ShouldSample(p)
}

func (s *CustomSampler) Description() string {
	return fmt.Sprintf("CustomSampler(errors=100%%, other=%s)", s.defaultSampler.Description())
}
My goal is to ensure that all error traces are properly logged and visible in Datadog’s temporal log view. However, despite the sampler correctly sampling these error spans, I’m not seeing the expected error logs in Datadog.
Which errors do you expect?
Would you be able to provide a repro sample?
Hi @Kevin_Woo,
Regarding the Temporal OTel trace configuration: I expect to see all error traces, not just a percentage of them. For example, if I set the samplingRate to 10% and there are 10 errors, I still want to see all 10 error traces, not just one.
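To make the requirement above concrete, here is a toy, dependency-free sketch of the intended decision rule: keep every error span, and keep roughly the configured percentage of everything else. The `span` type, `keep`, `countKept`, and `demoSpans` are all hypothetical names for illustration, not part of OTel or Temporal. It also assumes the error flag is already known when the decision is made, which is typically not the case for a head sampler, since a span's status is usually set after it starts.

```go
package main

import "fmt"

// span is a hypothetical stand-in for a finished span.
type span struct {
	id      uint64 // stand-in for a trace ID
	isError bool   // assumed to be known at decision time
}

// keep returns true for every error span and for roughly `percent`
// out of every 100 non-error spans (toy ratio check on the ID).
func keep(s span, percent int) bool {
	if s.isError {
		return true
	}
	return s.id%100 < uint64(percent)
}

// countKept applies keep to all spans and counts survivors by kind.
func countKept(spans []span, percent int) (errorsKept, othersKept int) {
	for _, s := range spans {
		if !keep(s, percent) {
			continue
		}
		if s.isError {
			errorsKept++
		} else {
			othersKept++
		}
	}
	return errorsKept, othersKept
}

// demoSpans builds 100 non-error spans followed by 10 error spans.
func demoSpans() []span {
	spans := make([]span, 0, 110)
	for i := uint64(0); i < 100; i++ {
		spans = append(spans, span{id: i})
	}
	for i := uint64(100); i < 110; i++ {
		spans = append(spans, span{id: i, isError: true})
	}
	return spans
}

func main() {
	e, o := countKept(demoSpans(), 10)
	fmt.Printf("errors kept: %d/10, others kept: %d/100\n", e, o)
}
```

With a 10% setting, all 10 error spans survive while only about a tenth of the remaining spans do, which is the behavior being asked for.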
Can you share how you’re using your sampler with the Temporal SDK and how the trace logs are configured to be sent to Datadog?
Note, Datadog contributed an interceptor (and sample-go/datadog) to generate deterministic SpanIds to handle long-running workflows and workflow Replay across machines. OTel on its own is not able to be deterministic.
func InitializeGlobalGrpcTracerProvider(ctx context.Context, cfg *Config) (*Provider, error) {
	if err := validateConfig(cfg); err != nil {
		return nil, fmt.Errorf("validate trace provider config: %w", err)
	}

	exp, err := newOtlpTraceExporter(ctx, cfg)
	if err != nil {
		return nil, fmt.Errorf("create otlp trace grpc exporter: %w", err)
	}

	bsp := sdktrace.NewBatchSpanProcessor(exp,
		sdktrace.WithBatchTimeout(defaultBatchTimeout),
		sdktrace.WithMaxQueueSize(defaultMaxQueueSize),
		sdktrace.WithMaxExportBatchSize(defaultMaxExportBatchSize),
	)

	sampler := cfg.Sampler
	if sampler == nil {
		sampler = sdktrace.AlwaysSample()
	}

	res, err := newResource(ctx, cfg)
	if err != nil {
		return nil, fmt.Errorf("create resource: %w", err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSpanProcessor(bsp),
		sdktrace.WithSampler(sampler),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	return &Provider{
		tp:     tp,
		config: cfg,
	}, nil
}
// getSamplingRate converts a string sampling ratio to a float64 rate.
// The input can be either:
//   - A ratio between 0 and 1 (e.g., "0.1" for 10% sampling)
//   - A percentage between 0 and 100 (e.g., "10" for 10% sampling)
//
// Returns defaultSamplingRatio if the input is invalid.
func getSamplingRate(samplingRatio string) float64 {
	if samplingRatio == "" {
		return defaultSamplingRatio
	}

	v, err := strconv.ParseFloat(samplingRatio, 64)
	if err != nil {
		config.Logger.Debug("invalid sampling ratio",
			zap.String("value", samplingRatio),
			zap.Error(err))
		return defaultSamplingRatio
	}

	switch {
	case v == 0:
		config.Logger.Debug("sampling ratio cannot be zero")
		return defaultSamplingRatio
	case v > 0 && v <= 1:
		// Input is already a ratio (including 1), use as is
		return v
	case v > 1 && v <= 100:
		// Input is a percentage, convert to ratio
		return v / percentageFactor
	default:
		config.Logger.Debug("sampling ratio out of range",
			zap.Float64("value", v))
		return defaultSamplingRatio
	}
}
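For anyone following along, here is a stripped-down, dependency-free sketch of the same normalization logic, handy for checking what a given config string resolves to. `normalizeRatio` and `defaultRatio` are hypothetical names; the fallback of 1.0 and the divisor of 100 are assumptions, since the values of `defaultSamplingRatio` and `percentageFactor` are not shown above.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultRatio is an assumed fallback; the real defaultSamplingRatio is not shown.
const defaultRatio = 1.0

// normalizeRatio mirrors the branching of getSamplingRate without the logger.
func normalizeRatio(s string) float64 {
	v, err := strconv.ParseFloat(s, 64)
	if err != nil {
		// Covers the empty string and non-numeric input.
		return defaultRatio
	}
	switch {
	case v > 0 && v <= 1:
		// Already a ratio.
		return v
	case v > 1 && v <= 100:
		// Treated as a percentage; assumes percentageFactor is 100.
		return v / 100
	default:
		// Zero, negative, or above 100.
		return defaultRatio
	}
}

func main() {
	for _, in := range []string{"0.1", "10", "1", "100", "0", "abc", "250"} {
		fmt.Printf("%q -> %v\n", in, normalizeRatio(in))
	}
}
```

One thing this makes visible: "1" is read as a ratio (100% sampling), not as 1%, so a 1% rate has to be written as "0.01".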
Thanks, taking a look. Another question popped up while thinking through this: are you looking to capture SDK errors, or just errors thrown from your Workflows and Activities?
I don’t believe the SDKs are set up to emit spans, so you’ll only get results from the stuff you instrument, but I’ll also double-check this.
Also, are you using Datadog’s OTel SDK, or just plain OTel?
I suspect your ShouldSample() method is not correctly matching errors so that they show up 100% of the time; instead, they’re hitting the TraceIDRatioBased sampler.
For error matching, Datadog uses these keys; I think the errors are specifically keyed by error.message instead of just error.
For OTel specifically, I believe it’s keyed as exception.message (v1.33.0).
Try doing a strings.Contains(string(attr.Key), "error") and strings.Contains(string(attr.Key), "exception") to see if that gets you into that condition?
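A minimal sketch of that key check, pulled out into a helper so it can be tried in isolation. `isErrorKey` is a hypothetical name, and the keys in `main` are just the ones mentioned in this thread. Keep in mind that ShouldSample only sees attributes supplied when the span starts, so this only helps if the error attributes are already present at that point.

```go
package main

import (
	"fmt"
	"strings"
)

// isErrorKey reports whether an attribute key looks error-related,
// i.e. contains "error" or "exception" (case-insensitive).
func isErrorKey(key string) bool {
	k := strings.ToLower(key)
	return strings.Contains(k, "error") || strings.Contains(k, "exception")
}

func main() {
	keys := []string{"error", "error.message", "exception.message", "otel.status_code", "http.method"}
	for _, key := range keys {
		fmt.Printf("%-20s %v\n", key, isErrorKey(key))
	}
}
```

Inside the sampler this would be called as isErrorKey(string(attr.Key)). Note that "otel.status_code" does not match on its key alone, so the existing value comparison for that attribute would still be needed.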