During periods of high load, some of our workflows get stuck at WorkflowTaskScheduled
for multiple days, even though there are active workers on the task queues.
Looking at our Datadog logs for an affected workflow, we see this error:
Persistent store operation Failure
with the following trace:
go.temporal.io/server/common/log.(*zapLogger).Error
/home/builder/temporal/common/log/zap_logger.go:156
go.temporal.io/server/service/history/workflow.createWorkflowExecution
/home/builder/temporal/service/history/workflow/transaction_impl.go:381
go.temporal.io/server/service/history/workflow.(*ContextImpl).CreateWorkflowExecution
/home/builder/temporal/service/history/workflow/context.go:334
go.temporal.io/server/service/history/api/startworkflow.(*Starter).createBrandNew
/home/builder/temporal/service/history/api/startworkflow/api.go:250
go.temporal.io/server/service/history/api/startworkflow.(*Starter).Invoke
/home/builder/temporal/service/history/api/startworkflow/api.go:178
go.temporal.io/server/service/history.(*historyEngineImpl).StartWorkflowExecution
/home/builder/temporal/service/history/history_engine.go:358
go.temporal.io/server/service/history.(*Handler).StartWorkflowExecution
/home/builder/temporal/service/history/handler.go:588
go.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler.func1
/home/builder/temporal/api/historyservice/v1/service.pb.go:1157
go.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept.func1
/home/builder/temporal/common/rpc/interceptor/retry.go:63
go.temporal.io/server/common/backoff.ThrottleRetryContext
/home/builder/temporal/common/backoff/retry.go:145
go.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept
/home/builder/temporal/common/rpc/interceptor/retry.go:67
google.golang.org/grpc.getChainUnaryHandler.func1
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept
/home/builder/temporal/common/rpc/interceptor/rate_limit.go:88
google.golang.org/grpc.getChainUnaryHandler.func1
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept
/home/builder/temporal/common/rpc/interceptor/telemetry.go:165
google.golang.org/grpc.getChainUnaryHandler.func1
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/metrics.NewServerMetricsTrailerPropagatorInterceptor.func1
/home/builder/temporal/common/metrics/grpc.go:113
google.golang.org/grpc.getChainUnaryHandler.func1
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1
/home/builder/temporal/common/metrics/grpc.go:66
google.golang.org/grpc.getChainUnaryHandler.func1
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.42.0/interceptor.go:344
google.golang.org/grpc.getChainUnaryHandler.func1
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/rpc.ServiceErrorInterceptor
/home/builder/temporal/common/rpc/grpc.go:145
google.golang.org/grpc.chainUnaryInterceptors.func1
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1186
go.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler
/home/builder/temporal/api/historyservice/v1/service.pb.go:1159
google.golang.org/grpc.(*Server).processUnaryRPC
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1376
google.golang.org/grpc.(*Server).handleStream
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1753
google.golang.org/grpc.(*Server).serveStreams.func1.1
/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:998
Further relevant info from the log entry:
store-operation: create-wf-execution
logging-call-at: transaction_impl.go:381
error: shard status unknown
Temporal version: 1.22.2
Is this a bug in Temporal? Is there any guidance on how we can prevent these errors in the future?