Workflow stuck at `WorkflowTaskScheduled` due to `Persistent store operation Failure` on the Temporal history service

During periods of high load, we are seeing some workflows stuck at `WorkflowTaskScheduled` for multiple days, despite having active workers polling the task queues.

Looking at our Datadog logs for the workflow, we see this error:

Persistent store operation Failure

with the following trace:

go.temporal.io/server/common/log.(*zapLogger).Error
	/home/builder/temporal/common/log/zap_logger.go:156
go.temporal.io/server/service/history/workflow.createWorkflowExecution
	/home/builder/temporal/service/history/workflow/transaction_impl.go:381
go.temporal.io/server/service/history/workflow.(*ContextImpl).CreateWorkflowExecution
	/home/builder/temporal/service/history/workflow/context.go:334
go.temporal.io/server/service/history/api/startworkflow.(*Starter).createBrandNew
	/home/builder/temporal/service/history/api/startworkflow/api.go:250
go.temporal.io/server/service/history/api/startworkflow.(*Starter).Invoke
	/home/builder/temporal/service/history/api/startworkflow/api.go:178
go.temporal.io/server/service/history.(*historyEngineImpl).StartWorkflowExecution
	/home/builder/temporal/service/history/history_engine.go:358
go.temporal.io/server/service/history.(*Handler).StartWorkflowExecution
	/home/builder/temporal/service/history/handler.go:588
go.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler.func1
	/home/builder/temporal/api/historyservice/v1/service.pb.go:1157
go.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept.func1
	/home/builder/temporal/common/rpc/interceptor/retry.go:63
go.temporal.io/server/common/backoff.ThrottleRetryContext
	/home/builder/temporal/common/backoff/retry.go:145
go.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept
	/home/builder/temporal/common/rpc/interceptor/retry.go:67
google.golang.org/grpc.getChainUnaryHandler.func1
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept
	/home/builder/temporal/common/rpc/interceptor/rate_limit.go:88
google.golang.org/grpc.getChainUnaryHandler.func1
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept
	/home/builder/temporal/common/rpc/interceptor/telemetry.go:165
google.golang.org/grpc.getChainUnaryHandler.func1
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/metrics.NewServerMetricsTrailerPropagatorInterceptor.func1
	/home/builder/temporal/common/metrics/grpc.go:113
google.golang.org/grpc.getChainUnaryHandler.func1
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1
	/home/builder/temporal/common/metrics/grpc.go:66
google.golang.org/grpc.getChainUnaryHandler.func1
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1
	/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.42.0/interceptor.go:344
google.golang.org/grpc.getChainUnaryHandler.func1
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1195
go.temporal.io/server/common/rpc.ServiceErrorInterceptor
	/home/builder/temporal/common/rpc/grpc.go:145
google.golang.org/grpc.chainUnaryInterceptors.func1
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1186
go.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler
	/home/builder/temporal/api/historyservice/v1/service.pb.go:1159
google.golang.org/grpc.(*Server).processUnaryRPC
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1376
google.golang.org/grpc.(*Server).handleStream
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:1753
google.golang.org/grpc.(*Server).serveStreams.func1.1
	/go/pkg/mod/google.golang.org/grpc@v1.58.2/server.go:998

Further relevant info:

store-operation: create-wf-execution
logging-call-at: transaction_impl.go:381
error: shard status unknown

Temporal version: 1.22.2

Is this a bug in Temporal? Any guidance on how we can prevent these errors in the future?

> error: shard status unknown

This typically means the shard is not operational: the history service received timeouts from the database, and even though the call was retried a number of times, it kept hitting database timeouts.

Do you see any errors on the database side? On the server metrics side, can you look at persistence latencies and failures (metrics `persistence_latency` and `persistence_error_with_type`)?

We do see errors on the database side. We have tuned some of our RPS limits to keep the database from getting hammered.
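For reference, this kind of tuning is typically done through the server's dynamic config persistence QPS limits. A minimal sketch, assuming the standard dynamic config YAML format; the values below are illustrative only, not a recommendation:

```yaml
# Per-service caps on persistence calls - illustrative values only,
# tune them against what your database can actually sustain.
frontend.persistenceMaxQPS:
  - value: 2000
history.persistenceMaxQPS:
  - value: 3000
matching.persistenceMaxQPS:
  - value: 2000
```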

Outside of this, do you have any suggestions on how we can ensure that workflows eventually get persisted, no matter how many retries against the database are needed?
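To illustrate what we mean by retries, here is a rough caller-side sketch using the Go SDK; the workflow name, task queue, and backoff are placeholders rather than our actual code, and this sits on top of whatever retrying the SDK and service already do internally. With a stable workflow ID, a retried start that races with an earlier successful attempt is deduplicated by the server instead of creating a duplicate execution:

```go
package starter

import (
	"context"
	"errors"
	"log"
	"time"

	"go.temporal.io/api/serviceerror"
	"go.temporal.io/sdk/client"
)

// startWithRetry keeps retrying ExecuteWorkflow until it succeeds or runs out of attempts.
// Because the WorkflowID is stable, a retry that follows a start that actually reached the
// database surfaces as WorkflowExecutionAlreadyStarted and is treated as success.
func startWithRetry(ctx context.Context, c client.Client, orderID string) error {
	opts := client.StartWorkflowOptions{
		ID:        "order-" + orderID, // stable ID so retries are idempotent
		TaskQueue: "orders",           // placeholder task queue name
	}

	var lastErr error
	for attempt := 1; attempt <= 10; attempt++ {
		_, err := c.ExecuteWorkflow(ctx, opts, "OrderWorkflow", orderID) // placeholder workflow name
		if err == nil {
			return nil
		}

		var alreadyStarted *serviceerror.WorkflowExecutionAlreadyStarted
		if errors.As(err, &alreadyStarted) {
			// A previous attempt already created the execution; nothing more to do.
			return nil
		}

		lastErr = err
		log.Printf("start attempt %d failed: %v", attempt, err)
		time.Sleep(time.Duration(attempt) * time.Second) // simple linear backoff; jitter omitted
	}
	return lastErr
}
```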