Temporal production deployment stopped working

Hello

We’ve been using Temporal in production for about a year or more now. Our Temporal server is deployed on a Kubernetes cluster on GCP, and we use a MySQL database and Elasticsearch.
Yesterday, out of nowhere, the history service started throwing the following errors:

{"error":"shard status unknown", "level":"error", "logging-call-at":"transaction_impl.go:432", "msg":"Persistent fetch operation Failure", "shard-id":27, "stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error /home/builder/temporal/common/log/zap_logger.go:143 go.temporal.io/server/service/history/workflow.getWorkflowExecution /home/builder/temporal/service/history/workflow/transaction_impl.go:432 go.temporal.io/server/service/history/workflow.(*ContextImpl).LoadMutableState /home/builder/temporal/service/history/workflow/context.go:270 go.temporal.io/server/service/history.LoadMutableStateForTask /home/builder/temporal/service/history/nDCTaskUtil.go:142 go.temporal.io/server/service/history.loadMutableStateForTimerTask /home/builder/temporal/service/history/nDCTaskUtil.go:123 go.temporal.io/server/service/history.(*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask /home/builder/temporal/service/history/timerQueueTaskExecutorBase.go:108 go.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).Execute /home/builder/temporal/service/history/timerQueueActiveTaskExecutor.go:118 go.temporal.io/server/service/history/queues.(*executorWrapper).Execute /home/builder/temporal/service/history/queues/executor_wrapper.go:67 go.temporal.io/server/service/history/queues.(*executableImpl).Execute /home/builder/temporal/service/history/queues/executable.go:201 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1 /home/builder/temporal/common/tasks/fifo_scheduler.go:225 go.temporal.io/server/common/backoff.ThrottleRetry.func1 /home/builder/temporal/common/backoff/retry.go:170 go.temporal.io/server/common/backoff.ThrottleRetryContext /home/builder/temporal/common/backoff/retry.go:194 go.temporal.io/server/common/backoff.ThrottleRetry /home/builder/temporal/common/backoff/retry.go:171 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask /home/builder/temporal/common/tasks/fifo_scheduler.go:235 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask /home/builder/temporal/common/tasks/fifo_scheduler.go:211", "store-operation":"get-wf-execution", "ts":"2022-12-22T01:54:40.974Z", "wf-id":"d2f7b5c0-0728-4dfa-9035-ac4b79b0c47c", "wf-namespace-id":"cabd9d59-c22b-4a6f-b9b4-2f1f37a365c8", "wf-run-id":"47c1354a-40a8-4635-8d60-ad4709c5e2c9"}

{"error":"context deadline exceeded", "level":"error", "logging-call-at":"transaction_impl.go:432", "msg":"Persistent fetch operation Failure", "shard-id":310, "stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error /home/builder/temporal/common/log/zap_logger.go:143 go.temporal.io/server/service/history/workflow.getWorkflowExecution /home/builder/temporal/service/history/workflow/transaction_impl.go:432 go.temporal.io/server/service/history/workflow.(*ContextImpl).LoadMutableState /home/builder/temporal/service/history/workflow/context.go:270 go.temporal.io/server/service/history.LoadMutableStateForTask /home/builder/temporal/service/history/nDCTaskUtil.go:142 go.temporal.io/server/service/history.loadMutableStateForTimerTask /home/builder/temporal/service/history/nDCTaskUtil.go:123 go.temporal.io/server/service/history.(*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask /home/builder/temporal/service/history/timerQueueTaskExecutorBase.go:108 go.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).Execute /home/builder/temporal/service/history/timerQueueActiveTaskExecutor.go:118 go.temporal.io/server/service/history/queues.(*executorWrapper).Execute /home/builder/temporal/service/history/queues/executor_wrapper.go:67 go.temporal.io/server/service/history/queues.(*executableImpl).Execute /home/builder/temporal/service/history/queues/executable.go:201 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1 /home/builder/temporal/common/tasks/fifo_scheduler.go:225 go.temporal.io/server/common/backoff.ThrottleRetry.func1 /home/builder/temporal/common/backoff/retry.go:170 go.temporal.io/server/common/backoff.ThrottleRetryContext /home/builder/temporal/common/backoff/retry.go:194 go.temporal.io/server/common/backoff.ThrottleRetry /home/builder/temporal/common/backoff/retry.go:171 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask /home/builder/temporal/common/tasks/fifo_scheduler.go:235 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask /home/builder/temporal/common/tasks/fifo_scheduler.go:211", "store-operation":"get-wf-execution", "ts":"2022-12-22T01:54:40.969Z", "wf-id":"aa875549-7ef7-3b43-a9e6-ec3dcc75696e", "wf-namespace-id":"cabd9d59-c22b-4a6f-b9b4-2f1f37a365c8", "wf-run-id":"6439e3c5-4c8c-4e81-be64-dd453b23378b"}

After that, none of the workflow workers were able to connect to the Temporal server, and they all started throwing the following error:

io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9.999808134s. at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262) at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243) at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156) at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.respondActivityTaskFailed(WorkflowServiceGrpc.java:2776) at io.temporal.internal.external.ManualActivityCompletionClientImpl.lambda$fail$1(ManualActivityCompletionClientImpl.java:173) at io.temporal.internal.common.GrpcRetryer.lambda$retry$0(GrpcRetryer.java:109) at io.temporal.internal.common.GrpcRetryer.retryWithResult(GrpcRetryer.java:127) at io.temporal.internal.common.GrpcRetryer.retry(GrpcRetryer.java:106) at io.temporal.internal.external.ManualActivityCompletionClientImpl.fail(ManualActivityCompletionClientImpl.java:167) at io.temporal.internal.sync.ActivityCompletionClientImpl.completeExceptionally(ActivityCompletionClientImpl.java:49) at com.yodawy.bridge.workflows.activities.GetOrder.lambda$getOrder$1(GetOrder.java:74) at io.vertx.core.impl.future.FutureImpl$2.onFailure(FutureImpl.java:117) at io.vertx.core.impl.future.FutureImpl$ListenerArray.onFailure(FutureImpl.java:268) at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75) at io.vertx.core.impl.future.FutureImpl.tryFail(FutureImpl.java:230) at io.vertx.core.impl.future.PromiseImpl.tryFail(PromiseImpl.java:23) at io.vertx.core.impl.future.PromiseImpl.onFailure(PromiseImpl.java:54) at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:43) at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:23) at io.vertx.ext.web.client.impl.HttpContext.handleFailure(HttpContext.java:367) at io.vertx.ext.web.client.impl.HttpContext.execute(HttpContext.java:361) at io.vertx.ext.web.client.impl.HttpContext.next(HttpContext.java:336) at io.vertx.ext.web.client.impl.HttpContext.fire(HttpContext.java:303) at io.vertx.ext.web.client.impl.HttpContext.fail(HttpContext.java:284) at io.vertx.ext.web.client.impl.HttpContext.lambda$handleCreateRequest$7(HttpContext.java:514) at io.vertx.core.impl.future.FutureImpl$3.onFailure(FutureImpl.java:153) at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75) at io.vertx.core.impl.future.FutureImpl.tryFail(FutureImpl.java:230) at io.vertx.core.impl.future.PromiseImpl.tryFail(PromiseImpl.java:23) at io.vertx.core.http.impl.HttpClientImpl.lambda$doRequest$8(HttpClientImpl.java:654) at io.vertx.core.net.impl.pool.Endpoint.lambda$getConnection$0(Endpoint.java:52) at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint$Request.lambda$null$0(SharedClientHttpStreamEndpoint.java:150) at io.vertx.core.net.impl.pool.SimpleConnectionPool$Cancel.run(SimpleConnectionPool.java:666) at io.vertx.core.net.impl.pool.CombinerExecutor.submit(CombinerExecutor.java:50) at io.vertx.core.net.impl.pool.SimpleConnectionPool.execute(SimpleConnectionPool.java:240) at io.vertx.core.net.impl.pool.SimpleConnectionPool.cancel(SimpleConnectionPool.java:629) at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint$Request.lambda$onConnect$1(SharedClientHttpStreamEndpoint.java:148) at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:893) at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:860) at io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:50) at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:274) at 
io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:22) at io.vertx.core.impl.AbstractContext.emit(AbstractContext.java:53) at io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:22) at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:883) at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748)

To try to understand where the problem was coming from, we made a new deployment with a new database, and that one worked just fine, even after pointing our workloads at it.

After reading more about the error, we understand that there is some issue with the shards table in the database, but we are not really sure what went wrong or whether there is any way to fix it.

I would appreciate any help I can get on this.

Thank you

What server version are you deploying?

Could you check whether any of your services are logging
“Persistence Max QPS Reached” errors? If so, depending on which service logs them, try increasing the corresponding setting in dynamic config (see the sketch below this list):

frontend.persistenceMaxQPS
history.persistenceMaxQPS
matching.persistenceMaxQPS
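
If it helps, entries in the file-based dynamic config for these keys look roughly like the sketch below (the values are purely illustrative, not recommendations; check what your cluster currently uses and raise gradually):

history.persistenceMaxQPS:
  - value: 9000        # illustrative value only; tune for what your MySQL instance can handle
    constraints: {}
frontend.persistenceMaxQPS:
  - value: 2000        # illustrative value only
    constraints: {}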

If you have server metrics enabled, you could also check for resource exhausted errors by cause:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
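
In case metrics are not enabled yet, they can usually be turned on via the Prometheus section of the server configuration (or the equivalent part of your Helm values). A minimal sketch, assuming the standard Prometheus listener settings; the address/port is just an example:

global:
  metrics:
    prometheus:
      timerType: histogram
      listenAddress: "0.0.0.0:8000"   # example endpoint for Prometheus to scrape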

When the issue happened we were running server version 1.18.0 with Elasticsearch v8.

I will take a look at the logs to see if anything is throwing the error you mentioned and get back to you on that.

I looked through our logs and I can’t find any errors that match “Persistence Max QPS Reached”, and we didn’t have server metrics enabled, so I can’t look for resource exhaustion.

@tihomir what does frontend.persistenceMaxQPS do? Does the frontend service really hit the DB directly? Aren’t all DB calls made via matching/history?

@Dhiraj_Bhakta yes, the frontend can read from persistence directly. For example, it reads visibility data (from the DB if standard visibility is configured) without having to go through the other services, and it also reads event histories for your SDK workers when they request them.

Sorry to sidetrack the main discussion…
But is horizontally scaling the frontend service a solution to mitigate “resource exhaustion on concurrent limits” (increasing pods from say 3->5, without changing any frontend config)?

The other option being to increase the default frontend.persistenceMaxQPS?

Yeah, it would be better to create a new post for these questions that are not related to the original post, thanks.

For the question though, it depends on the resource exhausted cause. For QPS limit increases you would look at the SystemOverloaded cause; for RPS limits, the RpsLimit cause.
I think it’s not recommended to just blindly add more frontend pods or increase QPS dynamic config values, as that can end up overloading your persistence store and lead to outages.
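
To see which of those causes you are actually hitting, you can narrow the earlier query down by cause, for example (same metric and labels as the query above; adjust if your metric names are prefixed differently):

sum(rate(service_errors_resource_exhausted{resource_exhausted_cause=~"SystemOverloaded|RpsLimit"}[1m])) by (resource_exhausted_cause)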

The frontend service is for the most part CPU-heavy, so watching CPU utilization can also be useful.