Temporal production deployment stopped working

Hello

We’ve been using Temporal in production for about a year or more now. Our Temporal server is deployed on a Kubernetes cluster on GCP, and we use a MySQL database and Elasticsearch.
Yesterday, out of nowhere, the history service started throwing the following errors:

{"error":"shard status unknown", "level":"error", "logging-call-at":"transaction_impl.go:432", "msg":"Persistent fetch operation Failure", "shard-id":27, "stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error /home/builder/temporal/common/log/zap_logger.go:143 go.temporal.io/server/service/history/workflow.getWorkflowExecution /home/builder/temporal/service/history/workflow/transaction_impl.go:432 go.temporal.io/server/service/history/workflow.(*ContextImpl).LoadMutableState /home/builder/temporal/service/history/workflow/context.go:270 go.temporal.io/server/service/history.LoadMutableStateForTask /home/builder/temporal/service/history/nDCTaskUtil.go:142 go.temporal.io/server/service/history.loadMutableStateForTimerTask /home/builder/temporal/service/history/nDCTaskUtil.go:123 go.temporal.io/server/service/history.(*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask /home/builder/temporal/service/history/timerQueueTaskExecutorBase.go:108 go.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).Execute /home/builder/temporal/service/history/timerQueueActiveTaskExecutor.go:118 go.temporal.io/server/service/history/queues.(*executorWrapper).Execute /home/builder/temporal/service/history/queues/executor_wrapper.go:67 go.temporal.io/server/service/history/queues.(*executableImpl).Execute /home/builder/temporal/service/history/queues/executable.go:201 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1 /home/builder/temporal/common/tasks/fifo_scheduler.go:225 go.temporal.io/server/common/backoff.ThrottleRetry.func1 /home/builder/temporal/common/backoff/retry.go:170 go.temporal.io/server/common/backoff.ThrottleRetryContext /home/builder/temporal/common/backoff/retry.go:194 go.temporal.io/server/common/backoff.ThrottleRetry /home/builder/temporal/common/backoff/retry.go:171 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask /home/builder/temporal/common/tasks/fifo_scheduler.go:235 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask /home/builder/temporal/common/tasks/fifo_scheduler.go:211", "store-operation":"get-wf-execution", "ts":"2022-12-22T01:54:40.974Z", "wf-id":"d2f7b5c0-0728-4dfa-9035-ac4b79b0c47c", "wf-namespace-id":"cabd9d59-c22b-4a6f-b9b4-2f1f37a365c8", "wf-run-id":"47c1354a-40a8-4635-8d60-ad4709c5e2c9"}

{"error":"context deadline exceeded", "level":"error", "logging-call-at":"transaction_impl.go:432", "msg":"Persistent fetch operation Failure", "shard-id":310, "stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error /home/builder/temporal/common/log/zap_logger.go:143 go.temporal.io/server/service/history/workflow.getWorkflowExecution /home/builder/temporal/service/history/workflow/transaction_impl.go:432 go.temporal.io/server/service/history/workflow.(*ContextImpl).LoadMutableState /home/builder/temporal/service/history/workflow/context.go:270 go.temporal.io/server/service/history.LoadMutableStateForTask /home/builder/temporal/service/history/nDCTaskUtil.go:142 go.temporal.io/server/service/history.loadMutableStateForTimerTask /home/builder/temporal/service/history/nDCTaskUtil.go:123 go.temporal.io/server/service/history.(*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask /home/builder/temporal/service/history/timerQueueTaskExecutorBase.go:108 go.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).Execute /home/builder/temporal/service/history/timerQueueActiveTaskExecutor.go:118 go.temporal.io/server/service/history/queues.(*executorWrapper).Execute /home/builder/temporal/service/history/queues/executor_wrapper.go:67 go.temporal.io/server/service/history/queues.(*executableImpl).Execute /home/builder/temporal/service/history/queues/executable.go:201 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1 /home/builder/temporal/common/tasks/fifo_scheduler.go:225 go.temporal.io/server/common/backoff.ThrottleRetry.func1 /home/builder/temporal/common/backoff/retry.go:170 go.temporal.io/server/common/backoff.ThrottleRetryContext /home/builder/temporal/common/backoff/retry.go:194 go.temporal.io/server/common/backoff.ThrottleRetry /home/builder/temporal/common/backoff/retry.go:171 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask /home/builder/temporal/common/tasks/fifo_scheduler.go:235 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask /home/builder/temporal/common/tasks/fifo_scheduler.go:211", "store-operation":"get-wf-execution", "ts":"2022-12-22T01:54:40.969Z", "wf-id":"aa875549-7ef7-3b43-a9e6-ec3dcc75696e", "wf-namespace-id":"cabd9d59-c22b-4a6f-b9b4-2f1f37a365c8", "wf-run-id":"6439e3c5-4c8c-4e81-be64-dd453b23378b"}

After that, none of the workflow workers were able to connect to the Temporal server, and they all started throwing the following error:

io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9.999808134s. at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262) at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243) at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156) at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.respondActivityTaskFailed(WorkflowServiceGrpc.java:2776) at io.temporal.internal.external.ManualActivityCompletionClientImpl.lambda$fail$1(ManualActivityCompletionClientImpl.java:173) at io.temporal.internal.common.GrpcRetryer.lambda$retry$0(GrpcRetryer.java:109) at io.temporal.internal.common.GrpcRetryer.retryWithResult(GrpcRetryer.java:127) at io.temporal.internal.common.GrpcRetryer.retry(GrpcRetryer.java:106) at io.temporal.internal.external.ManualActivityCompletionClientImpl.fail(ManualActivityCompletionClientImpl.java:167) at io.temporal.internal.sync.ActivityCompletionClientImpl.completeExceptionally(ActivityCompletionClientImpl.java:49) at com.yodawy.bridge.workflows.activities.GetOrder.lambda$getOrder$1(GetOrder.java:74) at io.vertx.core.impl.future.FutureImpl$2.onFailure(FutureImpl.java:117) at io.vertx.core.impl.future.FutureImpl$ListenerArray.onFailure(FutureImpl.java:268) at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75) at io.vertx.core.impl.future.FutureImpl.tryFail(FutureImpl.java:230) at io.vertx.core.impl.future.PromiseImpl.tryFail(PromiseImpl.java:23) at io.vertx.core.impl.future.PromiseImpl.onFailure(PromiseImpl.java:54) at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:43) at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:23) at io.vertx.ext.web.client.impl.HttpContext.handleFailure(HttpContext.java:367) at io.vertx.ext.web.client.impl.HttpContext.execute(HttpContext.java:361) at io.vertx.ext.web.client.impl.HttpContext.next(HttpContext.java:336) at io.vertx.ext.web.client.impl.HttpContext.fire(HttpContext.java:303) at io.vertx.ext.web.client.impl.HttpContext.fail(HttpContext.java:284) at io.vertx.ext.web.client.impl.HttpContext.lambda$handleCreateRequest$7(HttpContext.java:514) at io.vertx.core.impl.future.FutureImpl$3.onFailure(FutureImpl.java:153) at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75) at io.vertx.core.impl.future.FutureImpl.tryFail(FutureImpl.java:230) at io.vertx.core.impl.future.PromiseImpl.tryFail(PromiseImpl.java:23) at io.vertx.core.http.impl.HttpClientImpl.lambda$doRequest$8(HttpClientImpl.java:654) at io.vertx.core.net.impl.pool.Endpoint.lambda$getConnection$0(Endpoint.java:52) at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint$Request.lambda$null$0(SharedClientHttpStreamEndpoint.java:150) at io.vertx.core.net.impl.pool.SimpleConnectionPool$Cancel.run(SimpleConnectionPool.java:666) at io.vertx.core.net.impl.pool.CombinerExecutor.submit(CombinerExecutor.java:50) at io.vertx.core.net.impl.pool.SimpleConnectionPool.execute(SimpleConnectionPool.java:240) at io.vertx.core.net.impl.pool.SimpleConnectionPool.cancel(SimpleConnectionPool.java:629) at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint$Request.lambda$onConnect$1(SharedClientHttpStreamEndpoint.java:148) at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:893) at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:860) at io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:50) at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:274) at 
io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:22) at io.vertx.core.impl.AbstractContext.emit(AbstractContext.java:53) at io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:22) at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:883) at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748)

To try to understand where the problem was coming from, we made a new deployment with a new database, and that one worked just fine, even after pointing our workloads at it.

After reading more about the error, we understand that there is some issue with the shards table in the database, but we are not really sure what went wrong or whether there is any way to fix it.

I would appreciate any help I can get on this.

Thank you

What server version are you deploying?

Could you check whether any of your services are logging
“Persistence Max QPS Reached” errors? If so, depending on which service logs them, try increasing the corresponding setting in dynamic config (see the sketch below this list):

frontend.persistenceMaxQPS
history.persistenceMaxQPS
matching.persistenceMaxQPS
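
If it helps, entries in the file-based dynamic config for these keys look roughly like the sketch below (the values are purely illustrative, not recommendations; check what your cluster currently uses and raise gradually):

history.persistenceMaxQPS:
  - value: 9000        # illustrative value only; tune for what your MySQL instance can handle
    constraints: {}
frontend.persistenceMaxQPS:
  - value: 2000        # illustrative value only
    constraints: {}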

If you have server metrics enabled, you could also check for resource exhausted errors by cause:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
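
In case metrics are not enabled yet, they can usually be turned on via the Prometheus section of the server configuration (or the equivalent part of your Helm values). A minimal sketch, assuming the standard Prometheus listener settings; the address/port is just an example:

global:
  metrics:
    prometheus:
      timerType: histogram
      listenAddress: "0.0.0.0:8000"   # example endpoint for Prometheus to scrape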

When the issue happened we were running server version 1.18.0 with Elasticsearch v8.

I will take a look at the logs to see if anything is throwing the error you mentioned and get back to you on that.

I looked through our logs and I can’t find any errors that match “Persistence Max QPS Reached”, and we didn’t have server metrics enabled, so I can’t look for resource exhaustion.

@tihomir what does frontend.persistenceMaxQPS do? Does the frontend service really hit the DB directly? Aren’t all DB calls made via matching/history?

@Dhiraj_Bhakta yes, the frontend can read from persistence directly. For example, it reads visibility data (from the DB if standard visibility is configured) without having to go through the other services, and it also reads event histories for your SDK workers when they request them.

Sorry to sidetrack the main discussion…
But is horizontally scaling the frontend service a solution to mitigate “resource exhaustion on concurrent limits” (increasing pods from say 3->5, without changing any frontend config)?

The other option being to increase the default frontend.persistenceMaxQPS?

Yeah, it would be better to create a new post for these questions that are not related to the original post, thanks.

For the question though, it depends on the resource exhausted cause. For QPS limit increases you would look at the SystemOverloaded cause; for RPS limits, the RpsLimit cause.
I think it’s not recommended to just blindly add more frontend pods or increase QPS dynamic config values, as that can end up overloading your persistence store and lead to outages.
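
To see which of those causes you are actually hitting, you can narrow the earlier query down by cause, for example (same metric and labels as the query above; adjust if your metric names are prefixed differently):

sum(rate(service_errors_resource_exhausted{resource_exhausted_cause=~"SystemOverloaded|RpsLimit"}[1m])) by (resource_exhausted_cause)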

The frontend service is for the most part CPU-heavy, so watching CPU utilization can also be useful.