Hello
We’ve been using Temporal in production for over a year now. Our Temporal server is deployed on a Kubernetes cluster on GCP, and we use MySQL as the database and Elasticsearch for visibility.
Yesterday, out of nowhere, the history node started throwing the following errors:
{"error":"shard status unknown", "level":"error", "logging-call-at":"transaction_impl.go:432", "msg":"Persistent fetch operation Failure", "shard-id":27, "stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error /home/builder/temporal/common/log/zap_logger.go:143 go.temporal.io/server/service/history/workflow.getWorkflowExecution /home/builder/temporal/service/history/workflow/transaction_impl.go:432 go.temporal.io/server/service/history/workflow.(*ContextImpl).LoadMutableState /home/builder/temporal/service/history/workflow/context.go:270 go.temporal.io/server/service/history.LoadMutableStateForTask /home/builder/temporal/service/history/nDCTaskUtil.go:142 go.temporal.io/server/service/history.loadMutableStateForTimerTask /home/builder/temporal/service/history/nDCTaskUtil.go:123 go.temporal.io/server/service/history.(*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask /home/builder/temporal/service/history/timerQueueTaskExecutorBase.go:108 go.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).Execute /home/builder/temporal/service/history/timerQueueActiveTaskExecutor.go:118 go.temporal.io/server/service/history/queues.(*executorWrapper).Execute /home/builder/temporal/service/history/queues/executor_wrapper.go:67 go.temporal.io/server/service/history/queues.(*executableImpl).Execute /home/builder/temporal/service/history/queues/executable.go:201 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1 /home/builder/temporal/common/tasks/fifo_scheduler.go:225 go.temporal.io/server/common/backoff.ThrottleRetry.func1 /home/builder/temporal/common/backoff/retry.go:170 go.temporal.io/server/common/backoff.ThrottleRetryContext /home/builder/temporal/common/backoff/retry.go:194 go.temporal.io/server/common/backoff.ThrottleRetry /home/builder/temporal/common/backoff/retry.go:171 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask /home/builder/temporal/common/tasks/fifo_scheduler.go:235 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask /home/builder/temporal/common/tasks/fifo_scheduler.go:211", "store-operation":"get-wf-execution", "ts":"2022-12-22T01:54:40.974Z", "wf-id":"d2f7b5c0-0728-4dfa-9035-ac4b79b0c47c", "wf-namespace-id":"cabd9d59-c22b-4a6f-b9b4-2f1f37a365c8", "wf-run-id":"47c1354a-40a8-4635-8d60-ad4709c5e2c9"}
{"error":"context deadline exceeded", "level":"error", "logging-call-at":"transaction_impl.go:432", "msg":"Persistent fetch operation Failure", "shard-id":310, "stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error /home/builder/temporal/common/log/zap_logger.go:143 go.temporal.io/server/service/history/workflow.getWorkflowExecution /home/builder/temporal/service/history/workflow/transaction_impl.go:432 go.temporal.io/server/service/history/workflow.(*ContextImpl).LoadMutableState /home/builder/temporal/service/history/workflow/context.go:270 go.temporal.io/server/service/history.LoadMutableStateForTask /home/builder/temporal/service/history/nDCTaskUtil.go:142 go.temporal.io/server/service/history.loadMutableStateForTimerTask /home/builder/temporal/service/history/nDCTaskUtil.go:123 go.temporal.io/server/service/history.(*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask /home/builder/temporal/service/history/timerQueueTaskExecutorBase.go:108 go.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).Execute /home/builder/temporal/service/history/timerQueueActiveTaskExecutor.go:118 go.temporal.io/server/service/history/queues.(*executorWrapper).Execute /home/builder/temporal/service/history/queues/executor_wrapper.go:67 go.temporal.io/server/service/history/queues.(*executableImpl).Execute /home/builder/temporal/service/history/queues/executable.go:201 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1 /home/builder/temporal/common/tasks/fifo_scheduler.go:225 go.temporal.io/server/common/backoff.ThrottleRetry.func1 /home/builder/temporal/common/backoff/retry.go:170 go.temporal.io/server/common/backoff.ThrottleRetryContext /home/builder/temporal/common/backoff/retry.go:194 go.temporal.io/server/common/backoff.ThrottleRetry /home/builder/temporal/common/backoff/retry.go:171 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask /home/builder/temporal/common/tasks/fifo_scheduler.go:235 go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask /home/builder/temporal/common/tasks/fifo_scheduler.go:211", "store-operation":"get-wf-execution", "ts":"2022-12-22T01:54:40.969Z", "wf-id":"aa875549-7ef7-3b43-a9e6-ec3dcc75696e", "wf-namespace-id":"cabd9d59-c22b-4a6f-b9b4-2f1f37a365c8", "wf-run-id":"6439e3c5-4c8c-4e81-be64-dd453b23378b"}
After that, none of our workflow workers could reach the Temporal server, and they all started throwing the following error:
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9.999808134s.
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.respondActivityTaskFailed(WorkflowServiceGrpc.java:2776)
    at io.temporal.internal.external.ManualActivityCompletionClientImpl.lambda$fail$1(ManualActivityCompletionClientImpl.java:173)
    at io.temporal.internal.common.GrpcRetryer.lambda$retry$0(GrpcRetryer.java:109)
    at io.temporal.internal.common.GrpcRetryer.retryWithResult(GrpcRetryer.java:127)
    at io.temporal.internal.common.GrpcRetryer.retry(GrpcRetryer.java:106)
    at io.temporal.internal.external.ManualActivityCompletionClientImpl.fail(ManualActivityCompletionClientImpl.java:167)
    at io.temporal.internal.sync.ActivityCompletionClientImpl.completeExceptionally(ActivityCompletionClientImpl.java:49)
    at com.yodawy.bridge.workflows.activities.GetOrder.lambda$getOrder$1(GetOrder.java:74)
    at io.vertx.core.impl.future.FutureImpl$2.onFailure(FutureImpl.java:117)
    at io.vertx.core.impl.future.FutureImpl$ListenerArray.onFailure(FutureImpl.java:268)
    at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75)
    at io.vertx.core.impl.future.FutureImpl.tryFail(FutureImpl.java:230)
    at io.vertx.core.impl.future.PromiseImpl.tryFail(PromiseImpl.java:23)
    at io.vertx.core.impl.future.PromiseImpl.onFailure(PromiseImpl.java:54)
    at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:43)
    at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:23)
    at io.vertx.ext.web.client.impl.HttpContext.handleFailure(HttpContext.java:367)
    at io.vertx.ext.web.client.impl.HttpContext.execute(HttpContext.java:361)
    at io.vertx.ext.web.client.impl.HttpContext.next(HttpContext.java:336)
    at io.vertx.ext.web.client.impl.HttpContext.fire(HttpContext.java:303)
    at io.vertx.ext.web.client.impl.HttpContext.fail(HttpContext.java:284)
    at io.vertx.ext.web.client.impl.HttpContext.lambda$handleCreateRequest$7(HttpContext.java:514)
    at io.vertx.core.impl.future.FutureImpl$3.onFailure(FutureImpl.java:153)
    at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75)
    at io.vertx.core.impl.future.FutureImpl.tryFail(FutureImpl.java:230)
    at io.vertx.core.impl.future.PromiseImpl.tryFail(PromiseImpl.java:23)
    at io.vertx.core.http.impl.HttpClientImpl.lambda$doRequest$8(HttpClientImpl.java:654)
    at io.vertx.core.net.impl.pool.Endpoint.lambda$getConnection$0(Endpoint.java:52)
    at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint$Request.lambda$null$0(SharedClientHttpStreamEndpoint.java:150)
    at io.vertx.core.net.impl.pool.SimpleConnectionPool$Cancel.run(SimpleConnectionPool.java:666)
    at io.vertx.core.net.impl.pool.CombinerExecutor.submit(CombinerExecutor.java:50)
    at io.vertx.core.net.impl.pool.SimpleConnectionPool.execute(SimpleConnectionPool.java:240)
    at io.vertx.core.net.impl.pool.SimpleConnectionPool.cancel(SimpleConnectionPool.java:629)
    at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint$Request.lambda$onConnect$1(SharedClientHttpStreamEndpoint.java:148)
    at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:893)
    at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:860)
    at io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:50)
    at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:274)
    at io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:22)
    at io.vertx.core.impl.AbstractContext.emit(AbstractContext.java:53)
    at io.vertx.core.impl.EventLoopContext.emit(EventLoopContext.java:22)
    at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:883)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
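To separate "the workers can't reach the frontend" from "the frontend can't reach its shards", a cheap read-only RPC against the service is useful. Below is a rough sketch in Java (the SDK our workers use); TemporalPing, the target address, and the namespace name are placeholders, and newServiceStubs assumes a recent Java SDK (older versions expose WorkflowServiceStubs.newInstance instead):

    import io.temporal.api.workflowservice.v1.DescribeNamespaceRequest;
    import io.temporal.api.workflowservice.v1.DescribeNamespaceResponse;
    import io.temporal.serviceclient.WorkflowServiceStubs;
    import io.temporal.serviceclient.WorkflowServiceStubsOptions;

    public class TemporalPing {
      public static void main(String[] args) {
        // Placeholder frontend address; replace with the real service endpoint.
        WorkflowServiceStubs service = WorkflowServiceStubs.newServiceStubs(
            WorkflowServiceStubsOptions.newBuilder()
                .setTarget("temporal-frontend:7233")
                .build());
        try {
          // A cheap read-only call: if this also hits DEADLINE_EXCEEDED, the
          // frontend (or the history shards behind it) is the bottleneck,
          // not our worker code.
          DescribeNamespaceResponse resp = service.blockingStub()
              .describeNamespace(DescribeNamespaceRequest.newBuilder()
                  .setNamespace("default")
                  .build());
          System.out.println("Frontend reachable; namespace state: "
              + resp.getNamespaceInfo().getState());
        } finally {
          service.shutdownNow();
        }
      }
    }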
To figure out where the problem was coming from, we made a new deployment with a fresh database, and that one worked just fine, even after pointing our workloads at it.
After reading more about the error, we understand there is some issue with the shards table in the database, but we are not sure what went wrong or whether there is any way to fix it.
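In case it helps, the shards table can be inspected with something like the query below; we are assuming the default Temporal MySQL schema here (shard_id, range_id, data, data_encoding columns), which may differ between schema versions:

    -- Look at shard ownership metadata for the shards from the error logs.
    -- Missing rows or odd range_id values would point at the shards table.
    SELECT shard_id, range_id, data_encoding
    FROM shards
    WHERE shard_id IN (27, 310);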
I'd appreciate any help I can get on this.
Thank you