DEADLINE_EXCEEDED: deadline exceeded after 9.999933037s

We are experiencing an issue affecting the availability of our Temporal cluster. The error above shows up in the Temporal frontend pods.
The database looks OK, though somewhat overloaded while the issue is occurring.
The UI lists the workflows, but clicking on any of them shows "504 Gateway Time-out."
Restarting the whole cluster did not help.
The setup uses the helm chart.
Some workflows are stuck in a running state.

@maxim @tihomir, could you help, please?

The command tctl cluster health returns:
temporal.api.workflowservice.v1.WorkflowService: SERVING

{"level":"error","ts":"2022-05-23T12:59:57.280Z","msg":"Unable to call matching.PollWorkflowTaskQueue.","service":"frontend","wf-task-queue-name":"temporal-archival-tq","timeout":"1m9.999572483s","error":"context deadline exceeded","logging-call-at":"workflowHandler.go:812",
"stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error
    /temporal/common/log/zap_logger.go:142
go.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue
    /temporal/service/frontend/workflowHandler.go:812
go.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue.func2
    /temporal/service/frontend/dcRedirectionHandler.go:532
go.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect
    /temporal/service/frontend/dcRedirectionPolicy.go:118
go.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue
    /temporal/service/frontend/dcRedirectionHandler.go:528
go.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1
    /go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1088
go.temporal.io/server/common/authorization.(*interceptor).Interceptor
    /temporal/common/authorization/interceptor.go:152
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1113
go.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept
    /temporal/common/rpc/interceptor/namespace_count_limit.go:89
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept
    /temporal/common/rpc/interceptor/namespace_rate_limit.go:87
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept
    /temporal/common/rpc/interceptor/rate_limit.go:83
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept
    /temporal/common/rpc/interceptor/telemetry.go:108
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1
    /temporal/common/metrics/grpc.go:66
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc.ServiceErrorInterceptor
    /temporal/common/rpc/grpc.go:131
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept
    /temporal/common/rpc/interceptor/namespace_logger.go:83
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
google.golang.org/grpc.chainUnaryInterceptors.func1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1118
go.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler
    /go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1090
google.golang.org/grpc.(*Server).processUnaryRPC
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1279
google.golang.org/grpc.(*Server).handleStream
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1608
google.golang.org/grpc.(*Server).serveStreams.func1.2
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:923"}

@wes.dawoud

PollWorkflowTaskQueue timeout errors are typically benign and can be ignored. They happen when there are no tasks on the task queue that workers (which can include system workers) are polling, and the frontend service times out the long-poll request.
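If this long-poll timeout noise is drowning out real problems, one option is to filter it out when scanning the frontend logs. A minimal sketch (the helper name is mine, not a Temporal tool; it reads log lines on stdin):

```shell
# Hypothetical helper: drop the (usually benign) PollWorkflowTaskQueue
# timeout noise from frontend logs read on stdin, keeping other errors.
filter_poll_noise() {
  grep '"level":"error"' | grep -v 'PollWorkflowTaskQueue'
}
```

Usage might be: kubectl logs <frontend-pod> | filter_poll_noise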

Can you check whether you are getting any errors in the logs for your matching/history services and your database?

The command tctl cluster health gets:
temporal.api.workflowservice.v1.WorkflowService: SERVING

This command performs a health check for the frontend service only, which appears to be up.

For the matching and history services you could use grpc-health-probe, for example:

./grpc-health-probe -addr=addrs:port -service=temporal.api.workflowservice.v1.MatchingService
./grpc-health-probe -addr=addrs:port -service=temporal.api.workflowservice.v1.HistoryService

"504 Gateway Time-out."

Do you have an ingress controller in front of the frontend service? Did you set up TLS? It might be worth checking the logs there as well.

Thanks @tihomir for the reply!
The volume of error logs is far too high to ignore. Additionally, the whole cluster ends up in a non-operational state (i.e., multiple timeouts, and many workflows stuck in the "Running" state for a very long time).

The screenshot shows the error logs from the history pods before, during, and after the issue. The issue was resolved by wiping out the schema, along with all the stuck workflows, and creating a new empty one.

The errors mostly come from the history pods, and they look like the following:

{"level":"error","ts":"2022-05-23T06:59:59.952Z","msg":"Operation failed with internal error.","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","metric-scope":15,"logging-call-at":"persistenceMetricClients.go:1229","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:1229\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetTransferTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:349\ngo.temporal.io/server/service/history.(*transferQueueProcessorBase).readTasks\n\t/temporal/service/history/transferQueueProcessorBase.go:76\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:107\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:111\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:263\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:234"}

and

{"level":"error","ts":"2022-05-23T06:59:59.950Z","msg":"Operation failed with internal error.","error":"GetWorkflowExecution: failed. Error: context deadline exceeded","metric-scope":5,"logging-call-at":"persistenceMetricClients.go:1229","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:1229\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetWorkflowExecution\n\t/temporal/common/persistence/persistenceMetricClients.go:223\ngo.temporal.io/server/service/history/workflow.getWorkflowExecutionWithRetry.func1\n\t/temporal/service/history/workflow/transaction_impl.go:401\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history/workflow.getWorkflowExecutionWithRetry\n\t/temporal/service/history/workflow/transaction_impl.go:406\ngo.temporal.io/server/service/history/workflow.(*ContextImpl).LoadWorkflowExecution\n\t/temporal/service/history/workflow/context.go:338\ngo.temporal.io/server/service/history.loadMutableStateForTimerTask\n\t/temporal/service/history/nDCTaskUtil.go:132\ngo.temporal.io/server/service/history.(*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask\n\t/temporal/service/history/timerQueueTaskExecutorBase.go:101\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).execute\n\t/temporal/service/history/timerQueueActiveTaskExecutor.go:109\ngo.temporal.io/server/service/history.(*timerQueueActiveProcessorImpl).process\n\t/temporal/service/history/timerQueueActiveProcessor.go:307\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:269\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:221\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}

and

{"level":"error","ts":"2022-05-23T06:59:59.950Z","msg":"Critical error processing task, retrying.","service":"history","shard-id":2519,"address":"10.1.55.133:7234","shard-item":"0xc00f244500","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":2519,"queue-task-id":1049736,"queue-task-visibility-timestamp":"2022-05-22T18:04:13.197Z","xdc-failover-version":0,"queue-task-type":"DeleteHistoryEvent","wf-namespace-id":"9c456370-c018-4201-a73a-7088021c60b9","wf-id":"CustomerProfilexxxx:d1233d09-1f36-4179-8d80-ff5e6d44a644:2022-04-22T14:04:08.211-04:00","wf-run-id":"8563a26a-2091-42cf-b89b-3b0cf09549bb","error":"GetWorkflowExecution: failed. Error: context deadline exceeded","operation-result":"OperationCritical","queue-task-type":"DeleteHistoryEvent","logging-call-at":"taskProcessor.go:227","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:227\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}

and

{"level":"error","ts":"2022-05-23T06:59:59.950Z","msg":"Error updating ack level for shard","service":"history","shard-id":1186,"address":"10.1.55.133:7234","shard-item":"0xc025d9f800","component":"visibility-queue-processor","error":"UpdateShard failed. Failed to start transaction. Error: context deadline exceeded","operation-result":"OperationFailed","logging-call-at":"queueAckMgr.go:224","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).updateQueueAckLevel\n\t/temporal/service/history/queueAckMgr.go:224\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:241"}
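At this volume, one way to triage is to aggregate the structured logs by message so the dominant failure modes stand out. A minimal sketch (the helper name is mine; it assumes the pod logs were saved to a file, one JSON object per line as in the excerpts above):

```shell
# Hypothetical helper: count distinct "msg" values in a file of JSON logs
# (one object per line) so the most frequent failures sort to the top.
count_error_msgs() {
  grep -o '"msg":"[^"]*"' "$1" | sort | uniq -c | sort -rn
}
```

Usage might be: kubectl logs <history-pod> > history.log, then count_error_msgs history.log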

The database was reachable from the pods the whole time, and there were no notable database logs other than more frequent opening and termination of connections. The database CPU showed a spike during the issue.

The issue was resolved after clearing the database, without any changes to the cluster setup. So I think it was more about the cluster's inability to serve any requests during the incident.

It may be worth mentioning that we are on the following versions:
server:1.13.0
admin-tools:1.13.0
web:1.12.0