DEADLINE_EXCEEDED: deadline exceeded after 9.999933037s

We are experiencing an issue that affects the availability of the Temporal cluster. The error shows up in the Temporal frontend pods.
The database looks OK, but it is a bit overloaded while the issue is happening.
The UI lists the workflows, but clicking on any of them shows “504 Gateway Time-out”.
Restarting the whole cluster did not help.
The setup uses the Helm chart.
Some workflows are stuck in the Running state.

@maxim @tihomir, could you help, please?

The command tctl cluster health returns:
temporal.api.workflowservice.v1.WorkflowService: SERVING

"level":"error","ts":"2022-05-23T12:59:57.280Z","msg":"Unable to call matching.PollWorkflowTaskQueue.","service":"frontend","wf-task-queue-name":"temporal-archival-tq","timeout":"1m9.999572483s","error":"context deadline exceeded","logging-call-at":"workflowHandler.go:812
"stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error
    /temporal/common/log/zap_logger.go:142
go.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue
    /temporal/service/frontend/workflowHandler.go:812
go.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue.func2
    /temporal/service/frontend/dcRedirectionHandler.go:532
go.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect
    /temporal/service/frontend/dcRedirectionPolicy.go:118
go.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue
    /temporal/service/frontend/dcRedirectionHandler.go:528
go.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1
    /go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1088
go.temporal.io/server/common/authorization.(*interceptor).Interceptor
    /temporal/common/authorization/interceptor.go:152
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1113
go.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept
    /temporal/common/rpc/interceptor/namespace_count_limit.go:89
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept
    /temporal/common/rpc/interceptor/namespace_rate_limit.go:87
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept
    /temporal/common/rpc/interceptor/rate_limit.go:83
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept
    /temporal/common/rpc/interceptor/telemetry.go:108
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1
    /temporal/common/metrics/grpc.go:66
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc.ServiceErrorInterceptor
    /temporal/common/rpc/grpc.go:131
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept
    /temporal/common/rpc/interceptor/namespace_logger.go:83
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
google.golang.org/grpc.chainUnaryInterceptors.func1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1118
go.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler
    /go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1090
google.golang.org/grpc.(*Server).processUnaryRPC
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1279
google.golang.org/grpc.(*Server).handleStream
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1608
google.golang.org/grpc.(*Server).serveStreams.func1.2
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:923"

@wes.dawoud

PollWorkflowTaskQueue timeout errors are typically benign and can be ignored. They happen when there are no tasks on the task queue that workers (which can include system workers) are polling, and the frontend service times out the long-poll request.

Can you check whether you are getting any errors in the logs for your matching/history services and your DB?
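
For example, a rough sketch assuming the official Helm chart's standard labels (the selector below is an assumption; adjust it and the namespace to your release):

kubectl logs -l app.kubernetes.io/name=temporal,app.kubernetes.io/component=history --tail=500 | grep '"level":"error"'
kubectl logs -l app.kubernetes.io/name=temporal,app.kubernetes.io/component=matching --tail=500 | grep '"level":"error"'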

The command tctl cluster health returns:
temporal.api.workflowservice.v1.WorkflowService: SERVING

This command performs a health check only for the frontend service, which appears to be up.

For the matching and history services you can use grpc-health-probe, for example:

./grpc-health-probe -addr=address:port -service=temporal.server.api.matchingservice.v1.MatchingService
./grpc-health-probe -addr=address:port -service=temporal.server.api.historyservice.v1.HistoryService
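
Assuming the default Helm chart ports (an assumption; check your values.yaml), history listens on 7234 and matching on 7235 (the frontend is 7233), so you can port-forward to those pods and probe localhost. A healthy service should respond with status: SERVING.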

“504 Gateway Time-out”

Do you have an ingress controller in front of the frontend service? Did you set up TLS? It might be worth checking the logs there as well.

Thanks @tihomir for the reply!
The volume of error logs is too large to ignore. Additionally, the whole cluster gets into a non-operational state (i.e., multiple timeouts and many workflows stuck in the “Running” state for a very long time).

The screenshot shows the error logs from the history pods before, during, and after the issue. The issue was resolved by wiping out the schema, along with all the stuck workflows, and creating a new, empty one.

The errors are mostly coming from the history pods, and they look like the following:

{"level":"error","ts":"2022-05-23T06:59:59.952Z","msg":"Operation failed with internal error.","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","metric-scope":15,"logging-call-at":"persistenceMetricClients.go:1229","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:1229\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetTransferTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:349\ngo.temporal.io/server/service/history.(*transferQueueProcessorBase).readTasks\n\t/temporal/service/history/transferQueueProcessorBase.go:76\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:107\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:111\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:263\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:234"}

and

{"level":"error","ts":"2022-05-23T06:59:59.950Z","msg":"Operation failed with internal error.","error":"GetWorkflowExecution: failed. Error: context deadline exceeded","metric-scope":5,"logging-call-at":"persistenceMetricClients.go:1229","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:1229\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetWorkflowExecution\n\t/temporal/common/persistence/persistenceMetricClients.go:223\ngo.temporal.io/server/service/history/workflow.getWorkflowExecutionWithRetry.func1\n\t/temporal/service/history/workflow/transaction_impl.go:401\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history/workflow.getWorkflowExecutionWithRetry\n\t/temporal/service/history/workflow/transaction_impl.go:406\ngo.temporal.io/server/service/history/workflow.(*ContextImpl).LoadWorkflowExecution\n\t/temporal/service/history/workflow/context.go:338\ngo.temporal.io/server/service/history.loadMutableStateForTimerTask\n\t/temporal/service/history/nDCTaskUtil.go:132\ngo.temporal.io/server/service/history.(*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask\n\t/temporal/service/history/timerQueueTaskExecutorBase.go:101\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).execute\n\t/temporal/service/history/timerQueueActiveTaskExecutor.go:109\ngo.temporal.io/server/service/history.(*timerQueueActiveProcessorImpl).process\n\t/temporal/service/history/timerQueueActiveProcessor.go:307\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:269\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:221\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}

and

{"level":"error","ts":"2022-05-23T06:59:59.950Z","msg":"Critical error processing task, retrying.","service":"history","shard-id":2519,"address":"10.1.55.133:7234","shard-item":"0xc00f244500","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":2519,"queue-task-id":1049736,"queue-task-visibility-timestamp":"2022-05-22T18:04:13.197Z","xdc-failover-version":0,"queue-task-type":"DeleteHistoryEvent","wf-namespace-id":"9c456370-c018-4201-a73a-7088021c60b9","wf-id":"CustomerProfilexxxx:d1233d09-1f36-4179-8d80-ff5e6d44a644:2022-04-22T14:04:08.211-04:00","wf-run-id":"8563a26a-2091-42cf-b89b-3b0cf09549bb","error":"GetWorkflowExecution: failed. Error: context deadline exceeded","operation-result":"OperationCritical","queue-task-type":"DeleteHistoryEvent","logging-call-at":"taskProcessor.go:227","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:227\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}

and

{"level":"error","ts":"2022-05-23T06:59:59.950Z","msg":"Error updating ack level for shard","service":"history","shard-id":1186,"address":"10.1.55.133:7234","shard-item":"0xc025d9f800","component":"visibility-queue-processor","error":"UpdateShard failed. Failed to start transaction. Error: context deadline exceeded","operation-result":"OperationFailed","logging-call-at":"queueAckMgr.go:224","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).updateQueueAckLevel\n\t/temporal/service/history/queueAckMgr.go:224\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:241"}

The database was reachable from the pods the whole time; there were no notable database logs other than more frequent opening and termination of connections. The CPU showed a spike during the issue.

The issue was resolved after clearing the database, without any changes to the cluster setup. So I think it was more related to the cluster's ability to serve any requests during the incident.

It could be worth mentioning that we are on the following versions:
server: 1.13.0
admin-tools: 1.13.0
web: 1.12.0

We have also been experiencing a similar issue when we query a workflow: it returns a “context deadline exceeded” error after approximately 10 seconds. It happens intermittently, but not all the time. A retry usually gets us the right result without an error.

Could the issue we are experiencing be due to the same root cause, @tihomir?
Would you consider the retry approach a good course of action to keep things moving?

Is this timeout behavior configurable somehow? Does it need to be?
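
For reference, here is a minimal sketch of the retry approach we have in mind, assuming a recent Go SDK (client.Dial; older SDKs use client.NewClient). The host/port, workflow ID, and query name ("current-state") are placeholders:

package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder host/port; point this at the frontend service.
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	for attempt := 1; attempt <= 3; attempt++ {
		// Give each query its own 10-second deadline, matching the timeout we observe.
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		val, err := c.QueryWorkflow(ctx, "my-workflow-id", "", "current-state")
		cancel()
		if err != nil {
			log.Printf("query attempt %d failed: %v", attempt, err)
			time.Sleep(time.Second)
			continue
		}
		var result string
		if err := val.Get(&result); err != nil {
			log.Fatalln("unable to decode query result:", err)
		}
		log.Println("query result:", result)
		return
	}
	log.Fatalln("query failed after retries")
}

Since queries are read-only, retrying them should be safe; it just papers over whatever is making the first attempt slow.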

@Jaison_N_John, could you please open a new topic for this question, as it seems it might be a bit different? In that topic, can you also add the full logs (look at the frontend service logs as well) and the server version used?
It would also help to mention whether the workflows you are querying run any local activities.

Can you please answer: when you mention clearing the DB, which tables are you referring to?