DEADLINE_EXCEEDED: deadline exceeded after 9.999933037s

We are experiencing an issue affecting the availability of our Temporal cluster. The error above shows up in the Temporal frontend pods.
The database looks OK, though somewhat overloaded while the issue is occurring.
The UI lists the workflows, but clicking on any of them shows "504 Gateway Time-out."
Restarting the whole cluster did not help.
The setup uses the helm chart.
Some workflows are stuck in a running state.

@maxim @tihomir, could you help, please?

The command tctl cluster health returns:
temporal.api.workflowservice.v1.WorkflowService: SERVING

{"level":"error","ts":"2022-05-23T12:59:57.280Z","msg":"Unable to call matching.PollWorkflowTaskQueue.","service":"frontend","wf-task-queue-name":"temporal-archival-tq","timeout":"1m9.999572483s","error":"context deadline exceeded","logging-call-at":"workflowHandler.go:812",
"stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error
    /temporal/common/log/zap_logger.go:142
go.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue
    /temporal/service/frontend/workflowHandler.go:812
go.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue.func2
    /temporal/service/frontend/dcRedirectionHandler.go:532
go.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect
    /temporal/service/frontend/dcRedirectionPolicy.go:118
go.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue
    /temporal/service/frontend/dcRedirectionHandler.go:528
go.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1
    /go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1088
go.temporal.io/server/common/authorization.(*interceptor).Interceptor
    /temporal/common/authorization/interceptor.go:152
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1113
go.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept
    /temporal/common/rpc/interceptor/namespace_count_limit.go:89
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept
    /temporal/common/rpc/interceptor/namespace_rate_limit.go:87
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept
    /temporal/common/rpc/interceptor/rate_limit.go:83
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept
    /temporal/common/rpc/interceptor/telemetry.go:108
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1
    /temporal/common/metrics/grpc.go:66
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc.ServiceErrorInterceptor
    /temporal/common/rpc/grpc.go:131
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept
    /temporal/common/rpc/interceptor/namespace_logger.go:83
google.golang.org/grpc.chainUnaryInterceptors.func1.1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
google.golang.org/grpc.chainUnaryInterceptors.func1
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1118
go.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler
    /go/pkg/mod/go.temporal.io/api@v1.5.1-0.20211018190919-a5f4a169cd08/workflowservice/v1/service.pb.go:1090
google.golang.org/grpc.(*Server).processUnaryRPC
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1279
google.golang.org/grpc.(*Server).handleStream
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1608
google.golang.org/grpc.(*Server).serveStreams.func1.2
    /go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:923"}

@wes.dawoud

PollWorkflowTaskQueue timeout errors are typically benign and can be ignored. They happen when there are no tasks on the task queue that workers (which can include system workers) are polling, and the frontend service times out the long-poll request.
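If this long-poll timeout noise is drowning out real problems, one option is to filter it out when scanning the frontend logs. A minimal sketch (the helper name is mine, not a Temporal tool; it reads log lines on stdin):

```shell
# Hypothetical helper: drop the (usually benign) PollWorkflowTaskQueue
# timeout noise from frontend logs read on stdin, keeping other errors.
filter_poll_noise() {
  grep '"level":"error"' | grep -v 'PollWorkflowTaskQueue'
}
```

Usage might be: kubectl logs <frontend-pod> | filter_poll_noise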

Can you check whether you are getting any errors in the logs for your matching/history services and your database?

The command tctl cluster health gets:
temporal.api.workflowservice.v1.WorkflowService: SERVING

This command performs a health check for the frontend service only, which appears to be up.

For the matching and history services you could use grpc-health-probe, for example:

./grpc-health-probe -addr=addrs:port -service=temporal.api.workflowservice.v1.MatchingService
./grpc-health-probe -addr=addrs:port -service=temporal.api.workflowservice.v1.HistoryService

"504 Gateway Time-out."

Do you have an ingress controller in front of the frontend service? Did you set up TLS? It might be worth checking the logs there as well.

Thanks @tihomir for the reply!
The volume of error logs is far too high to ignore. Additionally, the whole cluster ends up in a non-operational state (i.e., multiple timeouts, and many workflows stuck in the "Running" state for a very long time).

The screenshot shows the error logs from the history pods before, during, and after the issue. The issue was resolved by wiping out the schema, along with all the stuck workflows, and creating a new empty one.

The errors mostly come from the history pods, and they look like the following:

{"level":"error","ts":"2022-05-23T06:59:59.952Z","msg":"Operation failed with internal error.","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","metric-scope":15,"logging-call-at":"persistenceMetricClients.go:1229","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:1229\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetTransferTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:349\ngo.temporal.io/server/service/history.(*transferQueueProcessorBase).readTasks\n\t/temporal/service/history/transferQueueProcessorBase.go:76\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:107\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:111\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:263\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:234"}

and

{"level":"error","ts":"2022-05-23T06:59:59.950Z","msg":"Operation failed with internal error.","error":"GetWorkflowExecution: failed. Error: context deadline exceeded","metric-scope":5,"logging-call-at":"persistenceMetricClients.go:1229","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:1229\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetWorkflowExecution\n\t/temporal/common/persistence/persistenceMetricClients.go:223\ngo.temporal.io/server/service/history/workflow.getWorkflowExecutionWithRetry.func1\n\t/temporal/service/history/workflow/transaction_impl.go:401\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history/workflow.getWorkflowExecutionWithRetry\n\t/temporal/service/history/workflow/transaction_impl.go:406\ngo.temporal.io/server/service/history/workflow.(*ContextImpl).LoadWorkflowExecution\n\t/temporal/service/history/workflow/context.go:338\ngo.temporal.io/server/service/history.loadMutableStateForTimerTask\n\t/temporal/service/history/nDCTaskUtil.go:132\ngo.temporal.io/server/service/history.(*timerQueueTaskExecutorBase).executeDeleteHistoryEventTask\n\t/temporal/service/history/timerQueueTaskExecutorBase.go:101\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).execute\n\t/temporal/service/history/timerQueueActiveTaskExecutor.go:109\ngo.temporal.io/server/service/history.(*timerQueueActiveProcessorImpl).process\n\t/temporal/service/history/timerQueueActiveProcessor.go:307\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:269\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:221\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}

and

{"level":"error","ts":"2022-05-23T06:59:59.950Z","msg":"Critical error processing task, retrying.","service":"history","shard-id":2519,"address":"10.1.55.133:7234","shard-item":"0xc00f244500","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":2519,"queue-task-id":1049736,"queue-task-visibility-timestamp":"2022-05-22T18:04:13.197Z","xdc-failover-version":0,"queue-task-type":"DeleteHistoryEvent","wf-namespace-id":"9c456370-c018-4201-a73a-7088021c60b9","wf-id":"CustomerProfilexxxx:d1233d09-1f36-4179-8d80-ff5e6d44a644:2022-04-22T14:04:08.211-04:00","wf-run-id":"8563a26a-2091-42cf-b89b-3b0cf09549bb","error":"GetWorkflowExecution: failed. Error: context deadline exceeded","operation-result":"OperationCritical","queue-task-type":"DeleteHistoryEvent","logging-call-at":"taskProcessor.go:227","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:227\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}

and

{"level":"error","ts":"2022-05-23T06:59:59.950Z","msg":"Error updating ack level for shard","service":"history","shard-id":1186,"address":"10.1.55.133:7234","shard-item":"0xc025d9f800","component":"visibility-queue-processor","error":"UpdateShard failed. Failed to start transaction. Error: context deadline exceeded","operation-result":"OperationFailed","logging-call-at":"queueAckMgr.go:224","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).updateQueueAckLevel\n\t/temporal/service/history/queueAckMgr.go:224\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:241"}
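At this volume, one way to triage is to aggregate the structured logs by message so the dominant failure modes stand out. A minimal sketch (the helper name is mine; it assumes the pod logs were saved to a file, one JSON object per line as in the excerpts above):

```shell
# Hypothetical helper: count distinct "msg" values in a file of JSON logs
# (one object per line) so the most frequent failures sort to the top.
count_error_msgs() {
  grep -o '"msg":"[^"]*"' "$1" | sort | uniq -c | sort -rn
}
```

Usage might be: kubectl logs <history-pod> > history.log, then count_error_msgs history.log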

The database was reachable from the pods the whole time, and there were no notable database logs other than more frequent opening and termination of connections. The database CPU showed a spike during the issue.

The issue was resolved after clearing the database, without any changes to the cluster setup. So I think it was more about the cluster's inability to serve any requests during the incident.

It may be worth mentioning that we are on the following versions:
server:1.13.0
admin-tools:1.13.0
web:1.12.0