Temporal History Server Errors

Hello,

We’ve been seeing the following recurrent error pattern on our development Temporal cluster; it shows up every few seconds on one of the history servers:

{"level":"error","ts":"2021-09-08T13:13:43.429Z","msg":"Cache unable to retrieve event from store","service":"history","shard-id":264,"address":"10.151.190.180:7234","shard-item":"0xc001aec700","component":"events-cache","error":"Workflow execution history not found.","wf-id":"temporal-sys-tq-scanner","wf-run-id":"415ee122-2bd4-4cc0-9839-882024a4fa36","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-history-event-id":1,"logging-call-at":"cache.go:142","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history/events.(*CacheImpl).GetEvent\n\t/temporal/service/history/events/cache.go:142\ngo.temporal.io/server/service/history/workflow.(*MutableStateImpl).GetStartEvent\n\t/temporal/service/history/workflow/mutable_state_impl.go:752\ngo.temporal.io/server/service/history/workflow.newMutableStateBuilderFromDB\n\t/temporal/service/history/workflow/mutable_state_impl.go:323\ngo.temporal.io/server/service/history/workflow.(*ContextImpl).LoadWorkflowExecution\n\t/temporal/service/history/workflow/context.go:359\ngo.temporal.io/server/service/history.loadMutableStateForTimerTask\n\t/temporal/service/history/nDCTaskUtil.go:132\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).executeWorkflowTaskTimeoutTask\n\t/temporal/service/history/timerQueueActiveTaskExecutor.go:297\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).execute\n\t/temporal/service/history/timerQueueActiveTaskExecutor.go:99\ngo.temporal.io/server/service/history.(*timerQueueActiveProcessorImpl).process\n\t/temporal/service/history/timerQueueActiveProcessor.go:304\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:264\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:217\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:244\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:167"}
{"level":"error","ts":"2021-09-08T13:13:43.429Z","msg":"Fail to process task","service":"history","shard-id":264,"address":"10.151.190.180:7234","shard-item":"0xc001aec700","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":264,"queue-task-id":2197384,"queue-task-visibility-timestamp":"2020-10-28T00:00:33.426Z","xdc-failover-version":0,"queue-task-type":"WorkflowTaskTimeout","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-id":"temporal-sys-tq-scanner","wf-run-id":"415ee122-2bd4-4cc0-9839-882024a4fa36","error":"unable to get workflow start event","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:332","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:332\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:218\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:244\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:167"}
{"level":"error","ts":"2021-09-08T13:13:43.429Z","msg":"Critical error processing task, retrying.","service":"history","shard-id":264,"address":"10.151.190.180:7234","shard-item":"0xc001aec700","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":264,"queue-task-id":2197384,"queue-task-visibility-timestamp":"2020-10-28T00:00:33.426Z","xdc-failover-version":0,"queue-task-type":"WorkflowTaskTimeout","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-id":"temporal-sys-tq-scanner","wf-run-id":"415ee122-2bd4-4cc0-9839-882024a4fa36","error":"unable to get workflow start event","operation-result":"OperationCritical","queue-task-type":"WorkflowTaskTimeout","logging-call-at":"taskProcessor.go:223","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:223\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:244\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:167"}

The cluster was initially deployed on version 1.0.0, then migrated to 1.6.7, and just upgraded to 1.11.4. Unfortunately, I do not remember exactly when the error first occurred.

Do you have any idea why this is happening or how to fix it?

Thanks,
Andrei

EDIT: Listing the workflows running in the temporal-system namespace does not yield any results: tctl -ns temporal-system wf list.
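
For reference, the commands I’m using look roughly like this (the describe call just reuses the workflow and run IDs copied from the error logs above):

# List workflows in the temporal-system namespace (returns no results)
tctl --ns temporal-system workflow list

# Describe the specific execution referenced in the errors
tctl --ns temporal-system workflow describe --workflow_id temporal-sys-tq-scanner --run_id 415ee122-2bd4-4cc0-9839-882024a4fa36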

Are you upgrading directly from 1.6.7 to 1.11.4? That is not supported: we support and test upgrades only between consecutive minor versions, so you need to upgrade to 1.7.x first, then 1.8.x, and so on. But if it is a development cluster, I guess it would be easier for you to just recreate the database and start with the latest 1.12.

Also, it seems that you have a task in the queue that has been getting retried since "queue-task-visibility-timestamp":"2020-10-28T00:00:33.426Z". I would definitely recommend starting with a fresh database.
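
Assuming a Cassandra-backed dev cluster (adjust accordingly for SQL with temporal-sql-tool), recreating the database is roughly the following sketch; the endpoint, keyspace name, and replication factor are placeholders for your setup, and the same steps apply to the visibility keyspace with ./schema/cassandra/visibility/versioned:

# Drop the existing keyspaces first (e.g. DROP KEYSPACE in cqlsh), then,
# using the temporal-cassandra-tool shipped with the server release you are deploying:
temporal-cassandra-tool --ep <cassandra-host> create -k temporal --rf 3
temporal-cassandra-tool --ep <cassandra-host> -k temporal setup-schema -v 0.0
temporal-cassandra-tool --ep <cassandra-host> -k temporal update-schema -d ./schema/cassandra/temporal/versioned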

Would you foresee any problems with upgrading directly, as long as I can scale the cluster down and bring it back up after the schema upgrade?

Thanks for the answer! Yes, indeed, we can start with a fresh database on the development cluster.

It is not only about the schema but also about the data. Sometimes we need to make changes that modify existing data, and we usually ask people to run each version for some period of time (e.g. one hour) before upgrading to the next one.
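
As a rough sketch, the sequential path looks like this for each intermediate release (again assuming Cassandra; the endpoint and keyspaces are placeholders, and the schema directories come from that release’s checkout):

# For each intermediate version (1.7.x, then 1.8.x, ... up to the target), in order:
# 1. apply that release's schema deltas
temporal-cassandra-tool --ep <cassandra-host> -k temporal update-schema -d ./schema/cassandra/temporal/versioned
temporal-cassandra-tool --ep <cassandra-host> -k temporal_visibility update-schema -d ./schema/cassandra/visibility/versioned
# 2. deploy that server version and let it run for a while (e.g. ~1 hour)
#    so any in-code data migrations can complete before moving to the next version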

I’ve checked the migrations that are run. Do you also migrate data in code after each server upgrade?

Yes, sometimes we do.

This is a very interesting thread! I feel like the following statement from the documentation needs to be expanded a bit (it’s pretty terse) and made more prominent, so that users are more likely to notice it when learning how to manage a Temporal environment:

We ensure that any consecutive versions are compatible in terms of database schema upgrades, features, and system behavior, however there is no guarantee that there is compatibility between any 2 non-consecutive versions.

Along these same lines, I am seeing errors like these in our logs every few seconds:

{"level":"error","ts":1632341745.837787,"logger":"temporal","caller":"log/zap_logger.go:143","msg":"Critical error processing task, retrying.","service":"history","shard-id":43,"address":"10.41.104.30:7234","shard-item":"0xc001b0db00","component":"transfer-queue-processor","cluster-name":"active","shard-id":43,"queue-task-id":398458925,"queue-task-visibility-timestamp":1632030153.8976693,"xdc-failover-version":0,"queue-task-type":"TransferCloseExecution","wf-namespace-id":"a4ab2d1e-4d2e-4af1-b1ed-b2b9a50fdb63","wf-id":"ede3b007-58e4-4347-90ea-2e3b8ce222ac","wf-run-id":"3610475b-ca90-451e-b579-9b0541ea0a67","error":"context deadline exceeded","operation-result":"OperationCritical","queue-task-type":"TransferCloseExecution","logging-call-at":"taskProcessor.go:227","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:227\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:171"}
{"level":"error","ts":1632341745.8375838,"logger":"temporal","caller":"log/zap_logger.go:143","msg":"Fail to process task","service":"history","shard-id":43,"address":"10.41.104.30:7234","shard-item":"0xc001b0db00","component":"transfer-queue-processor","cluster-name":"active","shard-id":43,"queue-task-id":398458925,"queue-task-visibility-timestamp":1632030153.8976693,"xdc-failover-version":0,"queue-task-type":"TransferCloseExecution","wf-namespace-id":"a4ab2d1e-4d2e-4af1-b1ed-b2b9a50fdb63","wf-id":"ede3b007-58e4-4347-90ea-2e3b8ce222ac","wf-run-id":"3610475b-ca90-451e-b579-9b0541ea0a67","error":"context deadline exceeded","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:341","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:341\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:222\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:171"}

This is on a Cassandra dev database, and we’re running 1.12.1.

The interesting bit is that all of the workflow IDs reported by these errors belong to executions that failed with a timeout many hours ago, and all of those timeouts happened while waiting for a child workflow to finish. Is there any reason that kind of scenario might cause this error loop?
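
In case it helps with reproducing the check, executions like the ones in those log lines can be looked up roughly as follows (workflow and run IDs copied from the errors above; <namespace> is a placeholder for the actual namespace name, which only appears as an ID in the logs):

# Describe one of the executions referenced in the errors and dump its history
tctl --ns <namespace> workflow describe --workflow_id ede3b007-58e4-4347-90ea-2e3b8ce222ac --run_id 3610475b-ca90-451e-b579-9b0541ea0a67
tctl --ns <namespace> workflow show --workflow_id ede3b007-58e4-4347-90ea-2e3b8ce222ac --run_id 3610475b-ca90-451e-b579-9b0541ea0a67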