Temporal History Server Errors

Hello,

We’ve been seeing the following recurrent error pattern on our development Temporal cluster; it shows up every few seconds on one of the history servers:

{"level":"error","ts":"2021-09-08T13:13:43.429Z","msg":"Cache unable to retrieve event from store","service":"history","shard-id":264,"address":"10.151.190.180:7234","shard-item":"0xc001aec700","component":"events-cache","error":"Workflow execution history not found.","wf-id":"temporal-sys-tq-scanner","wf-run-id":"415ee122-2bd4-4cc0-9839-882024a4fa36","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-history-event-id":1,"logging-call-at":"cache.go:142","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history/events.(*CacheImpl).GetEvent\n\t/temporal/service/history/events/cache.go:142\ngo.temporal.io/server/service/history/workflow.(*MutableStateImpl).GetStartEvent\n\t/temporal/service/history/workflow/mutable_state_impl.go:752\ngo.temporal.io/server/service/history/workflow.newMutableStateBuilderFromDB\n\t/temporal/service/history/workflow/mutable_state_impl.go:323\ngo.temporal.io/server/service/history/workflow.(*ContextImpl).LoadWorkflowExecution\n\t/temporal/service/history/workflow/context.go:359\ngo.temporal.io/server/service/history.loadMutableStateForTimerTask\n\t/temporal/service/history/nDCTaskUtil.go:132\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).executeWorkflowTaskTimeoutTask\n\t/temporal/service/history/timerQueueActiveTaskExecutor.go:297\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).execute\n\t/temporal/service/history/timerQueueActiveTaskExecutor.go:99\ngo.temporal.io/server/service/history.(*timerQueueActiveProcessorImpl).process\n\t/temporal/service/history/timerQueueActiveProcessor.go:304\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:264\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:217\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:244\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:167"}
{"level":"error","ts":"2021-09-08T13:13:43.429Z","msg":"Fail to process task","service":"history","shard-id":264,"address":"10.151.190.180:7234","shard-item":"0xc001aec700","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":264,"queue-task-id":2197384,"queue-task-visibility-timestamp":"2020-10-28T00:00:33.426Z","xdc-failover-version":0,"queue-task-type":"WorkflowTaskTimeout","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-id":"temporal-sys-tq-scanner","wf-run-id":"415ee122-2bd4-4cc0-9839-882024a4fa36","error":"unable to get workflow start event","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:332","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:332\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:218\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:244\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:167"}
{"level":"error","ts":"2021-09-08T13:13:43.429Z","msg":"Critical error processing task, retrying.","service":"history","shard-id":264,"address":"10.151.190.180:7234","shard-item":"0xc001aec700","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":264,"queue-task-id":2197384,"queue-task-visibility-timestamp":"2020-10-28T00:00:33.426Z","xdc-failover-version":0,"queue-task-type":"WorkflowTaskTimeout","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-id":"temporal-sys-tq-scanner","wf-run-id":"415ee122-2bd4-4cc0-9839-882024a4fa36","error":"unable to get workflow start event","operation-result":"OperationCritical","queue-task-type":"WorkflowTaskTimeout","logging-call-at":"taskProcessor.go:223","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:223\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:244\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:167"}

The cluster was initially deployed on version 1.0.0, then migrated to 1.6.7, and just upgraded to 1.11.4. Unfortunately, I do not remember exactly when the error first occurred.

Do you have any idea why this is happening or how to fix it?

Thanks,
Andrei

EDIT: Listing the workflows running in the temporal-system namespace does not yield any results: tctl -ns temporal-system wf list.
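
For reference, the commands I’m using look roughly like this (the describe call just reuses the workflow and run IDs copied from the error logs above):

# List workflows in the temporal-system namespace (returns no results)
tctl --ns temporal-system workflow list

# Describe the specific execution referenced in the errors
tctl --ns temporal-system workflow describe --workflow_id temporal-sys-tq-scanner --run_id 415ee122-2bd4-4cc0-9839-882024a4fa36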

Are you upgrading directly from 1.6.7 to 1.11.4? That is not supported: we support and test upgrades only between consecutive minor versions, so you need to upgrade to 1.7.x first, then 1.8.x, and so on. But if it is a development cluster, I guess it would be easier for you to just recreate the database and start with the latest 1.12.

Also, it seems that you have a task in the queue that has been getting retried since "queue-task-visibility-timestamp":"2020-10-28T00:00:33.426Z". I would definitely recommend starting with a fresh database.
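
Assuming a Cassandra-backed dev cluster (adjust accordingly for SQL with temporal-sql-tool), recreating the database is roughly the following sketch; the endpoint, keyspace name, and replication factor are placeholders for your setup, and the same steps apply to the visibility keyspace with ./schema/cassandra/visibility/versioned:

# Drop the existing keyspaces first (e.g. DROP KEYSPACE in cqlsh), then,
# using the temporal-cassandra-tool shipped with the server release you are deploying:
temporal-cassandra-tool --ep <cassandra-host> create -k temporal --rf 3
temporal-cassandra-tool --ep <cassandra-host> -k temporal setup-schema -v 0.0
temporal-cassandra-tool --ep <cassandra-host> -k temporal update-schema -d ./schema/cassandra/temporal/versioned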

Would you foresee any problems with upgrading directly, as long as I can scale the cluster down and bring it back up after the schema upgrade?

Thanks for the answer! Yes, indeed, we can start with a fresh database on the development cluster.

It is not only about the schema but also about the data. Sometimes we need to make changes that modify existing data, and we usually ask people to run each version for some period of time (e.g. one hour) before upgrading to the next one.
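
As a rough sketch, the sequential path looks like this for each intermediate release (again assuming Cassandra; the endpoint and keyspaces are placeholders, and the schema directories come from that release’s checkout):

# For each intermediate version (1.7.x, then 1.8.x, ... up to the target), in order:
# 1. apply that release's schema deltas
temporal-cassandra-tool --ep <cassandra-host> -k temporal update-schema -d ./schema/cassandra/temporal/versioned
temporal-cassandra-tool --ep <cassandra-host> -k temporal_visibility update-schema -d ./schema/cassandra/visibility/versioned
# 2. deploy that server version and let it run for a while (e.g. ~1 hour)
#    so any in-code data migrations can complete before moving to the next version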

I’ve checked the migrations that are run. Do you also migrate data in code after each server upgrade?

Yes, sometimes we do.

This is a very interesting thread! I feel like the following statement from the documentation needs to be expanded a bit (it’s pretty terse) and made more prominent, so that users are more likely to notice it when learning how to manage a Temporal environment:

We ensure that any consecutive versions are compatible in terms of database schema upgrades, features, and system behavior, however there is no guarantee that there is compatibility between any 2 non-consecutive versions.

Along these same lines, I am seeing errors like these in our logs every few seconds:

{"level":"error","ts":1632341745.837787,"logger":"temporal","caller":"log/zap_logger.go:143","msg":"Critical error processing task, retrying.","service":"history","shard-id":43,"address":"10.41.104.30:7234","shard-item":"0xc001b0db00","component":"transfer-queue-processor","cluster-name":"active","shard-id":43,"queue-task-id":398458925,"queue-task-visibility-timestamp":1632030153.8976693,"xdc-failover-version":0,"queue-task-type":"TransferCloseExecution","wf-namespace-id":"a4ab2d1e-4d2e-4af1-b1ed-b2b9a50fdb63","wf-id":"ede3b007-58e4-4347-90ea-2e3b8ce222ac","wf-run-id":"3610475b-ca90-451e-b579-9b0541ea0a67","error":"context deadline exceeded","operation-result":"OperationCritical","queue-task-type":"TransferCloseExecution","logging-call-at":"taskProcessor.go:227","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:227\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:171"}
{"level":"error","ts":1632341745.8375838,"logger":"temporal","caller":"log/zap_logger.go:143","msg":"Fail to process task","service":"history","shard-id":43,"address":"10.41.104.30:7234","shard-item":"0xc001b0db00","component":"transfer-queue-processor","cluster-name":"active","shard-id":43,"queue-task-id":398458925,"queue-task-visibility-timestamp":1632030153.8976693,"xdc-failover-version":0,"queue-task-type":"TransferCloseExecution","wf-namespace-id":"a4ab2d1e-4d2e-4af1-b1ed-b2b9a50fdb63","wf-id":"ede3b007-58e4-4347-90ea-2e3b8ce222ac","wf-run-id":"3610475b-ca90-451e-b579-9b0541ea0a67","error":"context deadline exceeded","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:341","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:341\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:222\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/go/pkg/mod/go.temporal.io/server@v1.12.1/service/history/taskProcessor.go:171"}

This is on a Cassandra dev database, and we’re running 1.12.1.

The interesting bit is that all of the workflow IDs reported by these errors belong to executions that failed with a timeout many hours ago, and all of those timeouts happened while waiting for a child workflow to finish. Is there any reason that kind of scenario might cause this error loop?
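
In case it helps with reproducing the check, executions like the ones in those log lines can be looked up roughly as follows (workflow and run IDs copied from the errors above; <namespace> is a placeholder for the actual namespace name, which only appears as an ID in the logs):

# Describe one of the executions referenced in the errors and dump its history
tctl --ns <namespace> workflow describe --workflow_id ede3b007-58e4-4347-90ea-2e3b8ce222ac --run_id 3610475b-ca90-451e-b579-9b0541ea0a67
tctl --ns <namespace> workflow show --workflow_id ede3b007-58e4-4347-90ea-2e3b8ce222ac --run_id 3610475b-ca90-451e-b579-9b0541ea0a67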