Hello,
We keep seeing the following recurring error pattern on our development Temporal cluster; it shows up every few seconds on one of the history servers:
{"level":"error","ts":"2021-09-08T13:13:43.429Z","msg":"Cache unable to retrieve event from store","service":"history","shard-id":264,"address":"10.151.190.180:7234","shard-item":"0xc001aec700","component":"events-cache","error":"Workflow execution history not found.","wf-id":"temporal-sys-tq-scanner","wf-run-id":"415ee122-2bd4-4cc0-9839-882024a4fa36","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-history-event-id":1,"logging-call-at":"cache.go:142","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history/events.(*CacheImpl).GetEvent\n\t/temporal/service/history/events/cache.go:142\ngo.temporal.io/server/service/history/workflow.(*MutableStateImpl).GetStartEvent\n\t/temporal/service/history/workflow/mutable_state_impl.go:752\ngo.temporal.io/server/service/history/workflow.newMutableStateBuilderFromDB\n\t/temporal/service/history/workflow/mutable_state_impl.go:323\ngo.temporal.io/server/service/history/workflow.(*ContextImpl).LoadWorkflowExecution\n\t/temporal/service/history/workflow/context.go:359\ngo.temporal.io/server/service/history.loadMutableStateForTimerTask\n\t/temporal/service/history/nDCTaskUtil.go:132\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).executeWorkflowTaskTimeoutTask\n\t/temporal/service/history/timerQueueActiveTaskExecutor.go:297\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).execute\n\t/temporal/service/history/timerQueueActiveTaskExecutor.go:99\ngo.temporal.io/server/service/history.(*timerQueueActiveProcessorImpl).process\n\t/temporal/service/history/timerQueueActiveProcessor.go:304\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:264\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:217\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:244\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:167"}
{"level":"error","ts":"2021-09-08T13:13:43.429Z","msg":"Fail to process task","service":"history","shard-id":264,"address":"10.151.190.180:7234","shard-item":"0xc001aec700","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":264,"queue-task-id":2197384,"queue-task-visibility-timestamp":"2020-10-28T00:00:33.426Z","xdc-failover-version":0,"queue-task-type":"WorkflowTaskTimeout","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-id":"temporal-sys-tq-scanner","wf-run-id":"415ee122-2bd4-4cc0-9839-882024a4fa36","error":"unable to get workflow start event","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:332","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:332\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:218\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:244\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:167"}
{"level":"error","ts":"2021-09-08T13:13:43.429Z","msg":"Critical error processing task, retrying.","service":"history","shard-id":264,"address":"10.151.190.180:7234","shard-item":"0xc001aec700","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","shard-id":264,"queue-task-id":2197384,"queue-task-visibility-timestamp":"2020-10-28T00:00:33.426Z","xdc-failover-version":0,"queue-task-type":"WorkflowTaskTimeout","wf-namespace-id":"32049b68-7872-4094-8e63-d0dd59896a83","wf-id":"temporal-sys-tq-scanner","wf-run-id":"415ee122-2bd4-4cc0-9839-882024a4fa36","error":"unable to get workflow start event","operation-result":"OperationCritical","queue-task-type":"WorkflowTaskTimeout","logging-call-at":"taskProcessor.go:223","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:223\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:244\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:167"}
The cluster was initially deployed on version 1.0.0, then upgraded to 1.6.7, and has just been upgraded to 1.11.4. Unfortunately, I do not remember exactly when the error first appeared.
Do you have any idea why this is happening or how to fix it?
Thanks,
Andrei
EDIT: Listing the workflows running in the temporal-system namespace with tctl -ns temporal-system wf list does not yield any results.
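
EDIT 2: For completeness, these are the follow-up checks I'm planning to run, using the run id and shard id from the log lines above (I'm writing them from memory, so the exact flags may be off for 1.11.4 — corrections welcome):

tctl --ns temporal-system workflow describe --workflow_id temporal-sys-tq-scanner --run_id 415ee122-2bd4-4cc0-9839-882024a4fa36
tctl admin shard describe --shard_id 264

The first should tell me whether that run id from the error still exists at all (wf list may simply no longer cover it), and the second should show whether the timer ack level of shard 264 is stuck behind that old WorkflowTaskTimeout task from 2020-10-28.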