Persistence layer problems

I’m currently evaluating Temporal by pushing larger batches of workflows through it. I’m using Postgres for persistence.

Sometimes it ends up in a state where everything seems to be stuck. Database CPU utilization is very low (~1%), and Temporal is logging a lot of errors that appear related to the persistence layer. For example:

{ "msg":"Queue reader unable to retrieve tasks","shard-id":1142,"address":"10.0.41.70:7234","component":"visibility-queue-processor","queue-reader-id":0,"error":"GetVisibilityTasks operation failed. Select failed. Error: context deadline exceeded","logging-call-at":"reader.go:470", ... }
{ "msg":"Operation failed with internal error.","error":"GetVisibilityTasks operation failed. Select failed. Error: context deadline exceeded","operation":"GetVisibilityTasks","logging-call-at":"persistence_metric_clients.go:1281", ... }
{"msg":"Error range completing queue task","shard-id":2583,"address":"10.0.41.70:7234","component":"timer-queue-processor","error":"RangeCompleteTimerTask operation failed. Error: context deadline exceeded","logging-call-at":"queue_base.go:453", ... }
{"msg":"Operation failed with internal error.","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","operation":"GetTransferTasks","logging-call-at":"persistence_metric_clients.go:1281", ... }
{"msg":"Operation failed with internal error.","error":"UpdateShard failed. Failed to start transaction. Error: context deadline exceeded","operation":"UpdateShard","logging-call-at":"persistence_metric_clients.go:1281" ... }

…and so on; there are many more. It looks a bit like a deadlock somewhere in the database layer.

I’ve been wondering how I could track down the cause of the problem.
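In case it helps anyone hitting the same symptoms: since the errors are timeouts while the DB is nearly idle, one way to start (a sketch, assuming direct `psql` access to the persistence database; requires Postgres 9.6+ for `pg_blocking_pids()`) is to check whether sessions are blocked on locks while the errors occur, and whether the database is actually recording deadlocks:

```sql
-- Sessions currently waiting on a lock, and which backends are blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       state,
       now() - query_start AS running_for,
       left(query, 120)    AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY running_for DESC;

-- Deadlocks are counted per database; a rising value here confirms real
-- deadlocks, while a flat zero points elsewhere (e.g. connection starvation).
SELECT datname, deadlocks
FROM pg_stat_database
WHERE datname = current_database();
```

If the first query returns nothing and the deadlock counter stays flat, the stall is probably not inside Postgres itself.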

I’m using Temporal version 1.23.0.
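Another thing I’d rule out, given the near-idle database: the Temporal server’s own connection pool being exhausted, so that persistence calls queue up waiting for a free connection until the context deadline expires. A sketch of the relevant knobs in the server’s static config (the key names match the SQL datastore config as I understand it; the values are placeholders, not recommendations):

```yaml
persistence:
  datastores:
    default:
      sql:
        pluginName: "postgres12"
        # If maxConns is too small for the history shard count and load,
        # queue readers time out waiting for a connection even though the
        # database itself is idle.
        maxConns: 50
        maxIdleConns: 50
        maxConnLifetime: "1h"
```

Comparing `maxConns` (summed across history pods) against the number of active backends in `pg_stat_activity` during a stall would show whether the pool is the bottleneck.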


I am facing a similar issue. The logs from the history service look like this:

{"level":"error","ts":"2024-06-25T06:02:06.823Z","msg":"Error range completing queue task","shard-id":566,"address":"10.244.0.21:7234","component":"archival-queue-processor","error":"RangeCompleteHistoryTask operation failed. CategoryID: 5. Error: context deadline exceeded","logging-call-at":"queue_base.go:453","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*queueBase).rangeCompleteTasks\n\t/home/builder/temporal/service/history/queues/queue_base.go:453\ngo.temporal.io/server/service/history/queues.(*queueBase).checkpoint\n\t/home/builder/temporal/service/history/queues/queue_base.go:367\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/home/builder/temporal/service/history/queues/queue_scheduled.go:221"}
{"level":"error","ts":"2024-06-25T06:01:11.059Z","msg":"Operation failed with internal error.","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","operation":"GetTransferTasks","logging-call-at":"persistence_metric_clients.go:1281","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:1281\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:1258\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:400\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:402\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistence_retryable_clients.go:380\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:143\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistence_retryable_clients.go:384\ngo.temporal.io/server/service/history/queues.NewImmediateQueue.func1.1\n\t/home/builder/temporal/service/history/queues/queue_immediate.go:79\ngo.temporal.io/server/common/collection.(*PagingIteratorImpl[...]).getNextPage\n\t/home/builder/temporal/common/collection/paging_iterator.go:116\ngo.temporal.io/server/common/collection.NewPagingIterator[...]\n\t/home/builder/temporal/common/collection/paging_iterator.go:52\ngo.temporal.io/server/service/history/queues.(*IteratorImpl).HasNex
t\n\t/home/builder/temporal/service/history/queues/iterator.go:74\ngo.temporal.io/server/service/history/queues.(*SliceImpl).SelectTasks\n\t/home/builder/temporal/service/history/queues/slice.go:373\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).loadAndSubmitTasks\n\t/home/builder/temporal/service/history/queues/reader.go:468\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).eventLoop\n\t/home/builder/temporal/service/history/queues/reader.go:436"}
{"level":"error","ts":"2024-06-25T06:02:06.823Z","msg":"Queue reader unable to retrieve tasks","shard-id":399,"address":"10.244.0.21:7234","component":"transfer-queue-processor","queue-reader-id":0,"error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","logging-call-at":"reader.go:470","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).loadAndSubmitTasks\n\t/home/builder/temporal/service/history/queues/reader.go:470\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).eventLoop\n\t/home/builder/temporal/service/history/queues/reader.go:436"} 

The workflow client logs look like this:

{"level":"error","ts":"2024-06-25T05:53:07.185Z","msg":"Operation failed with internal error.","error":"GetTimerTasks operation failed. Select failed. Error: context deadline exceeded","operation":"GetTimerTasks","logging-call-at":"persistence_metric_clients.go:1281","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:1281\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:1258\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:400\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:402\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistence_retryable_clients.go:380\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:143\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistence_retryable_clients.go:384\ngo.temporal.io/server/service/history/queues.NewScheduledQueue.func1.1\n\t/home/builder/temporal/service/history/queues/queue_scheduled.go:97\ngo.temporal.io/server/common/collection.(*PagingIteratorImpl[...]).getNextPage\n\t/home/builder/temporal/common/collection/paging_iterator.go:116\ngo.temporal.io/server/common/collection.NewPagingIterator[...]\n\t/home/builder/temporal/common/collection/paging_iterator.go:52\ngo.temporal.io/server/service/history/queues.(*IteratorImpl).HasNext\n\t/
home/builder/temporal/service/history/queues/iterator.go:74\ngo.temporal.io/server/service/history/queues.(*SliceImpl).SelectTasks\n\t/home/builder/temporal/service/history/queues/slice.go:373\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).loadAndSubmitTasks\n\t/home/builder/temporal/service/history/queues/reader.go:468\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).eventLoop\n\t/home/builder/temporal/service/history/queues/reader.go:436"}
{"level":"error","ts":"2024-06-25T05:53:07.206Z","msg":"Queue reader unable to retrieve tasks","shard-id":1595,"address":"10.244.0.21:7234","component":"timer-queue-processor","queue-reader-id":0,"error":"GetTimerTasks operation failed. Select failed. Error: context deadline exceeded","logging-call-at":"reader.go:470","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).loadAndSubmitTasks\n\t/home/builder/temporal/service/history/queues/reader.go:470\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).eventLoop\n\t/home/builder/temporal/service/history/queues/reader.go:436"} 

The persistence DB is Postgres.

One more thing to add:
When listing workflows with the tctl client, a workflow run shows as open, but when you describe that very same workflow, its status is “completed”.
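That list/describe mismatch is consistent with the visibility queue falling behind: list results come from the visibility store, while describe reads the authoritative execution state. A sketch to see how stale the visibility records are, assuming the standard (non-advanced) Postgres visibility schema — the table name and the status encoding below are assumptions from my own setup, so verify them against your schema first:

```sql
-- Runs the visibility store still believes are running (status 1 = Running
-- in the WorkflowExecutionStatus enum). If describe says "completed" for
-- these, the visibility task queue is lagging behind the execution store.
SELECT workflow_id, run_id, start_time, status, close_time
FROM executions_visibility
WHERE status = 1
ORDER BY start_time
LIMIT 20;
```

The `GetVisibilityTasks … context deadline exceeded` errors in the first post would explain exactly this kind of lag, since the visibility queue processor is what moves records from “open” to “completed”.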