Persistence layer problems

I’m currently evaluating Temporal by pushing larger batches of workflows through it. I’m using Postgres for persistence.

Sometimes, it ends up in a state where everything seems to be stuck. The database CPU utilization is very low (~1%), and Temporal is logging a lot of errors that seem to be related to the persistence layer. For example:

{ "msg":"Queue reader unable to retrieve tasks","shard-id":1142,"address":"10.0.41.70:7234","component":"visibility-queue-processor","queue-reader-id":0,"error":"GetVisibilityTasks operation failed. Select failed. Error: context deadline exceeded","logging-call-at":"reader.go:470", ... }
{ "msg":"Operation failed with internal error.","error":"GetVisibilityTasks operation failed. Select failed. Error: context deadline exceeded","operation":"GetVisibilityTasks","logging-call-at":"persistence_metric_clients.go:1281", ... }
{"msg":"Error range completing queue task","shard-id":2583,"address":"10.0.41.70:7234","component":"timer-queue-processor","error":"RangeCompleteTimerTask operation failed. Error: context deadline exceeded","logging-call-at":"queue_base.go:453", ... }
{"msg":"Operation failed with internal error.","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","operation":"GetTransferTasks","logging-call-at":"persistence_metric_clients.go:1281", ... }
{"msg":"Operation failed with internal error.","error":"UpdateShard failed. Failed to start transaction. Error: context deadline exceeded","operation":"UpdateShard","logging-call-at":"persistence_metric_clients.go:1281" ... }

…and so on; there are many more. It looks a bit like a deadlock somewhere in the database layer.

I’ve been wondering how I could try to find the cause of the problem.
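
One first check I’m planning on the Postgres side is whether sessions are actually blocked on locks or just slow. This is nothing Temporal-specific, only the standard pg_stat_activity view and pg_blocking_pids() (PostgreSQL 9.6+):

-- Sessions currently waiting on a lock, together with the session blocking them.
SELECT blocked.pid                  AS blocked_pid,
       now() - blocked.query_start  AS waiting_for,
       left(blocked.query, 120)     AS blocked_query,
       blocking.pid                 AS blocking_pid,
       blocking.state               AS blocking_state,
       left(blocking.query, 120)    AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

-- And, separately, anything that has been running for more than a few seconds.
SELECT pid,
       now() - query_start AS runtime,
       state,
       wait_event_type,
       wait_event,
       left(query, 120)    AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 seconds'
ORDER BY runtime DESC;

If the first query returns nothing and the second shows no long runners, the “context deadline exceeded” errors are presumably not caused by lock contention inside Postgres.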

I’m using version 1.23.0.


I am facing a similar issue.
The logs from the history service look like this:

{"level":"error","ts":"2024-06-25T06:02:06.823Z","msg":"Error range completing queue task","shard-id":566,"address":"10.244.0.21:7234","component":"archival-queue-processor","error":"RangeCompleteHistoryTask operation failed. CategoryID: 5. Error: context deadline exceeded","logging-call-at":"queue_base.go:453","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*queueBase).rangeCompleteTasks\n\t/home/builder/temporal/service/history/queues/queue_base.go:453\ngo.temporal.io/server/service/history/queues.(*queueBase).checkpoint\n\t/home/builder/temporal/service/history/queues/queue_base.go:367\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/home/builder/temporal/service/history/queues/queue_scheduled.go:221"}
{"level":"error","ts":"2024-06-25T06:01:11.059Z","msg":"Operation failed with internal error.","error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","operation":"GetTransferTasks","logging-call-at":"persistence_metric_clients.go:1281","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:1281\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:1258\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:400\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:402\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistence_retryable_clients.go:380\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:143\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistence_retryable_clients.go:384\ngo.temporal.io/server/service/history/queues.NewImmediateQueue.func1.1\n\t/home/builder/temporal/service/history/queues/queue_immediate.go:79\ngo.temporal.io/server/common/collection.(*PagingIteratorImpl[...]).getNextPage\n\t/home/builder/temporal/common/collection/paging_iterator.go:116\ngo.temporal.io/server/common/collection.NewPagingIterator[...]\n\t/home/builder/temporal/common/collection/paging_iterator.go:52\ngo.temporal.io/server/service/history/queues.(*IteratorImpl).HasNext\n\t/home/builder/temporal/service/history/queues/iterator.go:74\ngo.temporal.io/server/service/history/queues.(*SliceImpl).SelectTasks\n\t/home/builder/temporal/service/history/queues/slice.go:373\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).loadAndSubmitTasks\n\t/home/builder/temporal/service/history/queues/reader.go:468\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).eventLoop\n\t/home/builder/temporal/service/history/queues/reader.go:436"}
{"level":"error","ts":"2024-06-25T06:02:06.823Z","msg":"Queue reader unable to retrieve tasks","shard-id":399,"address":"10.244.0.21:7234","component":"transfer-queue-processor","queue-reader-id":0,"error":"GetTransferTasks operation failed. Select failed. Error: context deadline exceeded","logging-call-at":"reader.go:470","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).loadAndSubmitTasks\n\t/home/builder/temporal/service/history/queues/reader.go:470\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).eventLoop\n\t/home/builder/temporal/service/history/queues/reader.go:436"} 

The workflow client logs:

{"level":"error","ts":"2024-06-25T05:53:07.185Z","msg":"Operation failed with internal error.","error":"GetTimerTasks operation failed. Select failed. Error: context deadline exceeded","operation":"GetTimerTasks","logging-call-at":"persistence_metric_clients.go:1281","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/persistence.updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:1281\ngo.temporal.io/server/common/persistence.(*metricEmitter).recordRequestMetrics\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:1258\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:400\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistence_metric_clients.go:402\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks.func1\n\t/home/builder/temporal/common/persistence/persistence_retryable_clients.go:380\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:143\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetHistoryTasks\n\t/home/builder/temporal/common/persistence/persistence_retryable_clients.go:384\ngo.temporal.io/server/service/history/queues.NewScheduledQueue.func1.1\n\t/home/builder/temporal/service/history/queues/queue_scheduled.go:97\ngo.temporal.io/server/common/collection.(*PagingIteratorImpl[...]).getNextPage\n\t/home/builder/temporal/common/collection/paging_iterator.go:116\ngo.temporal.io/server/common/collection.NewPagingIterator[...]\n\t/home/builder/temporal/common/collection/paging_iterator.go:52\ngo.temporal.io/server/service/history/queues.(*IteratorImpl).HasNext\n\t/home/builder/temporal/service/history/queues/iterator.go:74\ngo.temporal.io/server/service/history/queues.(*SliceImpl).SelectTasks\n\t/home/builder/temporal/service/history/queues/slice.go:373\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).loadAndSubmitTasks\n\t/home/builder/temporal/service/history/queues/reader.go:468\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).eventLoop\n\t/home/builder/temporal/service/history/queues/reader.go:436"}
{"level":"error","ts":"2024-06-25T05:53:07.206Z","msg":"Queue reader unable to retrieve tasks","shard-id":1595,"address":"10.244.0.21:7234","component":"timer-queue-processor","queue-reader-id":0,"error":"GetTimerTasks operation failed. Select failed. Error: context deadline exceeded","logging-call-at":"reader.go:470","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).loadAndSubmitTasks\n\t/home/builder/temporal/service/history/queues/reader.go:470\ngo.temporal.io/server/service/history/queues.(*ReaderImpl).eventLoop\n\t/home/builder/temporal/service/history/queues/reader.go:436"} 

The persistence DB is Postgres.

One more thing to add:
When listing workflows with the tctl client, you can see that a workflow run is still open, but when you describe that very same run, its status is “completed”.
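
That mismatch would be consistent with the visibility-queue-processor errors above: describe reads the workflow’s mutable state (which has already closed), while the list view comes from the visibility store, whose update task apparently never got applied. If you are on the stock Temporal Postgres visibility schema, you can inspect the raw visibility row directly; the table and column names below assume that stock executions_visibility schema, and 'my-workflow-id' is just a placeholder:

-- Raw visibility record for one workflow; with the stock schema, close_time
-- is typically still NULL (and status still "running") for rows that the
-- list view reports as open even though the run has actually closed.
SELECT workflow_id,
       run_id,
       status,      -- WorkflowExecutionStatus enum value
       start_time,
       close_time
FROM executions_visibility
WHERE workflow_id = 'my-workflow-id';  -- placeholder, replace with the stuck run's id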

We’re seeing the same kind of symptoms.
We’re running Temporal server version 1.25.x

The observed symptoms are as follows:

  • DB utilisation is low (< 10%)
  • p99 query times in the DB are < 5 ms
  • persistence availability reported by Temporal is 4% (!)
  • the backlog of tasks in the queue is > 300k
  • max clients in Temporal is 50 and max clients in pgbouncer is 100
  • pgbouncer shows that only a handful of connections are used to interact with the DB (the admin-console check we use is sketched below)
  • Temporal isn’t able to recover from this in any meaningful time by scaling up the server and worker nodes; it only recovers once we stop the influx of requests completely, and even then it takes hours (or days) to work through the 300k outstanding tasks.
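
For the pgbouncer observation above, the check we use is the pgbouncer admin console (e.g. psql against the special “pgbouncer” database; the port depends on your setup). These are standard pgbouncer admin commands, not Temporal-specific:

-- Run against the pgbouncer admin database ("pgbouncer"), not the real Postgres DB.
SHOW POOLS;    -- per database/user: cl_active, cl_waiting, sv_active, sv_idle, maxwait
SHOW CLIENTS;  -- every client connection and what it is currently waiting for
SHOW STATS;    -- per-database query/transaction throughput

If cl_waiting stays at 0 and sv_active stays tiny while Temporal reports persistence unavailability, the stall is most likely on the Temporal side of the connection pool rather than in pgbouncer or Postgres.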

The logs we observe, which we cannot explain from the DB side, are as follows; they occur on all Temporal nodes (as they all use the history node).

{"level":"info","ts":"2024-12-05T11:32:22.185Z","msg":"history client encountered error","service":"matching","error":"GetWorkflowExecution: failed. Error: no usable database connection found","service-error-type":"serviceerror.Unavailable","logging-call-at":"/home/runner/work/docker-builds/docker-builds/temporal/client/history/metric_client.go:104"}

The following screenshot shows pgbouncer holding only 2 active connections out of a much larger available pool from the Temporal history service, while almost every persistence call on the Temporal side fails with the error above.

“Error: no usable database connection found”
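
“no usable database connection found” appears to be raised by Temporal’s SQL persistence layer when the history service considers its own connections unhealthy, even though pgbouncer looks mostly idle. To cross-check from the database side, we count what Postgres actually sees per client; this is plain pg_stat_activity again, 'temporal' is an assumed database name, and with pgbouncer in front the client_addr will be pgbouncer’s address rather than the history pods’:

-- Connections per client address and state, as seen by Postgres itself.
SELECT client_addr,
       application_name,
       state,
       count(*) AS connections
FROM pg_stat_activity
WHERE datname = 'temporal'   -- assumed DB name, adjust to your setup
GROUP BY client_addr, application_name, state
ORDER BY connections DESC;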

What server version do you use? There was a very similar issue described here; also see the issue here for reference.

This was fixed in patch release 1.25.1 - Release v1.25.1 · temporalio/temporal · GitHub

Can you upgrade to it and see if you still run into the same problem?

We are actually running 1.25.2, so unfortunately this was not solved by that release.