Errors when deleting workflow execution visibility records

We have Temporal deployed using version 1.20.3 of the Helm chart. We are using MySQL 8 as our database, running on AWS Aurora, and the cluster is configured with 128 history shards.

The Temporal cluster has been up and running for a few months and working mostly without issues. A couple of times, though, we’ve run into a problem where deleting execution visibility records fails. When this happens we get flooded with error logs from the temporal-history service. Other than these errors and older workflow executions failing to be deleted from visibility, the cluster seems to be working fine and workflows continue to execute as expected.
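For what it’s worth, here is roughly how we could confirm that those rows are actually piling up in the visibility store. This is just a sketch: it assumes the standard MySQL visibility schema with a database named temporal_visibility (table and column names may differ for other schema versions), and the 7-day retention period is only an example.

-- Count closed-execution visibility rows that are past an assumed
-- 7-day retention and so should already have been deleted.
SELECT COUNT(*)
FROM temporal_visibility.executions_visibility
WHERE close_time IS NOT NULL
  AND close_time < NOW() - INTERVAL 7 DAY;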

A few more details about these errors.

We get a “context deadline exceeded” error with the following stack trace:
go.temporal.io/server/common/log.(*zapLogger).Error
/home/builder/temporal/common/log/zap_logger.go:150
go.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).updateErrorMetric
/home/builder/temporal/common/persistence/visibility/visiblity_manager_metrics.go:258
go.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).DeleteWorkflowExecution
/home/builder/temporal/common/persistence/visibility/visiblity_manager_metrics.go:122
go.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processDeleteExecution
/home/builder/temporal/service/history/visibilityQueueTaskExecutor.go:494
go.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).Execute
/home/builder/temporal/service/history/visibilityQueueTaskExecutor.go:122
go.temporal.io/server/service/history/queues.(*executableImpl).Execute
/home/builder/temporal/service/history/queues/executable.go:211
go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1
/home/builder/temporal/common/tasks/fifo_scheduler.go:231
go.temporal.io/server/common/backoff.ThrottleRetry.func1
/home/builder/temporal/common/backoff/retry.go:175
go.temporal.io/server/common/backoff.ThrottleRetryContext
/home/builder/temporal/common/backoff/retry.go:199
go.temporal.io/server/common/backoff.ThrottleRetry
/home/builder/temporal/common/backoff/retry.go:176
go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask
/home/builder/temporal/common/tasks/fifo_scheduler.go:241
go.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask
/home/builder/temporal/common/tasks/fifo_scheduler.go:217

From what I can gather, we seem to be hitting an issue at the database level: we see a bunch of START TRANSACTION statements immediately followed by ROLLBACKs.
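If it helps with diagnosis, these are the kinds of queries we could run on the Aurora writer the next time this happens, to see whether the deletes are stuck behind long-running transactions or lock waits. They only use standard MySQL 8 system views, nothing Temporal-specific, so treat them as a sketch rather than a known fix.

-- Long-running / in-flight InnoDB transactions.
SELECT trx_id, trx_state, trx_started, trx_rows_locked, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started;

-- Which transactions are blocked, and by whom (MySQL 8 performance_schema).
SELECT
  w.REQUESTING_ENGINE_TRANSACTION_ID AS waiting_trx,
  w.BLOCKING_ENGINE_TRANSACTION_ID   AS blocking_trx,
  dl.OBJECT_NAME                     AS locked_table
FROM performance_schema.data_lock_waits w
JOIN performance_schema.data_locks dl
  ON w.BLOCKING_ENGINE_LOCK_ID = dl.ENGINE_LOCK_ID;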

Any insights into what might be causing this and how we might resolve it? So far the only fix we’ve found is to completely reset the cluster. We’ve run into this 2-3 times, and each time it has been unclear why it showed up, since the cluster had been running without issues for a while and there was no change in load.