History service can't reconnect after RDS DB failover


Setup:

RDS Postgresql
Temporal on Kubernetes (EKS) with 1 replica for each service

I tried to simulate a failover scenario for RDS. I followed this instruction. During the failover the UI was available, but I noticed that even after the DB came back to life (12:17:51 PM UTC - 12:18:22 PM UTC), state history wasn’t available in the UI.

In the history logs:

{"level":"error","ts":"2022-09-19T12:33:20.374Z","msg":"Processor unable to retrieve tasks","shard-id":498,"address":"","component":"visibility-queue-processor","error":"GetVisibilityTasks operation failed. Select failed. Error: read tcp> read: connection timed out","logging-call-at":"queueProcessorBase.go:235","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/home/builder/temporal/service/history/queueProcessorBase.go:235\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/home/builder/temporal/service/history/queueProcessorBase.go:199"}

Once I delete the history pod, everything works. Is this expected behavior?

Does RDS DB failover use async replication? If so, it is not consistent and cannot be used with Temporal, which requires a fully consistent DB.

Hello Maxim. According to the AWS docs, it uses synchronous replication.

When you have Multi-AZ deployment, Amazon Relational Database Service (Amazon RDS) provisions one primary and one standby DB instance with synchronous physical replication of Amazon Elastic Block Store (Amazon EBS) storage for high availability and failover without data loss.

@maxim any thoughts here?

@maxim hey. I double-checked the AWS RDS documentation, and it describes the failover process as follows:

Failover is automatically handled by Amazon RDS so that you can resume database operations as quickly as possible without administrative intervention. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer.
Failovers, as defined by the interval between the detection of the failure on the primary and the resumption of transactions on the standby, typically complete within one to two minutes. Failover time can also be affected by whether large uncommitted transactions must be recovered; the use of adequately large instance types is recommended with Multi-AZ for best results. AWS also recommends the use of Provisioned IOPS with Multi-AZ instances, for fast, predictable, and consistent throughput performance.

Could you tell me whether the history service has any automatic retries?