History service can't reconnect after RDS DB failover

Hello!

Setup:

RDS PostgreSQL
Temporal on Kubernetes in EKS with 1 replica for each service

I’ve tried to simulate a failover scenario for RDS; I used this instruction. During the failover the UI was available, but I noticed that even after the DB came back to life (12:17:51 PM UTC - 12:18:22 PM UTC) the workflow state history wasn’t available in the UI.

In the history logs:

{"level":"error","ts":"2022-09-19T12:33:20.374Z","msg":"Processor unable to retrieve tasks","shard-id":498,"address":"10.1.123.123:7234","component":"visibility-queue-processor","error":"GetVisibilityTasks operation failed. Select failed. Error: read tcp 10.1.123.123:57486->10.1.123.123:5432: read: connection timed out","logging-call-at":"queueProcessorBase.go:235","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/home/builder/temporal/service/history/queueProcessorBase.go:235\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/home/builder/temporal/service/history/queueProcessorBase.go:199"}

Once I delete the history pod, everything works. Is this expected behavior?

Does RDS DB failover use async replication? If so, it is not consistent and cannot be used with Temporal, which requires a fully consistent DB.

Hello Maxim. According to the AWS docs, it uses synchronous replication.

When you have Multi-AZ deployment, Amazon Relational Database Service (Amazon RDS) provisions one primary and one standby DB instance with synchronous physical replication of Amazon Elastic Block Store (Amazon EBS) storage for high availability and failover without data loss.

@maxim any thoughts here?

@maxim hey. I double-checked the AWS RDS documentation and it describes the failover process as follows:

Failover is automatically handled by Amazon RDS so that you can resume database operations as quickly as possible without administrative intervention. When failing over, Amazon RDS simply flips the canonical name record (CNAME) for your DB instance to point at the standby, which is in turn promoted to become the new primary. We encourage you to follow best practices and implement database connection retry at the application layer.
Failovers, as defined by the interval between the detection of the failure on the primary and the resumption of transactions on the standby, typically complete within one to two minutes. Failover time can also be affected by whether large uncommitted transactions must be recovered; the use of adequately large instance types is recommended with Multi-AZ for best results. AWS also recommends the use of Provisioned IOPS with Multi-AZ instances, for fast, predictable, and consistent throughput performance.

Could you tell if we have some automatic retries in the history service?
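
As a side note on what “database connection retry at the application layer” can look like, here is a minimal Go sketch of re-dialing Postgres with exponential backoff. This is illustrative only, not Temporal’s actual persistence code; the driver choice and backoff limits are assumptions.

```go
package dbretry

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib" // assumed Postgres driver, registered as "pgx"
)

// ConnectWithRetry keeps dialing the database until a ping succeeds,
// backing off exponentially between attempts. Each fresh dial resolves
// the hostname again, so after an RDS failover it picks up the new
// primary once the CNAME has flipped.
func ConnectWithRetry(ctx context.Context, dsn string) (*sql.DB, error) {
	backoff := time.Second
	for {
		db, err := sql.Open("pgx", dsn)
		if err == nil {
			if err = db.PingContext(ctx); err == nil {
				return db, nil
			}
			db.Close()
		}
		log.Printf("database not reachable (%v), retrying in %v", err, backoff)
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}
```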

What’s the Temporal server version you are using? Did you use helm charts to deploy the server onto EKS?

I tried two tests with 1.17.5 and 1.18.0, with Postgres 13:

  1. Postgres running locally: I stopped the DB and started it back up, and Temporal was able to recover.
  2. Postgres running in a Docker container: I stopped the container and started it back up, and Temporal was again able to recover and reconnect (see the probe sketch below).
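
To make the recovery window visible during a test like this, a small standalone probe (hypothetical, not part of Temporal) can ping the database every second and log when connectivity drops and comes back; the DSN below is a placeholder:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib" // assumed Postgres driver
)

func main() {
	// Placeholder DSN: point it at the same endpoint the Temporal services use.
	db, err := sql.Open("pgx", "postgres://temporal:temporal@db.example.com:5432/temporal")
	if err != nil {
		log.Fatal(err)
	}
	up := true
	for range time.Tick(time.Second) {
		if err := db.Ping(); err != nil {
			if up {
				log.Printf("connection lost: %v", err)
				up = false
			}
			continue
		}
		if !up {
			log.Print("connection restored")
			up = true
		}
	}
}
```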

I believe there are connection retries with the SQL driver, but I don’t know enough about RDS to know if there are any config options that might be preventing the connection from being re-established. I would also check for possible issues between EKS and RDS.
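
One driver-level detail that may matter here (this is an assumption, not a confirmed root cause): Go’s database/sql only re-resolves the hostname when it dials a new connection, so long-lived pooled connections can keep pointing at the old primary’s IP even after the CNAME flips. Capping connection lifetime forces them to be recycled; Temporal’s SQL persistence config exposes a maxConnLifetime setting for this, if I remember correctly. A sketch with illustrative values:

```go
package dbpool

import (
	"database/sql"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib" // assumed Postgres driver
)

// OpenPool opens a connection pool whose connections are recycled often
// enough that a DNS change (e.g. an RDS CNAME flip) is picked up quickly.
// The values are illustrative, not recommendations.
func OpenPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, err
	}
	db.SetConnMaxLifetime(time.Minute)      // re-dial (and re-resolve DNS) at least every minute
	db.SetConnMaxIdleTime(30 * time.Second) // drop idle connections sooner
	db.SetMaxOpenConns(20)
	db.SetMaxIdleConns(20)
	return db, nil
}
```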

Could you share all history pod logs when you initiate the RDS failover?

Everything was updated to temporalio/server:1.18.0. During the failover operation AWS changes the DNS record that points to the master instance. I checked it several times: the change propagated within 30-60 seconds with no issues. I don’t think any network issues appeared between EKS and RDS during the failover.
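
For what it’s worth, a tiny resolver loop run from inside the history pod can record exactly when the endpoint starts resolving to the new primary; the hostname below is a placeholder:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Placeholder: use the actual RDS writer endpoint here.
	const host = "mydb.cluster-xxxxxxxx.eu-west-1.rds.amazonaws.com"
	for range time.Tick(5 * time.Second) {
		addrs, err := net.LookupHost(host)
		if err != nil {
			log.Printf("lookup failed: %v", err)
			continue
		}
		log.Printf("%s resolves to %v", host, addrs)
	}
}
```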

Logs are attached in the thread in Slack.

Thanks for the info.

For the history logs that you shared, covering the two minutes between 2022-10-03T15:34:25.928Z and 2022-10-03T15:36:51.189Z, it does look like on the Temporal side the connection failures are being retried (roughly every second) and keep failing, as you mentioned, with:

Failed to commit transaction. Error: read tcp 10.1.124.51:52656->10.1.182.110:5432: read: connection

From the frontend logs I see

Unable to call matching.PollWorkflowTaskQueue.","service":"frontend","wf-task-queue-name":"/_sys/default-worker-tq/2","timeout":"1m9.999771355s","error":"context deadline exceeded",

and

Operation failed with internal error.","error":"ListClusterMetadata operation failed. Failed to get cluster metadata rows. Error: dial tcp 10.1.182.110:5432: connect: connection timed out"

I assume that your DB connectAddr is 10.1.182.110:5432, right?

During the failover operation AWS changes the DNS record that points to the master instance

So I don’t know the details about RDS, but this could be the issue. Is there some proxy you could configure the connection address to point at, which could help with the DNS record updates?
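
If a proxy such as Amazon RDS Proxy or PgBouncer were put in front of the database (just possibilities; nothing in this thread confirms one is in use), the only change on the Temporal side would be the host in the connection string, since the proxy keeps a stable endpoint and follows the failover itself. A hypothetical helper to make that explicit:

```go
package dbcfg

import "fmt"

// ProxyDSN is a hypothetical helper: the host points at a proxy endpoint
// (e.g. RDS Proxy or PgBouncer) rather than the instance endpoint, so the
// application never has to chase the CNAME flip during a failover.
func ProxyDSN(user, password, proxyHost string, port int, dbName string) string {
	return fmt.Sprintf("postgres://%s:%s@%s:%d/%s", user, password, proxyHost, port, dbName)
}
```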

The DNS record was updated within 30 seconds. Inside Temporal’s history pod I can see that the DNS record changed to the new one. I tried a failover again, and the service seems to have self-restored after 15 minutes.

Could you briefly describe the steps for the GetOrCreateShard operation, and why it can fail?