Hi team,
We have the Temporal server deployed on an AKS node pool with more than 12 nodes (each node has 8 vCPUs and 32 GB of memory).
Server settings
The server runs 4 replicas each of the frontend, matching, history, and worker services, plus 1 web UI and 1 admin tools service. These services are hosted in cluster A. The Temporal server version is 1.24.2.0.
Database
We use a standalone PostgreSQL 12 database hosted in cluster B as the persistence and visibility stores. The database schema version is 1.24.2.
maxConns is set to 20 and maxConnLifetime is set to 1h. The max client connection limit in PgBouncer is set to 1000.
The database instance has 8 CPUs, 32 Gi of memory, and a 512 Gi disk.
Deployment
We use the Helm chart to deploy the server. With the replica count of the Temporal server services set to 4, I observed that 1 or 2 pods of random services kept restarting.
The error in the pod log is “SQL schema version compatibility check failed: driver: bad connection”. The details are in the screenshot below.
Issue
While this is happening, the Temporal server still runs, but persistence latency is much higher than normal (over 0.15 s per impacted service).
When the replica count is reduced from 4 to 3, the issue goes away.
I checked the node resource usage (CPU, memory, disk). CPU usage is about 5 - 7%, memory is around 10%, and disk is 20%.
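For reference, here is a rough sketch of the maximum number of client connections the server pods could open at each replica count, compared with the PgBouncer limit above. It assumes (not confirmed) that every frontend, matching, history, and worker pod keeps one connection pool for the persistence store and one for the visibility store, each capped at maxConns. The web UI and admin tools are left out, and the function and constant names are only illustrative.

```python
# Rough upper bound on client connections opened by the Temporal server pods.
# Assumption (not confirmed): each pod holds one pool per store, capped at maxConns.

SERVICES = ("frontend", "matching", "history", "worker")
STORES = ("persistence", "visibility")
MAX_CONNS_PER_POOL = 20                  # maxConns from the settings above
PGBOUNCER_MAX_CLIENT_CONNECTIONS = 1000  # PgBouncer limit from the settings above


def max_client_connections(replicas: int) -> int:
    """Worst-case connection count if every pool is completely full."""
    return len(SERVICES) * replicas * len(STORES) * MAX_CONNS_PER_POOL


for replicas in (3, 4):
    total = max_client_connections(replicas)
    print(f"{replicas} replicas: up to {total} of "
          f"{PGBOUNCER_MAX_CLIENT_CONNECTIONS} client connections")
# 3 replicas: up to 480 of 1000 client connections
# 4 replicas: up to 640 of 1000 client connections
```

Under these assumptions both replica counts stay below the PgBouncer client limit. The estimate does not cover any server-side connection limits on the PostgreSQL or PgBouncer side that are not listed above.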
Any help is much appreciated. Thank you!