Temporal node resource usage is very low, but some service pods keep restarting

Hi team,

We have the Temporal server deployed on an AKS node pool with more than 12 nodes (each node has 8 vCPUs and 32 GB of memory).

Server settings
The server runs 4 replicas each of the frontend, matching, history, and worker services, plus 1 Web UI and 1 admin-tools service. These services are hosted in cluster A. The Temporal server version is 1.24.2.0.

Database
We use a standalone PostgreSQL 12 database hosted in cluster B as both the persistence and visibility stores. The database schema version is 1.24.2.

maxConns is set to 20 and maxConnLifetime is set to 1h. The maximum client connections setting in PgBouncer (max_client_conn) is set to 1000.
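For reference, these connection settings live in the Helm values for the persistence and visibility stores. A rough sketch of the relevant section (key paths follow the temporalio/helm-charts layout; the host, port, and database names below are placeholders, not our real values):

```yaml
server:
  config:
    persistence:
      default:
        driver: "sql"
        sql:
          driver: "postgres12"               # PostgreSQL 12
          host: "pgbouncer.cluster-b.local"  # placeholder; connections go through PgBouncer
          port: 5432                         # placeholder
          database: "temporal"               # placeholder
          maxConns: 20
          maxConnLifetime: "1h"
      visibility:
        driver: "sql"
        sql:
          driver: "postgres12"
          host: "pgbouncer.cluster-b.local"  # placeholder
          port: 5432                         # placeholder
          database: "temporal_visibility"    # placeholder
          maxConns: 20
          maxConnLifetime: "1h"
```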

The database instance has 8 CPUs, 32 GiB of memory, and 512 GiB of disk.

Deployment
We use the Helm chart to deploy the server. With the replica count of the Temporal server services set to 4, I observed 1 or 2 pods of random services restarting repeatedly.
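For reference, the replica count is set through the Helm values, roughly like this (the key path follows the temporalio/helm-charts layout; depending on the chart version there may also be per-service overrides):

```yaml
server:
  replicaCount: 4  # applies to frontend, history, matching, and worker unless overridden per service
```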

The error in the pod says “SQL schema version compatibility check failed: driver: bad connection”. The details are in the screenshot below.

Issue
With this issue, the Temporal server is still able to run, but the persistence latency is much higher than normal (over 0.15 s for each impacted service).

When the replica count is reduced from 4 to 3, the issue is gone.

I checked the node resource usage (CPU, memory, disk). CPU usage is about 5-7%, memory is around 10%, and disk is at 20%.

Any help is much appreciated! Thank you!

@maxim @tihomir

“SQL schema version compatibility check failed: driver: bad connection”.

When you started seeing this, was it associated with any changes to the Temporal server (static config) or any changes/updates to your DB (primary/visibility persistence)?

When the replica count is reduced from 4 to 3, the issue is gone.

This is interesting. After this, does increasing it back to 4 cause the issue again?
My guess is that you shut down an older pod that was using either a misconfigured or a changed DB connection. The last time I saw a similar issue, it was due to a visibility DB that was shut down after some pods had started up.

Thank you, Tihomir!

Yes, I tried increasing the replica count from 3 to 4, and the issue didn’t happen. But when I kept increasing the replica count from 4 to 6, it happened again.

The replica count is the config being updated.

1 matching service pod and 1 worker service pod crashed with the error: `"msg":"start failed, rolling back","component":"fx","error":"unable to initialize system namespace: unable to initialize metadata manager: driver: bad connection"`.


I found the misconfigured number of shards (numHistoryShards) in the code and fixed it. The issue described above is gone when the replica count is 6.
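For context, the shard count sits next to the rest of the server config in the Helm values. A sketch with an illustrative number (the real value has to stay whatever the cluster was first created with, since the shard count cannot be changed afterwards):

```yaml
server:
  config:
    numHistoryShards: 512  # illustrative; must match the value the cluster was originally created with
```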

I then increased the replica count from 6 to 10 (10 matching, 10 history, 10 frontend, 10 worker), and the error “sql schema version compatibility check failed: driver: bad connection” came back. It occurred in the crashed services.
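For reference, a rough back-of-the-envelope upper bound on the database client connections this implies, assuming each pod can hold up to maxConns (20) connections for each of the two SQL stores (default and visibility):

- 4 replicas × 4 services = 16 pods → up to 16 × 20 × 2 = 640 connections
- 10 replicas × 4 services = 40 pods → up to 40 × 20 × 2 = 1,600 connections, which is above the PgBouncer max client connections of 1,000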

I also checked another worker pod that didn’t crash. I saw several errors and warnings there too.

Error 1: `"WorkerType":"ActivityWorker","Error":"Not enough hosts to serve the request"`

Error 2: `"error starting temporal-sys-history-scanner-workflow workflow","service":"worker","error":"context deadline exceeded"`

These errors indicate that some of your pods are restarting/failing due to low resources. This aligns with the fact that other pods are crashing.

@tihomir