Growing open persistence connections

Hi all, we operate an active-passive cluster setup, and for some reason one passive cluster started opening thousands of connections (10k at peak) one evening. We have SQL connection limits configured for both the regular and visibility DBs in the passive and active clusters, but it doesn't appear these were respected in the passive cluster:

maxCons: 128
maxIdleConns: 128
maxConnLifetime: 1h
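
For context, my understanding is that each sql store in the server config gets its own Go database/sql pool per process, and that maxConns / maxIdleConns / maxConnLifetime (assuming maxCons above is the standard maxConns key) map roughly onto that pool as in this sketch. This is not the actual server code; the driver and DSN are placeholders:

package main

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql" // placeholder driver, assuming the mysql plugin
)

func main() {
	// Placeholder DSN purely for illustration.
	db, err := sql.Open("mysql", "temporal:password@tcp(db-host:3306)/temporal")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// How I understand the persistence settings map onto the pool:
	db.SetMaxOpenConns(128)          // maxConns
	db.SetMaxIdleConns(128)          // maxIdleConns
	db.SetConnMaxLifetime(time.Hour) // maxConnLifetime

	// If I read this right, the cap applies per pool per process, with
	// separate pools for the default and visibility stores, which is where
	// my 128 + 128 = 256 per-host estimate below comes from.
}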

Given the above config, I believe the limit per host in the cluster should be 256 (128 for the default store plus 128 for the visibility store), but it doesn't appear to be respected. Around the time the connections started opening, we see a few of these errors:

{"level":"warn","ts":"2024-01-12T00:01:00.401Z","msg":"RecordActivityHeartbeat with error","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"25@ltx1-app24643.prod.linkedin.com@","ActivityID":"5","ActivityType":"temporal-sys-tq-scanner-scvg-activity","Attempt":1,"WorkflowType":"temporal-sys-tq-scanner-workflow","WorkflowID":"temporal-sys-tq-scanner","RunID":"87557f0b-63dd-4065-82b2-7ae4a147c958","Error":"context canceled","logging-call-at":"internal_task_handlers.go:1750"}
{"level":"error","ts":"2024-01-12T03:54:02.085Z","msg":"Error updating queue state","shard-id":256,"address":"00.000.00.000:7234","component":"transfer-queue-processor","error":"shard status unknown","operation-result":"OperationFailed","logging-call-at":"queue_base.go:433","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/log/zap_logger.go:144\ngo.temporal.io/server/service/history/queues.(*queueBase).updateQueueState\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:433\ngo.temporal.io/server/service/history/queues.(*queueBase).checkpoint\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:395\ngo.temporal.io/server/service/history/queues.(*immediateQueue).processEventLoop\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_immediate.go:164"}
{"level":"error","ts":"2024-01-12T03:54:02.086Z","msg":"error updating replication level for shard","shard-id":324,"address":"00.000.00.000:7234","error":"shard status unknown","operation-result":"OperationFailed","logging-call-at":"get_tasks.go:55","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/log/zap_logger.go:144\ngo.temporal.io/server/service/history/api/replication.GetTasks\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/api/replication/get_tasks.go:55\ngo.temporal.io/server/service/history.(*historyEngineImpl).GetReplicationMessages\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/historyEngine.go:750\ngo.temporal.io/server/service/history.(*Handler).GetReplicationMessages.func1\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/handler.go:1417"}
{"level":"info","ts":"2024-01-12T03:54:04.797Z","msg":"Range updated for shardID","shard-id":324,"address":"00.000.00.000:7234","shard-range-id":6,"previous-shard-range-id":5,"number":5242880,"next-number":6291456,"logging-call-at":"context_impl.go:1217"}
{"level":"info","ts":"2024-01-12T03:54:04.797Z","msg":"Acquired shard","shard-id":324,"address":"00.000.00.000:7234","logging-call-at":"context_impl.go:1922"}
{"level":"error","ts":"2024-01-12T03:54:04.797Z","msg":"Error updating queue state","shard-id":324,"address":"00.000.00.000:7234","component":"visibility-queue-processor","error":"shard status unknown","operation-result":"OperationFailed","logging-call-at":"queue_base.go:433","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/log/zap_logger.go:144\ngo.temporal.io/server/service/history/queues.(*queueBase).updateQueueState\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:433\ngo.temporal.io/server/service/history/queues.(*queueBase).checkpoint\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:395\ngo.temporal.io/server/service/history/queues.(*immediateQueue).processEventLoop\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_immediate.go:164"}
{"level":"info","ts":"2024-01-12T03:54:04.797Z","msg":"Range updated for shardID","shard-id":256,"address":"00.000.00.000:7234","shard-range-id":6,"previous-shard-range-id":5,"number":5242880,"next-number":6291456,"logging-call-at":"context_impl.go:1217"}
{"level":"info","ts":"2024-01-12T03:54:04.797Z","msg":"Acquired shard","shard-id":256,"address":"00.000.00.000:7234","logging-call-at":"context_impl.go:1922"}
{"level":"error","ts":"2024-01-12T04:14:00.745Z","msg":"Error updating queue state","shard-id":351,"address":"00.000.00.000:7234","component":"visibility-queue-processor","error":"shard status unknown","operation-result":"OperationFailed","logging-call-at":"queue_base.go:433","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/log/zap_logger.go:144\ngo.temporal.io/server/service/history/queues.(*queueBase).updateQueueState\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:433\ngo.temporal.io/server/service/history/queues.(*queueBase).checkpoint\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:395\ngo.temporal.io/server/service/history/queues.(*immediateQueue).processEventLoop\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_immediate.go:164"}
{"level":"error","ts":"2024-01-12T04:14:00.745Z","msg":"Error updating queue state","shard-id":351,"address":"00.000.00.000:7234","component":"transfer-queue-processor","error":"shard status unknown","operation-result":"OperationFailed","logging-call-at":"queue_base.go:433","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/log/zap_logger.go:144\ngo.temporal.io/server/service/history/queues.(*queueBase).updateQueueState\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:433\ngo.temporal.io/server/service/history/queues.(*queueBase).checkpoint\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:395\ngo.temporal.io/server/service/history/queues.(*immediateQueue).processEventLoop\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_immediate.go:164"}
{"level":"error","ts":"2024-01-12T04:14:00.745Z","msg":"Error updating queue state","shard-id":351,"address":"00.000.00.000:7234","component":"timer-queue-processor","error":"shard status unknown","operation-result":"OperationFailed","logging-call-at":"queue_base.go:433","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/log/zap_logger.go:144\ngo.temporal.io/server/service/history/queues.(*queueBase).updateQueueState\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:433\ngo.temporal.io/server/service/history/queues.(*queueBase).checkpoint\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_base.go:395\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_scheduled.go:192"}
{"level":"info","ts":"2024-01-12T04:14:01.143Z","msg":"Range updated for shardID","shard-id":415,"address":"00.000.00.000:7234","shard-range-id":7,"previous-shard-range-id":6,"number":6291456,"next-number":7340032,"logging-call-at":"context_impl.go:1217"}

I also see a few deadlock errors on other hosts during the connection spike:

{"level":"info","ts":"2024-01-12T04:02:55.392Z","msg":"Range updated for shardID","shard-id":293,"address":"10.154.93.89:7234","shard-range-id":8,"previous-shard-range-id":7,"number":7340032,"next-number":8388608,"logging-call-at":"context_impl.go:1217"}
{"level":"info","ts":"2024-01-12T04:02:55.392Z","msg":"Acquired shard","shard-id":293,"address":"10.154.93.89:7234","logging-call-at":"context_impl.go:1922"}
{"level":"error","ts":"2024-01-12T04:16:19.231Z","msg":"potential deadlock detected","service":"history","name":"Shard(320)","logging-call-at":"deadlock.go:117","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/log/zap_logger.go:144\ngo.temporal.io/server/common/deadlock.(*deadlockDetector).detected\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/deadlock/deadlock.go:117\ngo.temporal.io/server/common/deadlock.(*loopContext).worker.func1\n\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/deadlock/deadlock.go:209"}
{"level":"info","ts":"2024-01-12T04:16:19.244Z","msg":"dumping goroutine profile for suspected deadlock","service":"history","logging-call-at":"deadlock.go:147"}
{"level":"info","ts":"2024-01-12T04:16:19.244Z","msg":"goroutine profile: total 3484\n1536 @ 0x43e7b6 0x44e63c 0x1bb5bf5 0x46fdc1\n#\t0x1bb5bf4\tgo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask+0x134\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/tasks/fifo_scheduler.go:215\n\n304 @ 0x43e7b6 0x44e63c 0x1b8ac9c 0x46fdc1\n#\t0x1b8ac9b\tgo.temporal.io/server/common/timer.NewLocalGate.func1+0xdb\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/common/timer/local_gate.go:72\n\n228 @ 0x43e7b6 0x44e63c 0x1ba5650 0x46fdc1\n#\t0x1ba564f\tgo.temporal.io/server/service/history/queues.(*ReaderImpl).eventLoop+0xaf\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/reader.go:399\n\n228 @ 0x43e7b6 0x44e63c 0x1ba8a4f 0x46fdc1\n#\t0x1ba8a4e\tgo.temporal.io/server/service/history/queues.(*reschedulerImpl).rescheduleLoop+0x12e\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/rescheduler.go:208\n\n149 @ 0x43e7b6 0x44e63c 0x1b9ebaf 0x46fdc1\n#\t0x1b9ebae\tgo.temporal.io/server/service/history/queues.(*immediateQueue).processEventLoop+0x1ae\t/runner/_work/foobar/foobar/build/foobar/target/src/go.temporal.io/server/service/history/queues/queue_immediate.go:156\n\n76 @ 0x43e7b6 0x436fb7 0x46a189 0x4e0352 0x4e16ba 0x4e16a8 0x64f989 0x6626e5 0x47bcba 0x18cb192 0x18cb155 0x46fdc1\n#\t0x46a188\tinternal/poll.runtime_pollWait+0x88\t\t\t\t/home/runner/.gradle/language/golang/1.19.2/go/src/runtime/netpoll.go:305\n#\t0x4e0351\tinternal/poll.(*pollDesc).wait+0x31\t\t\t\t/home/runner/.gradle/language/golang/1.19.2/go/src/internal/poll/fd_poll_runtime.go:84\n#\t0x4e16b9\tinternal/poll.(*pollDesc).waitRead+0x259\t\t\t/home/runner/.gradle/language/golang/1.19.2/go/src/internal/poll/fd_poll_runtime.go:89\n#\t0x4e16a7\tinternal/poll.(*FD).Read+0x247\t\t\t\t\t/home/runner/.gradle/language/golang/1.19.2/go/src/internal/poll/fd_unix.go:167\n#\t0x64f988\tnet.(*netFD).Read+0x28\t\t\t\t\t\t/home/runner/.gradle/language/golang/1.19.2/go/src/net/fd_posix.go:55\n#\t0x6626e4\tnet.(*conn).Read+0x44\t\t\t\t\t\t/home/runner/.gradle/language/golang/1.19.2/go/src/net/net.go:183\n#\t0x47bcb9\tio.ReadAtLeast+0x99\t\t\t\t\t\t/home/runner/.gradle/language/golang/1.19.2/go/src/io/io.go:332\n#\t0x18cb191\tio.ReadFull+0x91\t\t\t\t\t\t/home/runner/.gradle/language/golang/1.19.2/go/src/io/io.go:351\n#\t0x18cb154\tgithub.com/temporalio/tchannel-go.(*Connection).readFrames+0x54\t/runner/_work/foobar/foobar/build/foobar/target/src/github.com/temporalio/tchannel-go/connection.go:660\n\n76 @ 0x43e7b6 0x44e63c 0x18cb866 0x46fdc1\n#\t0x18cb865\tgithub.com/temporalio/tchannel-go.(*Connection).writeFrames+0x85\t/runner/_work/foobar/foobar/build/foobar/target/src/github.com/temporalio/tchannel-go/connection.go:737\n\n76 @ 0x43e7b6 0x44e63c 0x1b5a94e 0x46fdc1\n#\t0x1b5a94d\tgo.temporal.io/server/service/history/replication.(*taskProcessorManagerImpl)...

The passive cluster was working fine prior to this, and all other hosts in both clusters are running normally. Any idea what would cause this rapid growth in open connections?

Temporal version: 1.19.1
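
In case the numbers help, a rough way to break the open connections down by client host on the database side looks like this (a sketch; MySQL and the DSN are assumptions):

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // placeholder driver
)

func main() {
	// Placeholder DSN purely for illustration.
	db, err := sql.Open("mysql", "monitor:password@tcp(db-host:3306)/information_schema")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Group open connections by the client-host part of PROCESSLIST.HOST
	// ("host:port") to see which Temporal hosts hold how many connections.
	rows, err := db.Query(`
		SELECT SUBSTRING_INDEX(HOST, ':', 1) AS client, COUNT(*) AS conns
		FROM information_schema.PROCESSLIST
		GROUP BY client
		ORDER BY conns DESC`)
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	for rows.Next() {
		var client string
		var conns int
		if err := rows.Scan(&client, &conns); err != nil {
			panic(err)
		}
		fmt.Printf("%s\t%d\n", client, conns)
	}
	if err := rows.Err(); err != nil {
		panic(err)
	}
}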