Does the matching service hold latent connections?

Hi Folks

We’ve been successfully running Temporal in production for the last month. While most things are going well, there’s a performance behaviour we would like some help with.

We noticed that after some time (about a week), Cassandra CPU usage slowly climbs all the way up to 70%. But if we recycle the matching-service pods, CPU usage immediately drops to around 20%, and it then takes about another week to climb back to 70%.
Looking further into the Cassandra stats, we noticed that CPU IO waits dropped considerably when we recycled the matching-service pods. This made us wonder: is matching-service holding onto some latent or old connections? (Please find attached a screenshot of CPU IO waits dropping immediately after the matching pods restart.)

Would really appreciate some help on this.


@Chitresh_Deshpande, it looks like we might be leaking TaskQueueManager instances for abandoned TaskQueues and not unloading them. I think we saw something similar ourselves a couple of weeks ago. Can you share the persistence counters for UpdateTaskQueue calls made by the matching service? That would confirm this hypothesis.

If it is indeed the TaskQueueManager leak, then this PR fixes the problem. It will be included in the 1.11 release.

Thanks @samar. I will try to get you that metric. Unfortunately, the Prometheus instance for Temporal was down during the incident.

If you can share a goroutine dump of the matching service, we would be able to confirm it that way too. So the next time it happens, please gather goroutine dumps and a CPU profile.

Will do, thanks for your help, Samar!