We’ve been successfully running Temporal in production for about a month. While most things are going well, there’s a performance behaviour we would like some help with.
We noticed that after some time (about a week) Cassandra CPU slowly climbs all the way up to 70%. But if we recycle the matching-service pods, CPU usage immediately drops to roughly 20%, and then it takes about another week to climb back to 70%.
Looking further into the Cassandra stats, we noticed that when we recycled the matching-service pods, CPU iowait dropped considerably. This made us wonder: is the matching service holding onto stale / old connections? (Please find attached a screenshot of CPU iowait dropping immediately after the matching pods restart.)
@Chitresh_Deshpande, it looks like we might be leaking TaskQueueManager instances for abandoned TaskQueues and not unloading them. I think we saw something similar ourselves a couple of weeks ago. Can you share the persistence counters for UpdateTaskQueue calls made by the matching service? That would help confirm this hypothesis.
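For reference, here is a rough sketch of pulling that counter out of Prometheus via its HTTP API. The metric and label names (`persistence_requests`, `operation="UpdateTaskQueue"`, `service="matching"`) and the Prometheus address are assumptions based on a typical Temporal metrics setup; adjust them to whatever your scrape config actually exposes.

```go
// promquery.go - rough sketch: pull the UpdateTaskQueue persistence counter
// rate from Prometheus. Metric/label names and the Prometheus URL are
// assumptions; adjust them to your deployment.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Hypothetical Prometheus address and query; the metric name
	// "persistence_requests" and its labels depend on your metrics config.
	promAddr := "http://prometheus:9090"
	query := `sum(rate(persistence_requests{operation="UpdateTaskQueue",service="matching"}[5m]))`

	resp, err := http.Get(promAddr + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		log.Fatalf("query failed: %v", err)
	}
	defer resp.Body.Close()

	// Minimal decode of the Prometheus instant-query response.
	var out struct {
		Status string `json:"status"`
		Data   struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []interface{}     `json:"value"` // [timestamp, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatalf("decode failed: %v", err)
	}
	for _, r := range out.Data.Result {
		fmt.Printf("%v => %v\n", r.Metric, r.Value)
	}
}
```

If the rate keeps growing even when the number of live task queues should be flat, that would point at managers not being unloaded.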
If you can share a goroutine dump of the matching service, we would be able to confirm as well. So the next time this happens, please gather goroutine dumps and a CPU profile.
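In case it helps, a minimal sketch for capturing both from a matching pod, assuming pprof is enabled on the server (the pprof port in the `global` section of the server config) and reachable from wherever you run this; the host and port below are placeholders.

```go
// grabprofiles.go - rough sketch: capture a goroutine dump and a 30s CPU
// profile from the matching service's Go pprof endpoint. The host/port is a
// placeholder; point it at a matching pod with pprof enabled.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// save downloads url into file, failing loudly on any error.
func save(client *http.Client, url, file string) {
	resp, err := client.Get(url)
	if err != nil {
		log.Fatalf("GET %s: %v", url, err)
	}
	defer resp.Body.Close()

	f, err := os.Create(file)
	if err != nil {
		log.Fatalf("create %s: %v", file, err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatalf("write %s: %v", file, err)
	}
	fmt.Printf("wrote %s\n", file)
}

func main() {
	// Placeholder address of a matching pod's pprof listener.
	base := "http://matching-pod:7936/debug/pprof"
	client := &http.Client{Timeout: 2 * time.Minute}

	// Full goroutine stacks as plain text (debug=2).
	save(client, base+"/goroutine?debug=2", "matching-goroutines.txt")
	// 30-second CPU profile; inspect later with `go tool pprof`.
	save(client, base+"/profile?seconds=30", "matching-cpu.pprof")
}
```

The goroutine dump should show whether there is a large and growing number of TaskQueueManager-related goroutines, and the CPU profile will show where the matching service is spending its time while Cassandra CPU is high.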