We’ve been successfully running Temporal in production for about a month. While most things are going well, there’s a performance behaviour we would like some help with.
We noticed that after some time (about a week) Cassandra CPU slowly climbs all the way up to 70%. But if we recycle the matching-service pods, CPU usage immediately drops to roughly 20%, and then it takes about another week to climb back to 70%.
Looking further into the Cassandra stats, we noticed that when we recycled the matching-service pods, CPU iowait dropped considerably. This made us wonder: is the matching service holding onto stale / old connections? (Please find attached a screenshot of CPU iowait dropping immediately after the matching pods restart.)
@Chitresh_Deshpande, it looks like we might be leaking TaskQueueManager instances for abandoned TaskQueues and not unloading them. I think we saw something similar ourselves a couple of weeks ago. Can you share the persistence counters for UpdateTaskQueue calls made by the matching service? That would help confirm this hypothesis.
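For reference, here is a rough sketch of pulling that counter out of Prometheus via its HTTP API. The metric and label names (`persistence_requests`, `operation="UpdateTaskQueue"`, `service="matching"`) and the Prometheus address are assumptions based on a typical Temporal metrics setup; adjust them to whatever your scrape config actually exposes.

```go
// promquery.go - rough sketch: pull the UpdateTaskQueue persistence counter
// rate from Prometheus. Metric/label names and the Prometheus URL are
// assumptions; adjust them to your deployment.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Hypothetical Prometheus address and query; the metric name
	// "persistence_requests" and its labels depend on your metrics config.
	promAddr := "http://prometheus:9090"
	query := `sum(rate(persistence_requests{operation="UpdateTaskQueue",service="matching"}[5m]))`

	resp, err := http.Get(promAddr + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		log.Fatalf("query failed: %v", err)
	}
	defer resp.Body.Close()

	// Minimal decode of the Prometheus instant-query response.
	var out struct {
		Status string `json:"status"`
		Data   struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
				Value  []interface{}     `json:"value"` // [timestamp, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatalf("decode failed: %v", err)
	}
	for _, r := range out.Data.Result {
		fmt.Printf("%v => %v\n", r.Metric, r.Value)
	}
}
```

If the rate keeps growing even when the number of live task queues should be flat, that would point at managers not being unloaded.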
If you can share a goroutine dump of the matching service, we would be able to confirm as well. So the next time this happens, please gather goroutine dumps and a CPU profile.
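In case it helps, a minimal sketch for capturing both from a matching pod, assuming pprof is enabled on the server (the pprof port in the `global` section of the server config) and reachable from wherever you run this; the host and port below are placeholders.

```go
// grabprofiles.go - rough sketch: capture a goroutine dump and a 30s CPU
// profile from the matching service's Go pprof endpoint. The host/port is a
// placeholder; point it at a matching pod with pprof enabled.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// save downloads url into file, failing loudly on any error.
func save(client *http.Client, url, file string) {
	resp, err := client.Get(url)
	if err != nil {
		log.Fatalf("GET %s: %v", url, err)
	}
	defer resp.Body.Close()

	f, err := os.Create(file)
	if err != nil {
		log.Fatalf("create %s: %v", file, err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatalf("write %s: %v", file, err)
	}
	fmt.Printf("wrote %s\n", file)
}

func main() {
	// Placeholder address of a matching pod's pprof listener.
	base := "http://matching-pod:7936/debug/pprof"
	client := &http.Client{Timeout: 2 * time.Minute}

	// Full goroutine stacks as plain text (debug=2).
	save(client, base+"/goroutine?debug=2", "matching-goroutines.txt")
	// 30-second CPU profile; inspect later with `go tool pprof`.
	save(client, base+"/profile?seconds=30", "matching-cpu.pprof")
}
```

The goroutine dump should show whether there is a large and growing number of TaskQueueManager-related goroutines, and the CPU profile will show where the matching service is spending its time while Cassandra CPU is high.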