We’ve been successfully running Temporal in production for the last month. While most things are going well, there’s one performance behaviour we would like some help with.
We noticed that over some time (about a week), Cassandra CPU usage slowly climbs to 70%. If we recycle the matching-service pods, CPU usage immediately drops to around 20%, and it then takes about another week to climb back to 70%.
Looking further into the Cassandra stats, we noticed that CPU I/O wait dropped considerably when we recycled the matching-service pods. This made us wonder: is matching-service holding onto some latent or old connections? (Please find attached a screenshot of CPU I/O wait dropping immediately after the matching pods restart.)
We would really appreciate some help with this.