I actually fixed it. There was probably a network connectivity problem with one of my nodes. I deleted all the Cadence ReplicaSets, and after that I could no longer reproduce the problem.
I’ve also added short retries on the Cadence client side, so that if one of the services has a communication problem, load balancing can route the retried request to a healthy pod.
I don’t know whether something like Linkerd would help in these situations in production. I don’t currently use it.