Hi all! We’re currently evaluating Temporal for our workflow needs and so far it really checks all the boxes we want - but now that we try to sketch out how it would fit into our current infrastructure architecture, it gets complicated:
Thing is: We run multiple Kubernetes clusters in parallel. For simplicity, let’s just call them A, B and C. All of our apps are deployed to A, B and C and backed by a Cassandra cluster (outside of K8s). All clusters receive production traffic at all times. So at first glance Temporal looked like the perfect fit for us, since we would only run the Temporal “cluster” inside of our Kubernetes cluster - as recommended.
As a first prototype we tried to connect two Temporal “clusters” running in two individual Kubernetes clusters to the same database and quickly ran into the slowness that Maxim already described in this issue:

> Temporal will make sure that data is not corrupted, but such setup will not be functional as these services will be stealing shards from each other all the time. So it will be 100 times slower than a single cluster setup.
In our concrete case this results in workflows being blocked for up to 5 minutes.
But - we don’t want to run multiple Temporal clusters to achieve HA. We just do not want to end up in a “Temporal is running only in Kubernetes cluster A” situation. Neither do we want or need something like a failover setup where we would have to deal with an active/passive Temporal cluster. All apps should just be able to talk to their “cluster-local” Temporal frontend.
The architecture guide mentions horizontal scaling for all the components inside of a Temporal cluster, so our next idea would be to basically “stretch” a Temporal cluster across multiple Kubernetes clusters:
So, instead of running multiple replicas of the worker service in Kubernetes cluster A, we would run 1 replica of the worker in A, B and C each. Same goes for all the other components.
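As a rough sketch of what that would mean per Kubernetes cluster, each Temporal service could be its own single-replica Deployment, applied identically in A, B and C, all pointing at the shared Cassandra. The image tag, the `SERVICES` env var and the Cassandra hostname below are assumptions for illustration, not a tested manifest:

```yaml
# Sketch only: one single-replica Deployment per Temporal service,
# applied identically in clusters A, B and C.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-worker
spec:
  replicas: 1            # one worker replica per Kubernetes cluster
  selector:
    matchLabels:
      app: temporal-worker
  template:
    metadata:
      labels:
        app: temporal-worker
    spec:
      containers:
        - name: temporal
          image: temporalio/server:latest   # assumed image
          env:
            - name: SERVICES
              value: worker                 # likewise: frontend, history, matching
            - name: CASSANDRA_SEEDS
              value: cassandra.example.internal   # shared DB outside K8s (placeholder)
```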
My question is: Are we far off the tracks or is this a reasonable deployment option?