Some deployment questions for a redundant self-hosted deployment

We are currently developing a service that runs a REST API and, on certain requests, needs to trigger long-running external services. For example, when a user registers, we need to initialise certain other services, which might take a couple of minutes. So we are looking into Temporal to execute those as workflows.
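For context, a minimal sketch of how such a start could look from the registration endpoint with the Go SDK (the workflow name, task queue and ID format below are placeholders, not our actual setup; the workflow itself would be implemented and registered on a separate worker process):

```go
package api

import (
	"context"
	"fmt"

	"go.temporal.io/sdk/client"
)

// startOnboarding kicks off the long-running initialisation as a Temporal
// workflow. The workflow is referenced by name here and is implemented and
// registered on a separate worker process.
func startOnboarding(ctx context.Context, c client.Client, userID string) error {
	opts := client.StartWorkflowOptions{
		// Deterministic ID per user, so the same registration cannot start
		// the onboarding twice.
		ID:        "user-onboarding-" + userID,
		TaskQueue: "user-onboarding", // placeholder task queue name
	}
	we, err := c.ExecuteWorkflow(ctx, opts, "UserOnboardingWorkflow", userID)
	if err != nil {
		return fmt.Errorf("starting onboarding workflow: %w", err)
	}
	fmt.Printf("started workflow %s (run %s)\n", we.GetID(), we.GetRunID())
	return nil
}
```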

For the deployment we currently have 3 large servers in our own datacenter (dom1, dom2, dom3) that run multiple services in Linux containers. Each of the doms runs a MariaDB instance (mdb1, mdb2, mdb3), and the three instances are synchronised as a Galera cluster. Each dom also runs an API node (api1, api2, api3) to serve the requests. To make sure that we can survive both a dom failure and a single container failure, a load balancer distributes requests among the healthy API nodes, and each API node is connected to all 3 database nodes.

We are not expecting high throughput on the workflows, and executing them is not very time-critical, but it’s important for us that workflows are started reliably. What we would like is a cluster of Temporal servers that uses our Galera cluster as its backend. On each dom we would like to deploy one frontend, matching, history and worker service for starters (and eventually scale up individual services later if needed). The Temporal services should be connected to the Temporal servers on the other doms, and each Temporal service should be connected to all 3 database nodes. Each API node should then use all 3 Temporal frontends to start the workflows.

Does this deployment make sense (or would it even work like this)?
Can Temporal utilize a MariaDB Galera cluster as datastore (can we just deploy the servers with multiple persistent sql connections)?
And is there a way to utilize the SDK client to connect to multiple Temporal servers for starting workflows? Or would we need to handle those cases ourselves to try a different Temporal server on failure?

Temporal requires strong consistency from the DB. Does the MariaDB Galera cluster provide such a guarantee? If it loses some events during a failover, that might lead to data corruption.

Yes, Galera clusters do provide strong consistency.

Then your setup should work. I think if you care about reliability, having just one instance of each role is not enough, as any time any of the roles goes down, the service will be unavailable until it is restarted.

And is there a way to utilize the SDK client to connect to multiple Temporal servers for starting workflows?

The SDK client connects to multiple instances of the frontend, but not to multiple instances of the service. A single service instance can be pretty highly available if deployed correctly.

We will have all roles (I guess this means frontend, worker, history and matching service?) deployed on all dom servers, so for each role there will be 3 nodes running.

The SDK client connects to multiple instances of the frontend, but not to multiple instances of the service. A single service instance can be pretty highly available if deployed correctly.

Can I specify which service the client is connected to or is this handled by the frontend? E.g. can I have 3 clients initialised on my API node that are connected to different service nodes and if client1 fails, try client2?

You mentioned that you already have a load balancer in front of the frontend nodes. So clients will reconnect to whatever frontend is available.
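On the requirement that workflows are started reliably: starts are also deduplicated by workflow ID, so an API node can safely retry a start that failed because a frontend was briefly unavailable. A rough sketch in Go, with placeholder names and a deliberately naive backoff:

```go
package api

import (
	"context"
	"errors"
	"time"

	"go.temporal.io/api/serviceerror"
	"go.temporal.io/sdk/client"
)

// startOnboardingReliably retries the workflow start a few times. Because the
// workflow ID is fixed per user, a retry that reaches a different frontend
// after a failover cannot create a second onboarding workflow: the server
// deduplicates by workflow ID while the execution is open.
func startOnboardingReliably(ctx context.Context, c client.Client, userID string) error {
	opts := client.StartWorkflowOptions{
		ID:        "user-onboarding-" + userID, // stable, deterministic ID
		TaskQueue: "user-onboarding",           // placeholder task queue name
	}
	var lastErr error
	for attempt := 1; attempt <= 3; attempt++ {
		_, err := c.ExecuteWorkflow(ctx, opts, "UserOnboardingWorkflow", userID)
		if err == nil {
			return nil
		}
		// Depending on SDK options, a duplicate start may surface as an
		// "already started" error; that still means the workflow is running.
		var already *serviceerror.WorkflowExecutionAlreadyStarted
		if errors.As(err, &already) {
			return nil
		}
		lastErr = err
		time.Sleep(time.Duration(attempt) * time.Second) // naive backoff
	}
	return lastErr
}
```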

Ah okay, now I understand. Yeah, the load balancer was only for external requests to our API nodes. But we can set up an internal one to connect our API nodes to the Temporal services as well.

The Temporal gRPC client will load balance using the round_robin policy by default, so there is no hard requirement to have a load balancer in the middle.
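If you ever need to tweak that behaviour, the Go client exposes the raw gRPC dial options; the sketch below just pins the round_robin policy explicitly, which should match the default described above (client.Dial is the current entry point, older SDK versions use client.NewClient; the hostname is a placeholder):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"google.golang.org/grpc"
)

func main() {
	c, err := client.Dial(client.Options{
		// Placeholder address of a frontend (or a name resolving to several).
		HostPort: "temporal-frontend.internal:7233",
		ConnectionOptions: client.ConnectionOptions{
			// Raw gRPC dial options; this pins round_robin explicitly,
			// which matches the SDK default mentioned above.
			DialOptions: []grpc.DialOption{
				grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
			},
		},
	})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()
}
```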

We are using the Go client, which seems to only have a single HostPort connection option (at least for the initial connection), see here. But as I understand it, the gRPC client will hold a connection pool after the initial dial-up? So we would set up a load balancer for the frontend service, use the HostPort of that load balancer for the initial client setup (so we are sure to initially connect), and afterwards the client will hold a connection pool and handle the load balancing by itself?

I’m not a networking expert, but I think if a load balancer is used then all the connections go through it.

This can also help in some cases.

Ah no, it’s just a DNS-based assignment. So it’s basically dns:///temporal-frontend-load-balancer as the HostPort config, where the load balancer would then return an IP address of a healthy Temporal frontend service. The client then connects to the actual Temporal service directly.

I see. The DNS-based assignment does make sense.
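To make that concrete, the client setup on an API node would then look roughly like this (the hostname is the internal DNS name sketched above; ":7233" assumes the default frontend port):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// The dns:/// scheme makes the underlying gRPC client resolve the name
	// to the addresses of the healthy frontends and balance across them.
	c, err := client.Dial(client.Options{
		HostPort:  "dns:///temporal-frontend-load-balancer:7233",
		Namespace: "default",
	})
	if err != nil {
		log.Fatalf("unable to connect to the Temporal frontends: %v", err)
	}
	defer c.Close()

	// The client is then reused by the API node for all workflow starts,
	// e.g. the onboarding start sketched earlier in the thread.
}
```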