Some deployment questions for a redundant self-hosted deployment

We are currently developing a service that runs a REST API and, on certain requests, needs to trigger long-running external services. For example, when a user registers, we need to initialise certain other services, which might take a couple of minutes. So we are looking into Temporal to execute those as workflows.
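For context, a minimal sketch of how such a start could look from the registration endpoint with the Go SDK (the workflow name, task queue and ID format below are placeholders, not our actual setup; the workflow itself would be implemented and registered on a separate worker process):

```go
package api

import (
	"context"
	"fmt"

	"go.temporal.io/sdk/client"
)

// startOnboarding kicks off the long-running initialisation as a Temporal
// workflow. The workflow is referenced by name here and is implemented and
// registered on a separate worker process.
func startOnboarding(ctx context.Context, c client.Client, userID string) error {
	opts := client.StartWorkflowOptions{
		// Deterministic ID per user, so the same registration cannot start
		// the onboarding twice.
		ID:        "user-onboarding-" + userID,
		TaskQueue: "user-onboarding", // placeholder task queue name
	}
	we, err := c.ExecuteWorkflow(ctx, opts, "UserOnboardingWorkflow", userID)
	if err != nil {
		return fmt.Errorf("starting onboarding workflow: %w", err)
	}
	fmt.Printf("started workflow %s (run %s)\n", we.GetID(), we.GetRunID())
	return nil
}
```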

For the deployment we currently have 3 large servers in our own datacenter (dom1, dom2, dom3) that run multiple services in Linux containers. Each of the doms runs a MariaDB instance (mdb1, mdb2, mdb3), and the three instances are synchronised as a Galera cluster. Each dom also runs an API node (api1, api2, api3) to serve the requests. To make sure that we can survive both a dom failure and a single container failure, a load balancer distributes requests among the healthy API nodes, and each API node is connected to all 3 database nodes.

We are not expecting high throughput on the workflows, and executing them is not very time-critical, but it’s important for us that workflows are started reliably. What we would like is a cluster of Temporal servers that uses our Galera cluster as its backend. On each dom we would like to deploy one frontend, matching, history and worker service for starters (and eventually scale up individual services later if needed). The Temporal services should be connected to the Temporal servers on the other doms, and each Temporal service should be connected to all 3 database nodes. Each API node should then use all 3 Temporal frontends to start the workflows.

Does this deployment make sense (or would it even work like this)?
Can Temporal utilize a MariaDB Galera cluster as datastore (can we just deploy the servers with multiple persistent sql connections)?
And is there a way to utilize the SDK client to connect to multiple Temporal servers for starting workflows? Or would we need to handle those cases ourselves to try a different Temporal server on failure?

Temporal requires strong consistency from the DB. Does the MariaDB Galera cluster provide such a guarantee? If it loses some events during a failover, that might lead to data corruption.

Yes, Galera clusters do provide strong consistency.

Then your setup should work. I think if you care about reliability, having just one instance of each role is not enough, as any time any of the roles goes down, the service will be unavailable until it is restarted.

And is there a way to utilize the SDK client to connect to multiple Temporal servers for starting workflows?

The SDK client connects to multiple instances of the frontend, but not to multiple instances of the service. A single service instance can be pretty highly available if deployed correctly.

We will have all roles (I guess this means frontend, worker, history and matching service?) deployed on all dom servers, so for each role there will be 3 nodes running.

The SDK client connects to multiple instances of the frontend, but not to multiple instances of the service. A single service instance can be pretty highly available if deployed correctly.

Can I specify which service the client is connected to or is this handled by the frontend? E.g. can I have 3 clients initialised on my API node that are connected to different service nodes and if client1 fails, try client2?

You mentioned that you already have a load balancer in front of the frontend nodes. So clients will reconnect to whatever frontend is available.
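On the requirement that workflows are started reliably: starts are also deduplicated by workflow ID, so an API node can safely retry a start that failed because a frontend was briefly unavailable. A rough sketch in Go, with placeholder names and a deliberately naive backoff:

```go
package api

import (
	"context"
	"errors"
	"time"

	"go.temporal.io/api/serviceerror"
	"go.temporal.io/sdk/client"
)

// startOnboardingReliably retries the workflow start a few times. Because the
// workflow ID is fixed per user, a retry that reaches a different frontend
// after a failover cannot create a second onboarding workflow: the server
// deduplicates by workflow ID while the execution is open.
func startOnboardingReliably(ctx context.Context, c client.Client, userID string) error {
	opts := client.StartWorkflowOptions{
		ID:        "user-onboarding-" + userID, // stable, deterministic ID
		TaskQueue: "user-onboarding",           // placeholder task queue name
	}
	var lastErr error
	for attempt := 1; attempt <= 3; attempt++ {
		_, err := c.ExecuteWorkflow(ctx, opts, "UserOnboardingWorkflow", userID)
		if err == nil {
			return nil
		}
		// Depending on SDK options, a duplicate start may surface as an
		// "already started" error; that still means the workflow is running.
		var already *serviceerror.WorkflowExecutionAlreadyStarted
		if errors.As(err, &already) {
			return nil
		}
		lastErr = err
		time.Sleep(time.Duration(attempt) * time.Second) // naive backoff
	}
	return lastErr
}
```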

Ah okay, now I understand. Yeah, the load balancer was only for external requests to our API nodes. But we can set up an internal one to connect our API nodes to the Temporal services as well.

The Temporal gRPC client will load balance using the round_robin policy by default, so there is no hard requirement to have a load balancer in the middle.
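If you ever need to tweak that behaviour, the Go client exposes the raw gRPC dial options; the sketch below just pins the round_robin policy explicitly, which should match the default described above (client.Dial is the current entry point, older SDK versions use client.NewClient; the hostname is a placeholder):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"google.golang.org/grpc"
)

func main() {
	c, err := client.Dial(client.Options{
		// Placeholder address of a frontend (or a name resolving to several).
		HostPort: "temporal-frontend.internal:7233",
		ConnectionOptions: client.ConnectionOptions{
			// Raw gRPC dial options; this pins round_robin explicitly,
			// which matches the SDK default mentioned above.
			DialOptions: []grpc.DialOption{
				grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
			},
		},
	})
	if err != nil {
		log.Fatalf("unable to create Temporal client: %v", err)
	}
	defer c.Close()
}
```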

We are using the Go client, which seems to only have a single HostPort connection option (at least for the initial connection), see here. But as I understand it, the gRPC client will hold a connection pool after the initial dial-up? So we would set up a load balancer for the frontend service, use the HostPort of that load balancer for the initial client setup (so we are sure to initially connect), and afterwards the client will hold a connection pool and handle the load balancing by itself?

I’m not a networking expert, but I think if a load balancer is used then all the connections go through it.

This can also help in some cases.

Ah no, it’s just a DNS-based assignment. So it’s basically dns:///temporal-frontend-load-balancer as the HostPort config, where the load balancer would then return an IP address of a healthy Temporal frontend service. The client then connects to the actual Temporal service directly.

I see. The DNS-based assignment does make sense.
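To make that concrete, the client setup on an API node would then look roughly like this (the hostname is the internal DNS name sketched above; ":7233" assumes the default frontend port):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// The dns:/// scheme makes the underlying gRPC client resolve the name
	// to the addresses of the healthy frontends and balance across them.
	c, err := client.Dial(client.Options{
		HostPort:  "dns:///temporal-frontend-load-balancer:7233",
		Namespace: "default",
	})
	if err != nil {
		log.Fatalf("unable to connect to the Temporal frontends: %v", err)
	}
	defer c.Close()

	// The client is then reused by the API node for all workflow starts,
	// e.g. the onboarding start sketched earlier in the thread.
}
```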