Help with scaling our infrastructure

lunne · December 30, 2021, 8:44am

Hi!
We are moving into production with Temporal and everything that comes with it. We are running a hybrid cloud solution with Nomad and Consul. The whole Temporal stack will be running on AWS.

Initially we are moving the current solution over to Temporal which means we need to spawn ~2.6 million workflows to set everything up. We can do this over the period of 1-2 weeks.
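As a rough sanity check on that backfill rate, here is a minimal sketch. It assumes the workflow starts are spread evenly over the whole window, and the function name is only illustrative:

```python
MIGRATION_WORKFLOWS = 2_600_000  # figure from the post above


def starts_per_second(total_workflows: int, days: float) -> float:
    """Average workflow starts per second if spread evenly over `days`."""
    return total_workflows / (days * 24 * 60 * 60)


for days in (7, 14):
    rate = starts_per_second(MIGRATION_WORKFLOWS, days)
    print(f"{days} days: {rate:.2f} workflow starts/s")
```

Under that even-spread assumption the initial migration works out to a few workflow starts per second, which is well above the steady-state rate listed further down.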

Expected Load

We plan to scale up our usage of workflows, and most new development will spawn a lot of new workflows. For the coming year, though, this is the load we currently expect.

The CPU load is not defined in the data because Nomad only sets a “minimum” CPU needed to run the allocation. Memory looks to be the bottleneck, not the CPU.
The AWS servers we are using have the following specs:
Memory: 62,851 MiB,
CPU: 49,600 MHz

The current load in the AWS Cluster:
102.41 GiB of 185.45 GiB,
23.25 GHz of 148.8 GHz
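The utilization behind the "memory is the bottleneck" remark can be worked out from the two cluster figures above. A minimal sketch (variable names are illustrative):

```python
# Cluster figures as reported by Nomad in the post
mem_used_gib, mem_total_gib = 102.41, 185.45
cpu_used_ghz, cpu_total_ghz = 23.25, 148.8

mem_utilization = mem_used_gib / mem_total_gib
cpu_utilization = cpu_used_ghz / cpu_total_ghz

print(f"memory: {mem_utilization:.0%} allocated")
print(f"cpu:    {cpu_utilization:.0%} allocated")
```

Memory is a bit over half allocated while CPU sits far lower, so memory is the resource that would run out first.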

Totals

Long Workflows per month	15 000
Short Activation Workflows per month	76 800
Short other workflows per month	15 750
Total New workflows per month	107 550
New workflows per minute	~ 2.49
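The per-minute figure can be reproduced like this. It is a sketch that assumes a 30-day month, which is what makes the numbers line up:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day month

monthly_workflows = {
    "long": 15_000,
    "short_activation": 76_800,
    "short_other": 15_750,
}

total_per_month = sum(monthly_workflows.values())
per_minute = total_per_month / MINUTES_PER_MONTH

print(total_per_month, f"{per_minute:.2f} new workflows/min")
```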

Secondary actions, deactivations, shelves, etc

Action	Workflows / action	Actions / month	Workflows / month
Suspend	5	230	1 150
Resume	5	160	800
Terminate	5	2 760	13 800
Total			15 750
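The secondary-action total is just workflows per action times actions per month, summed. A small sketch using the table values:

```python
# (workflows per action, actions per month), copied from the table above
secondary_actions = {
    "suspend": (5, 230),
    "resume": (5, 160),
    "terminate": (5, 2_760),
}

workflows_per_month = {
    name: per_action * actions
    for name, (per_action, actions) in secondary_actions.items()
}
total_secondary = sum(workflows_per_month.values())

print(workflows_per_month, total_secondary)
```

The result equals the "Short other workflows per month" line in the totals, which appears to be where that number comes from.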

The current infrastructure

Cassandra

Node	Memory	CPU	Storage	Server
node1	8G, 6gig heap, 2gig newsize	-	AWS Elastic	AWS XL
node2	8G, 6gig heap, 2gig newsize	-	AWS Elastic	AWS XL
node3	8G, 6gig heap, 2gig newsize	-	AWS Elastic	AWS XL

Elastic

Node	Memory	CPU	Storage	Server
node1	8G, “-Xms6144m -Xmx6144m”	-	AWS Elastic	AWS XL
node2	8G, “-Xms6144m -Xmx6144m”	-	AWS Elastic	AWS XL
node3	8G, “-Xms6144m -Xmx6144m”	-	AWS Elastic	AWS XL

Temporal

| Service | Replicas | Memory | CPU | Server |
|---|---|---|---|---|
| frontend | 2 | 512M | - | any |
| history | 2 | 1536M | - | any |
| matching | 2 | 512M | - | any |
| worker | 2 | 256M | - | any |
| web | 1 | 256M | - | any |
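To put the Temporal services next to the data stores, here is a sketch that adds up the memory reserved per tier. It assumes "M" means MiB and "8G" means 8 GiB per node, which the tables do not state explicitly:

```python
# (replicas, memory per replica in MiB), from the Temporal table above
temporal_services = {
    "frontend": (2, 512),
    "history": (2, 1536),
    "matching": (2, 512),
    "worker": (2, 256),
    "web": (1, 256),
}

temporal_mib = sum(replicas * mib for replicas, mib in temporal_services.values())
temporal_gib = temporal_mib / 1024

# Three 8 GiB nodes each for Cassandra and Elastic (assumed units)
cassandra_gib = 3 * 8
elastic_gib = 3 * 8

print(f"temporal:  {temporal_gib:.2f} GiB")
print(f"cassandra: {cassandra_gib} GiB")
print(f"elastic:   {elastic_gib} GiB")
```

Under those assumptions the Temporal services themselves reserve only a small share of the memory; most of it goes to Cassandra and Elastic.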

Do you have any specific recommendations for a setup like this?

maxim · January 3, 2022, 10:38pm

To me, it looks like overkill for such a low load.

lunne · January 4, 2022, 7:16pm

How would you scale this if it were up to you?
From Elastic → Cassandra and Temporal?

We want to keep two instances of each Temporal service (except web) running to enable rolling updates etc.

br,
Mathias

maxim · January 4, 2022, 8:32pm

Are you using Cassandra for replication support? A single MySQL or PostgreSQL would support such load without problem.

lunne · January 5, 2022, 2:23pm

Well.
We wanted to build something that could withstand upgrades/updates and failures. So we started out with the basic thought: let's run 3 of everything. 5 would be better, but 3 is already too much given the expected load.

So we made 3 of most things:

a sharded replicaset of 3 mongos
kafka on 3 nodes
cassandra on 3 hosts
elastic on 3 nodes
nomad, consul and Vault on 3 nodes (separate VPS)

We then stretch our Nomad “cloud” over 3-5 client nodes and allocate our services on any of these nodes. We also have on-premises servers that are connected to this and can be used for allocations.

We could move to MySQL instead, and we have talked about it a lot, but then we would be in a master/slave situation or using a cloud MySQL service. (We also try our best not to lock ourselves into one specific cloud.)

So we will not need to scale anything up in the near future? Could we instead scale down a bit on memory for Cassandra, Elastic, etc.?

maxim · January 5, 2022, 5:53pm

Your original question was about scale. But from the availability point of view, using Cassandra and more than one instance of each service sounds reasonable. I would still perf test your setup to be sure.

lunne · January 5, 2022, 8:38pm

Ok!
We have been testing a bit with temporalio/maru (benchmarks for Temporal workflows), and the throughput has been more than good enough. But as we are deploying so many new things at once, I wanted to run it by you to sleep a bit better at night.
