Help with scaling our infrastructure

Hi!
We are moving into production with Temporal and everything that comes with it. We are running a hybrid cloud solution with Nomad and Consul. The whole Temporal stack will be running on AWS.

Initially we are moving the current solution over to Temporal which means we need to spawn ~2.6 million workflows to set everything up. We can do this over the period of 1-2 weeks.
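
As a rough sanity check on the migration, here is a minimal sketch of the average start rate this implies. It assumes the workflows are started evenly around the clock over the whole window; the function name is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_start_rate(total_workflows: int, days: float) -> float:
    """Average workflow starts per second needed to finish within `days`."""
    return total_workflows / (days * SECONDS_PER_DAY)


TOTAL_WORKFLOWS = 2_600_000  # figure from the post

for days in (7, 14):
    rate = required_start_rate(TOTAL_WORKFLOWS, days)
    print(f"{days} days: {rate:.2f} workflow starts/second")
# 7 days: 4.30 workflow starts/second
# 14 days: 2.15 workflow starts/second
```

If the migration only runs during part of the day, the required rate goes up proportionally.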

Expected Load

We plan to scale up our usage of Workflows, and most new development will spawn a lot of new workflows. For the coming year, though, this is the load we currently expect.

The CPU load is not defined in the data because Nomad only sets a “minimum” CPU needed to run the allocation. Memory looks to be the bottleneck, not the CPU.
The AWS servers we are using have the following specs:
Memory: 62,851 MiB,
CPU: 49,600 MHz

The current load in the AWS Cluster:
102.41 GiB of 185.45 GiB,
23.25 GHz of 148.8 GHz
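
Expressed as percentages, the cluster figures above work out as follows. This is just the arithmetic on the numbers quoted in the post; the helper name is illustrative.

```python
def utilization(used: float, total: float) -> float:
    """Return used/total as a percentage."""
    return 100.0 * used / total


memory_pct = utilization(102.41, 185.45)  # GiB used of GiB total
cpu_pct = utilization(23.25, 148.8)       # GHz used of GHz total

print(f"memory: {memory_pct:.1f}% used")
print(f"cpu:    {cpu_pct:.1f}% used")
```

Memory sits at roughly 55% while CPU is closer to 16%, which matches the observation that memory is the tighter resource.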

Totals

| Workflow type | Workflows per month |
|---|---|
| Long workflows | 15 000 |
| Short activation workflows | 76 800 |
| Short other workflows | 15 750 |
| Total new workflows | 107 550 |

New workflows per minute: ~2.49
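
For reference, the per-minute figure can be reproduced like this. The sketch assumes a 30-day month and an even spread of starts, so real peaks will be higher than the average.

```python
# Monthly workflow counts from the totals above.
monthly_workflows = {
    "long": 15_000,
    "short_activation": 76_800,
    "short_other": 15_750,
}

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day month

total_per_month = sum(monthly_workflows.values())
per_minute = total_per_month / MINUTES_PER_MONTH

print(f"total per month: {total_per_month}")
print(f"average per minute: {per_minute:.2f}")
print(f"average per second: {per_minute / 60:.3f}")
```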

Secondary actions, deactivations, shelves, etc

| Action | Workflows / action | Actions / month | Workflows / month |
|---|---|---|---|
| Suspend | 5 | 230 | 1 150 |
| Resume | 5 | 160 | 800 |
| Terminate | 5 | 2 760 | 13 800 |
| Total | | | 15 750 |
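
The "Workflows / month" column is simply workflows per action times actions per month; a small sketch to check the total:

```python
# (workflows per action, actions per month), taken from the table above.
secondary_actions = {
    "suspend": (5, 230),
    "resume": (5, 160),
    "terminate": (5, 2_760),
}

workflows_per_month = {
    name: per_action * actions
    for name, (per_action, actions) in secondary_actions.items()
}
secondary_total = sum(workflows_per_month.values())

for name, count in workflows_per_month.items():
    print(f"{name}: {count}")
print(f"total: {secondary_total}")
```

This total is the "Short other workflows" row in the totals.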

The current infrastructure

Cassandra

| Node | Memory | CPU | Storage | Server |
|---|---|---|---|---|
| node1 | 8G, 6gig heap, 2gig newsize | - | AWS Elastic | AWS XL |
| node2 | 8G, 6gig heap, 2gig newsize | - | AWS Elastic | AWS XL |
| node3 | 8G, 6gig heap, 2gig newsize | - | AWS Elastic | AWS XL |

Elastic

| Node | Memory | CPU | Storage | Server |
|---|---|---|---|---|
| node1 | 8G, “-Xms6144m -Xmx6144m” | - | AWS Elastic | AWS XL |
| node2 | 8G, “-Xms6144m -Xmx6144m” | - | AWS Elastic | AWS XL |
| node3 | 8G, “-Xms6144m -Xmx6144m” | - | AWS Elastic | AWS XL |

Temporal

| Service | Replicas | Memory | CPU | Server |
|---|---|---|---|---|
| frontend | 2 | 512M | - | any |
| history | 2 | 1536M | - | any |
| matching | 2 | 512M | - | any |
| worker | 2 | 256M | - | any |
| web | 1 | 256M | - | any |
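
To put the Temporal allocations in perspective, here is a quick sketch of the total memory they reserve. It assumes the "M" values are MiB (as in Nomad memory settings) and compares against the 62,851 MiB of a single server from the specs above.

```python
# (replicas, memory per replica in MiB), taken from the table above.
temporal_services = {
    "frontend": (2, 512),
    "history": (2, 1536),
    "matching": (2, 512),
    "worker": (2, 256),
    "web": (1, 256),
}

SERVER_MEMORY_MIB = 62_851  # one AWS server, from the specs above

total_mib = sum(replicas * mem for replicas, mem in temporal_services.values())
share_of_one_server = 100.0 * total_mib / SERVER_MEMORY_MIB

print(f"Temporal services reserve {total_mib} MiB in total")
print(f"that is about {share_of_one_server:.1f}% of one server's memory")
```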

Do you have any specific recommendations for a setup like this?

To me, it looks like overkill for such a low load.

How would you scale this if it were up to you?
From Elastic → Cassandra and Temporal?

We want to keep two of each Temporal service (except web) running to enable rolling updates etc.

br,
Mathias

Are you using Cassandra for replication support? A single MySQL or PostgreSQL instance would support such a load without a problem.

Well.
We wanted to build something that could withstand upgrades/updates and failures. So we started out with the basic thought: let's run 3 of everything. 5 would be better, but… 3 is already too much looking at the expected load.

So we made 3 of most things,

  • a sharded replicaset of 3 mongos
  • kafka on 3 nodes
  • cassandra on 3 hosts
  • elastic on 3 nodes
  • nomad, consul and Vault on 3 nodes (separate VPS)

We then stretch our Nomad “cloud” over 3-5 client nodes and allocate our services on any of these nodes. We also have on-premise servers that are connected to this and can be used for allocations.

We could move to MySQL instead, and we have talked about it a lot, but then we would be in a master/slave situation or using a cloud MySQL service. (We also try our best not to lock ourselves into one specific cloud.)

But we will not need to scale anything up in the near future? Could we instead scale down a bit on memory for Cassandra, Elastic, etc.?

Your original question was about scale. But from the availability point of view, using Cassandra and running more than one instance of each service sounds reasonable. I would still perf test your setup to be sure.

Ok!
We have been testing a bit with temporalio/maru on GitHub (Benchmarks for Temporal workflows),
and the throughput has been more than good enough. But as we are deploying so many new things at once, I wanted to run it by you to sleep a bit better at night. :)
