Need help: Recommended Cluster Configuration in production

frontend 8c 32G x 1; the large memory is needed to handle the volume of long-polling requests.
history 1c 2G x 20, using horizontal scaling to avoid spending too much work on dynamicconfig tuning…
matching 1c 2G x 20,
worker 1c 2G x 3
cassandra 4c 8G x 5
bench-worker 2c 4G x 20; the throughput of the aggregator worker depends on its poller concurrency, so I set the worker options to:
MaxConcurrentActivityTaskPollers: 16
MaxConcurrentDecisionTaskPollers: 32
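For reference, these two settings are fields on the Cadence Go SDK's worker options struct. A minimal self-contained sketch (using a local stand-in struct for illustration; the real fields live on `worker.Options` in `go.uber.org/cadence/worker`):

```go
package main

import "fmt"

// WorkerOptions is a local stand-in mirroring the two poller fields
// being tuned above; in a real worker these would be set on
// worker.Options from go.uber.org/cadence/worker.
type WorkerOptions struct {
	MaxConcurrentActivityTaskPollers int
	MaxConcurrentDecisionTaskPollers int
}

func main() {
	opts := WorkerOptions{
		MaxConcurrentActivityTaskPollers: 16,
		MaxConcurrentDecisionTaskPollers: 32,
	}
	fmt.Printf("activity pollers=%d decision pollers=%d\n",
		opts.MaxConcurrentActivityTaskPollers,
		opts.MaxConcurrentDecisionTaskPollers)
}
```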

Just use basic-load-test-workflow, 16 tasklists and each start a basic workflow using the config below:
"useBasicVisibilityValidation": true,
"contextTimeoutInSeconds": 10,
"failureThreshold": 0.01,
"totalLaunchCount": 10000,
"routineCount": 8,
"waitTimeBufferInSeconds": 30,
"chainSequence": 1,
"concurrentCount": 1,
"payloadSizeBytes": 256,
"executionStartToCloseTimeoutInSeconds": 600
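Assuming each of the 16 tasklists runs this same config (totalLaunchCount of 10000 per tasklist), the overall launch volume works out as:

```go
package main

import "fmt"

func main() {
	// Numbers from the bench setup above: 16 tasklists, each starting
	// a basic workflow test with totalLaunchCount = 10000.
	taskLists := 16
	launchesPerTaskList := 10000
	total := taskLists * launchesPerTaskList
	fmt.Println(total, "workflow launches in total") // 160000
}
```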

Cassandra CPU usage is 80%+, and I'm getting about 600 TPS in total, which is under 50 TPS per core. I don't think that is good enough, so I'm asking for help.
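(For context, the per-core figure follows from the Cassandra sizing above: 5 nodes at 4 cores each.)

```go
package main

import "fmt"

func main() {
	// Cassandra tier from the cluster config above: 4c 8G x 5.
	totalTPS := 600.0
	nodes, coresPerNode := 5, 4
	perCore := totalTPS / float64(nodes*coresPerNode)
	fmt.Printf("%.0f TPS per Cassandra core\n", perCore) // 30, consistent with "< 50"
}
```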

My apologies, we are using Cadence v0.23.2, but I think the tuning method is similar since the architecture is the same.

Not sure whether we can provide specific tuning tips for Cadence; were you able to reach out to their community?

Were you able to use metrics to try to pinpoint possible bottleneck(s)? Useful ones include server metrics (persistence latencies, sync match rate, workflow/shard lock contention) and SDK metrics (activity and workflow task schedule-to-start latencies).
Also, I'm not sure how many history shards you have configured.

The sync match rate is almost 1.0, and the history shard count is 8192.
What TPS should one Cassandra core be expected to achieve under basic-load-test-workflow? E.g., is it 100 with replication factor 1 and 80 with replication factor 3?