We are testing the performance of Temporal with Aurora MySQL 5.7. Our setup is listed below:
DB: r5.4xlarge (16 vCPU, 128 GB memory).
Temporal: deployed on k8s with Istio integration. 2x matching pods, 2x frontend pods, 8x history pods, each with a resource limit of 1000m CPU and 2 GiB memory.
Temporal config: 512 history shards, DB connection pool size 20. We use the default number of task queue partitions.
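For reference, the relevant part of our server configuration looks roughly like this (a sketch following the standard Temporal server YAML layout; everything not mentioned above is elided):

```yaml
persistence:
  numHistoryShards: 512        # fixed at cluster creation
  defaultStore: default
  datastores:
    default:
      sql:
        pluginName: "mysql"
        maxConns: 20           # DB connection pool size per service
        maxIdleConns: 20
```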
Maru config: concurrency 10; ratePerSecond 20, 40, and 80.
Our test scenario: a single workflow with 5 activities. Each activity sleeps for a given period of time, and the sleeps add up to around 1.5 s; the input and output payloads for each activity are 500 bytes each. We use the Maru framework (with a customised workflow and activities) to run the test.
At a rate of 20 workflows/s, DB CPU sits low (~35%) and 4 history pods are enough to handle the load. The p95 latency for StartWorkflow requests is < 100 ms, and p95 latencies for all history persistence operations are < 20 ms.
At a rate of 40 workflows/s with 8x history pods, DB CPU is at 60%. The frontend's p95 StartWorkflow latency is still < 100 ms (which looks suspicious, since the p95 latency of the same operation in the history service is 200 ms). The p95 of other history persistence operations is 50-100 ms.
At a target rate of 80 workflows/s with 8x history pods, DB CPU is still around 60%. The frontend's p95 StartWorkflow latency rises to 450 ms, while p95 latencies of history persistence operations are similar to the 40 wf/s run. The actual creation rate tops out at about 60 workflows/s. We tried increasing the concurrency, but performance got worse.
For all the load test runs, we observed that the DB spends most of its time in the Aurora_redo_log_flush wait event, which indicates heavy write load on Aurora. We tried setting innodb_flush_log_at_trx_commit to 0 (which flushes the redo log roughly once per second instead of on every commit). This significantly reduced the DB CPU usage (to below 30%) while maintaining throughput. But since that setting can lose up to 1 s of committed transactions, it's not acceptable for our use case.
We have several questions:
- Is this performance expected on Aurora MySQL? Are there any more knobs we can turn to improve it?
- It seems the MySQL DB is the bottleneck, and the high Aurora_redo_log_flush wait suggests we won't get better performance by adding more history pods. It has been suggested that we should have ~200 shards per history host. Could you elaborate on why? With the DB as the bottleneck, will increasing the number of history shards help?
We can share the detailed test report if required. Any suggestions are much appreciated.