CPU Utilization with AWS RDS (MySQL)

Hi, has anyone faced an issue with CPU utilization in AWS RDS? Here is my setup:

  • Temporal v1.13.0
  • RDS instance db.t3.small running MySQL 8 (I also tried MySQL 5.7 in RDS but still got high CPU)
  • No traffic is coming in; I just deployed Temporal on 3 EC2 nodes (as containers via Nomad)
  • I ran the mysqladmin status command, and the number of queries per second does not look low to me (a quick way to break it down by statement type is sketched after this list):

Uptime: 24470 Threads: 16 Questions: 15216291 Slow queries: 0 Opens: 907 Flush tables: 1 Open tables: 532 Queries per second avg: 621.834

  • The output of SHOW FULL PROCESSLIST is below:
mysql> show full processlist;
+------+-------------+-------------------+----------+---------+------+----------+-----------------------+
| Id   | User        | Host              | db       | Command | Time | State    | Info                  |
+------+-------------+-------------------+----------+---------+------+----------+-----------------------+
|    4 | rdsadmin    | localhost:29260   | NULL     | Sleep   |    2 |          | NULL                  |
| 2599 | mysql5_root | 10.64.7.18:46378  | NULL     | Query   |    0 | starting | show full processlist |
| 2935 | temporal    | 10.64.7.156:40408 | temporal | Sleep   |   16 |          | NULL                  |
| 2940 | temporal    | 10.64.7.211:37512 | temporal | Sleep   |    3 |          | NULL                  |
| 2941 | temporal    | 10.64.7.156:40492 | temporal | Sleep   |    6 |          | NULL                  |
| 2942 | temporal    | 10.64.7.211:37561 | temporal | Sleep   |    6 |          | NULL                  |
| 2943 | temporal    | 10.64.7.211:37560 | temporal | Sleep   |    6 |          | NULL                  |
| 2944 | temporal    | 10.64.7.211:37564 | temporal | Sleep   |    3 |          | NULL                  |
| 2945 | temporal    | 10.64.7.172:35308 | temporal | Sleep   |    3 |          | NULL                  |
| 2946 | temporal    | 10.64.7.172:35428 | temporal | Sleep   |    7 |          | NULL                  |
| 2947 | temporal    | 10.64.7.172:35426 | temporal | Sleep   |    7 |          | NULL                  |
| 2948 | temporal    | 10.64.7.172:35430 | temporal | Sleep   |    7 |          | NULL                  |
| 2949 | temporal    | 10.64.7.172:35432 | temporal | Sleep   |    1 |          | NULL                  |
| 2950 | temporal    | 10.64.7.172:35450 | temporal | Sleep   |    8 |          | NULL                  |
| 2951 | temporal    | 10.64.7.172:35458 | temporal | Sleep   |   17 |          | NULL                  |
| 2952 | temporal    | 10.64.7.172:35460 | temporal | Sleep   |    8 |          | NULL                  |
| 2953 | temporal    | 10.64.7.172:35464 | temporal | Sleep   |    7 |          | NULL                  |
| 2954 | temporal    | 10.64.7.172:35462 | temporal | Sleep   |    7 |          | NULL                  |
| 2955 | temporal    | 10.64.7.172:35466 | temporal | Sleep   |   17 |          | NULL                  |
| 2956 | temporal    | 10.64.7.156:40732 | temporal | Sleep   |    3 |          | NULL                  |
| 2957 | temporal    | 10.64.7.156:40734 | temporal | Sleep   |    7 |          | NULL                  |
| 2958 | temporal    | 10.64.7.156:40736 | temporal | Sleep   |    3 |          | NULL                  |
| 2959 | temporal    | 10.64.7.156:40738 | temporal | Sleep   |    7 |          | NULL                  |
| 2960 | temporal    | 10.64.7.156:40752 | temporal | Sleep   |   14 |          | NULL                  |
| 2961 | temporal    | 10.64.7.211:37794 | temporal | Sleep   |    6 |          | NULL                  |
| 2962 | temporal    | 10.64.7.156:40810 | temporal | Sleep   |    7 |          | NULL                  |
| 2963 | temporal    | 10.64.7.156:40812 | temporal | Sleep   |    7 |          | NULL                  |
| 2964 | temporal    | 10.64.7.156:40814 | temporal | Sleep   |    6 |          | NULL                  |
| 2965 | temporal    | 10.64.7.211:37870 | temporal | Sleep   |    2 |          | NULL                  |
| 2966 | temporal    | 10.64.7.211:37886 | temporal | Sleep   |    2 |          | NULL                  |
| 2967 | temporal    | 10.64.7.211:37890 | temporal | Sleep   |    2 |          | NULL                  |
| 2968 | temporal    | 10.64.7.211:37892 | temporal | Sleep   |    2 |          | NULL                  |
| 2969 | temporal    | 10.64.7.211:37894 | temporal | Sleep   |    2 |          | NULL                  |
+------+-------------+-------------------+----------+---------+------+----------+-----------------------+
  • I also use another MySQL 8 instance running on GCP Cloud SQL but don’t see this problem there.
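
For reference, the ~622 qps average is just Questions divided by Uptime (15,216,291 / 24,470 s). A quick way to break that rate down by statement type is sketched below; this is only a sketch, and the connection parameters are placeholders:

# Counters are cumulative since startup, so sample twice and compare the deltas
mysqladmin -h your-rds-endpoint -u temporal -p extended-status \
  | grep -E 'Com_(select|insert|update|delete|commit)'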

Can you suggest the minimum resources (CPU/RAM) needed to run Temporal on MySQL?

Thanks

In the same thread:

@Lam_Tran, do you have the top queries that were executed? Or could you share the persistence metrics grouped by operation? If you are using Grafana/Prometheus, the metrics query would look something like: sum by (operation) (rate(persistence_requests[1m])).
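
If Performance Schema is enabled on your RDS instance (it usually is on MySQL 8) and your user can read it, something like the command below should surface the top statements by execution count. This is only a sketch; the connection parameters are placeholders, and the table/columns are the standard performance_schema ones:

mysql -h your-rds-endpoint -u temporal -p -e "
  SELECT digest_text, count_star, sum_timer_wait/1e12 AS total_seconds
  FROM performance_schema.events_statements_summary_by_digest
  ORDER BY count_star DESC LIMIT 10;"
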
Or better yet, if you could take a goroutine dump and share it with us, that would greatly help us identify the problem quickly. Thank you.

Hi @Yimin_Chen

I am checking with my team to get some metrics from RDS (we currently use Datadog to view them).
Btw, how do I enable pprof to take a goroutine dump? I use Nomad to deploy Temporal as containers, and I provision the Temporal configuration through environment variables based on this template: temporal/config_template.yaml at master · temporalio/temporal · GitHub. The template does not include a pprof configuration.

Thanks

Hi @Yimin_Chen

Here are some metrics from our Datadog Agent:



I think the SELECT rate is too high, which could be causing the CPU utilization problem.

We also have another RDS instance (db.t3.large) running MySQL 5.7. Here are some metrics from AWS RDS Performance Insights; hope this helps.

Unfortunately, pprof is disabled by default in production images and even removed from the config completely. To make it work, you need to either modify the generated config/docker.yaml manually or modify config_template.yaml and regenerate the config with dockerize. You need to modify the global section by adding these two lines:

global:
  pprof:
    port: 7936

Once you have the pprof port set up, you can take a goroutine dump by opening a port forward to your container:

kubectl port-forward your_temporal_history_pod_id 7936:7936 -n your_kubernetes_namespace
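
Since your deployment is on Nomad rather than Kubernetes, an SSH tunnel to the EC2 node should work just as well, assuming the container publishes port 7936 on the host (the host address below is a placeholder):

ssh -L 7936:localhost:7936 ec2-user@your_ec2_node_ip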

Then you can fetch the goroutine dump with:

wget 'http://localhost:7936/debug/pprof/goroutine?debug=2' -O goroutine.txt

Hi @alex and @Yimin_Chen

Here is the goroutine.txt:

At first sight, nothing looks bad there.

Let me double-check: after upgrading to 1.13, you see high (almost 100%) CPU load on your MySQL nodes, and Temporal executes something like 6k SELECT queries per second. Did I get that right?

Thanks @alex

Actually, I have seen the CPU load issue since v1.12.3, and after upgrading to v1.13.0 it is still there.

OK, that will definitely help. What was the last version you tried that did not have the issue?

The CPU load issue has been there since the first version of Temporal I used on AWS RDS (1.11.x). It consumes the RDS CPU gradually over time rather than peaking immediately (this happens with MariaDB 10.3, MySQL 5.7, and MySQL 8.0 on different instance types).
When I compare this with the same DB instance type running MySQL 8 on GCP Cloud SQL, there is no issue (Temporal 1.12 and 1.13): CPU stays around 10-20% (the instance has 1 vCPU).

How many pods for frontend/history/matching/worker do you have in your setup? How many history shards do you have in your config? (From the goroutine count, it looks like the default of 4 shards.) And from the goroutine dump, it seems you run all 4 service roles (frontend/history/matching/worker) in one process?

Hi @Yimin_Chen

My setup is:

  • All 4 services (frontend/history/matching/worker) run in one process; I use the temporalio/server:1.13.0 image and do not override the SERVICES env variable
  • There are 3 nodes (EC2 on AWS); each node runs one process as a container (deployed via Nomad)
  • The default shard count (4) is used; a quick way to confirm it from the generated config is sketched below
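
The shard-count check mentioned above is just a grep against the generated config. This assumes the standard temporalio/server image layout, where the rendered config lands in /etc/temporal/config/docker.yaml; the container name is a placeholder:

# Look up the effective numHistoryShards value inside the running container
docker exec your_temporal_container grep numHistoryShards /etc/temporal/config/docker.yaml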

@Lam_Tran, from the goroutine dump, there is nothing out of the ordinary. In fact, there is only one goroutine with a call stack in the persistence layer, and it is idle on the purge timer for namespace replication tasks. There is literally no work happening in the persistence layer when this dump was taken.
Is it possible that, out of your 3 Temporal nodes, only one of them is going crazy, and it was not the one you captured the dump from? Also, could you get some sample queries that were executed against your database? And did you set up metrics? It would be very useful to see which persistence APIs Temporal is calling. A sample query for Prometheus would be sum by (operation) (rate(persistence_requests[1m])).
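
If you already have Prometheus scraping the Temporal metrics endpoint, that query can also be run ad hoc against the Prometheus HTTP API, for example (the Prometheus host below is a placeholder):

curl -G 'http://your_prometheus_host:9090/api/v1/query' \
  --data-urlencode 'query=sum by (operation) (rate(persistence_requests[1m]))'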

Hi @Yimin_Chen

Let me share the goroutine dump output from the 3 nodes first. I will share the database metrics and the application metrics from Prometheus soon, once I finish setting that up.

Node 0: node-0 · GitHub
Node 1: node-1 · GitHub
Node 2: node-2 · GitHub

Hi @Yimin_Chen

Here are the metrics grouped by operation in Prometheus. From this we can see that the rate of the GetTask operation is very high.

Hi @Yimin_Chen

I have built another AWS RDS instance using PostgreSQL and deployed 3 instances of Temporal against it (same deployment stack via Nomad; each container on a node uses the temporalio/server image). The result in Prometheus looks different:

@Lam_Tran, thank you for the info. We will dig into this shortly and get back to you once we find anything.