CPU Utilization with AWS RDS (MySQL)

Hi, has anyone faced an issue with CPU utilization in AWS RDS? Here is my setup:

  • Temporal v1.13.0
  • RDS instance db.t3.small running MySQL 8 (I also tried MySQL 5.7 in RDS but still got high CPU)
  • No traffic is coming in; I just deployed Temporal on 3 EC2 nodes (as containers via Nomad)
  • I ran the mysqladmin status command, and the number of queries per second does not look low to me (a quick way to break it down by statement type is sketched after this list):

Uptime: 24470 Threads: 16 Questions: 15216291 Slow queries: 0 Opens: 907 Flush tables: 1 Open tables: 532 Queries per second avg: 621.834

  • The output of SHOW FULL PROCESSLIST is below:
mysql> show full processlist;
+------+-------------+-------------------+----------+---------+------+----------+-----------------------+
| Id   | User        | Host              | db       | Command | Time | State    | Info                  |
+------+-------------+-------------------+----------+---------+------+----------+-----------------------+
|    4 | rdsadmin    | localhost:29260   | NULL     | Sleep   |    2 |          | NULL                  |
| 2599 | mysql5_root | 10.64.7.18:46378  | NULL     | Query   |    0 | starting | show full processlist |
| 2935 | temporal    | 10.64.7.156:40408 | temporal | Sleep   |   16 |          | NULL                  |
| 2940 | temporal    | 10.64.7.211:37512 | temporal | Sleep   |    3 |          | NULL                  |
| 2941 | temporal    | 10.64.7.156:40492 | temporal | Sleep   |    6 |          | NULL                  |
| 2942 | temporal    | 10.64.7.211:37561 | temporal | Sleep   |    6 |          | NULL                  |
| 2943 | temporal    | 10.64.7.211:37560 | temporal | Sleep   |    6 |          | NULL                  |
| 2944 | temporal    | 10.64.7.211:37564 | temporal | Sleep   |    3 |          | NULL                  |
| 2945 | temporal    | 10.64.7.172:35308 | temporal | Sleep   |    3 |          | NULL                  |
| 2946 | temporal    | 10.64.7.172:35428 | temporal | Sleep   |    7 |          | NULL                  |
| 2947 | temporal    | 10.64.7.172:35426 | temporal | Sleep   |    7 |          | NULL                  |
| 2948 | temporal    | 10.64.7.172:35430 | temporal | Sleep   |    7 |          | NULL                  |
| 2949 | temporal    | 10.64.7.172:35432 | temporal | Sleep   |    1 |          | NULL                  |
| 2950 | temporal    | 10.64.7.172:35450 | temporal | Sleep   |    8 |          | NULL                  |
| 2951 | temporal    | 10.64.7.172:35458 | temporal | Sleep   |   17 |          | NULL                  |
| 2952 | temporal    | 10.64.7.172:35460 | temporal | Sleep   |    8 |          | NULL                  |
| 2953 | temporal    | 10.64.7.172:35464 | temporal | Sleep   |    7 |          | NULL                  |
| 2954 | temporal    | 10.64.7.172:35462 | temporal | Sleep   |    7 |          | NULL                  |
| 2955 | temporal    | 10.64.7.172:35466 | temporal | Sleep   |   17 |          | NULL                  |
| 2956 | temporal    | 10.64.7.156:40732 | temporal | Sleep   |    3 |          | NULL                  |
| 2957 | temporal    | 10.64.7.156:40734 | temporal | Sleep   |    7 |          | NULL                  |
| 2958 | temporal    | 10.64.7.156:40736 | temporal | Sleep   |    3 |          | NULL                  |
| 2959 | temporal    | 10.64.7.156:40738 | temporal | Sleep   |    7 |          | NULL                  |
| 2960 | temporal    | 10.64.7.156:40752 | temporal | Sleep   |   14 |          | NULL                  |
| 2961 | temporal    | 10.64.7.211:37794 | temporal | Sleep   |    6 |          | NULL                  |
| 2962 | temporal    | 10.64.7.156:40810 | temporal | Sleep   |    7 |          | NULL                  |
| 2963 | temporal    | 10.64.7.156:40812 | temporal | Sleep   |    7 |          | NULL                  |
| 2964 | temporal    | 10.64.7.156:40814 | temporal | Sleep   |    6 |          | NULL                  |
| 2965 | temporal    | 10.64.7.211:37870 | temporal | Sleep   |    2 |          | NULL                  |
| 2966 | temporal    | 10.64.7.211:37886 | temporal | Sleep   |    2 |          | NULL                  |
| 2967 | temporal    | 10.64.7.211:37890 | temporal | Sleep   |    2 |          | NULL                  |
| 2968 | temporal    | 10.64.7.211:37892 | temporal | Sleep   |    2 |          | NULL                  |
| 2969 | temporal    | 10.64.7.211:37894 | temporal | Sleep   |    2 |          | NULL                  |
+------+-------------+-------------------+----------+---------+------+----------+-----------------------+
  • I also use another MySQL 8 instance running on GCP Cloud SQL but don’t see this problem there.
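
For reference, the ~622 qps average is just Questions divided by Uptime (15,216,291 / 24,470 s). A quick way to break that rate down by statement type is sketched below; this is only a sketch, and the connection parameters are placeholders:

# Counters are cumulative since startup, so sample twice and compare the deltas
mysqladmin -h your-rds-endpoint -u temporal -p extended-status \
  | grep -E 'Com_(select|insert|update|delete|commit)'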

Can you suggest the minimum resources (CPU/RAM) needed to run Temporal on MySQL?

Thanks

In the same thread:

@Lam_Tran, do you have the top queries that were executed? Or could you share the persistence metrics grouped by operation? If you are using Grafana/Prometheus, the metrics query would look something like: sum by (operation) (rate(persistence_requests[1m])).
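
If Performance Schema is enabled on your RDS instance (it usually is on MySQL 8) and your user can read it, something like the command below should surface the top statements by execution count. This is only a sketch; the connection parameters are placeholders, and the table/columns are the standard performance_schema ones:

mysql -h your-rds-endpoint -u temporal -p -e "
  SELECT digest_text, count_star, sum_timer_wait/1e12 AS total_seconds
  FROM performance_schema.events_statements_summary_by_digest
  ORDER BY count_star DESC LIMIT 10;"
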
Or better yet, if you could take a goroutine dump and share it with us, that would greatly help us identify the problem quickly. Thank you.

Hi @Yimin_Chen

I am checking with my team to get some metrics from RDS (we currently use Datadog to view them).
Btw, how do I enable pprof to take a goroutine dump? I use Nomad to deploy Temporal as containers, and I provision the Temporal configuration through environment variables based on this template: temporal/config_template.yaml at master · temporalio/temporal · GitHub. The template does not include a pprof configuration.

Thanks

Hi @Yimin_Chen

Here are some metrics from our Datadog Agent:



I think the SELECT rate is too high, which could be causing the CPU utilization problem.

We also have another RDS instance (db.t3.large) running MySQL 5.7. Here are some metrics from AWS RDS Performance Insights; hope this helps.

Unfortunately, pprof is disabled by default in production images and even removed from the config completely. To make it work, you need to either modify the generated config/docker.yaml manually or modify config_template.yaml and regenerate the config with dockerize. You need to modify the global section by adding these two lines:

global:
  pprof:
    port: 7936

Once you have the pprof port set up, you can take a goroutine dump by opening a port forward to your container:

kubectl port-forward your_temporal_history_pod_id 7936:7936 -n your_kubernetes_namespace
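
Since your deployment is on Nomad rather than Kubernetes, an SSH tunnel to the EC2 node should work just as well, assuming the container publishes port 7936 on the host (the host address below is a placeholder):

ssh -L 7936:localhost:7936 ec2-user@your_ec2_node_ip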

Then you can fetch the goroutine dump with:

wget 'http://localhost:7936/debug/pprof/goroutine?debug=2' -O goroutine.txt

Hi @alex and @Yimin_Chen

Here is the goroutine.txt:

At first sight, nothing looks bad there.

Let me double-check: after upgrading to 1.13, you see high (almost 100%) CPU load on your MySQL nodes, and Temporal executes something like 6k SELECT queries per second. Did I get that right?

Thanks @alex

Actually, I have seen the CPU load issue since v1.12.3, and after upgrading to v1.13.0 it is still there.

OK, that will definitely help. What was the last version you tried that did not have the issue?

The CPU load issue has been there since the first version of Temporal I used on AWS RDS (1.11.x). It consumes the RDS CPU gradually over time rather than peaking immediately (this happens with MariaDB 10.3, MySQL 5.7, and MySQL 8.0 on different instance types).
When I compare this with the same DB instance type running MySQL 8 on GCP Cloud SQL, there is no issue (Temporal 1.12 and 1.13): CPU stays around 10-20% (the instance has 1 vCPU).

How many pods for frontend/history/matching/worker do you have in your setup? How many history shards do you have in your config? (From the goroutine count, it looks like the default of 4 shards.) And from the goroutine dump, it seems you run all 4 service roles (frontend/history/matching/worker) in one process?

Hi @Yimin_Chen

My setup is:

  • All 4 services (frontend/history/matching/worker) run in one process; I use the temporalio/server:1.13.0 image and do not override the SERVICES env variable
  • There are 3 nodes (EC2 on AWS); each node runs one process as a container (deployed via Nomad)
  • The default shard count (4) is used; a quick way to confirm it from the generated config is sketched below
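
The shard-count check mentioned above is just a grep against the generated config. This assumes the standard temporalio/server image layout, where the rendered config lands in /etc/temporal/config/docker.yaml; the container name is a placeholder:

# Look up the effective numHistoryShards value inside the running container
docker exec your_temporal_container grep numHistoryShards /etc/temporal/config/docker.yaml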

@Lam_Tran, from the goroutine dump, there is nothing out of the ordinary. In fact, there is only one goroutine with a call stack in the persistence layer, and it is idle on the purge timer for namespace replication tasks. There is literally no work happening in the persistence layer when this dump was taken.
Is it possible that, out of your 3 Temporal nodes, only one of them is going crazy, and it was not the one you captured the dump from? Also, could you get some sample queries that were executed against your database? And did you set up metrics? It would be very useful to see which persistence APIs Temporal is calling. A sample query for Prometheus would be sum by (operation) (rate(persistence_requests[1m])).
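
If you already have Prometheus scraping the Temporal metrics endpoint, that query can also be run ad hoc against the Prometheus HTTP API, for example (the Prometheus host below is a placeholder):

curl -G 'http://your_prometheus_host:9090/api/v1/query' \
  --data-urlencode 'query=sum by (operation) (rate(persistence_requests[1m]))'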

Hi @Yimin_Chen

Let me share the goroutine dump output from the 3 nodes first. I will share the database metrics and the application metrics from Prometheus soon, once I finish setting that up.

Node 0: node-0 · GitHub
Node 1: node-1 · GitHub
Node 2: node-2 · GitHub

Hi @Yimin_Chen

Here are the metrics grouped by operation in Prometheus. From this we can see that the rate of the GetTask operation is very high.

Hi @Yimin_Chen

I have built another AWS RDS instance using PostgreSQL and deployed 3 instances of Temporal against it (same deployment stack via Nomad; each container on a node uses the temporalio/server image). The result in Prometheus looks different:

@Lam_Tran, thank you for the info. We will dig into this shortly and get back to you once we find anything.