CPU Utilization with AWS RDS (MySQL)

@Lam_Tran , is the cluster with the problem a newly set up cluster, or does it already have data in it? Does this repro on a different cluster with MySQL as the DB? I'm trying to see whether this is caused by some specific data that triggered some kind of bug. Also, do you have some logs from the server, specifically from the matching service?

@Yimin_Chen

The cluster being used is completely new; there is no data in it except the schema generated by temporal-sql-tools. Also, MySQL/Postgres are separate instances built for Temporal only.

Here are some logs from the Temporal matching service: matching.logs · GitHub
and from all services: temporal.logs · GitHub

@Lam_Tran do you by any chance run multiple Temporal clusters in the same k8s cluster? If you do, you will need to use different ports for ringpop membership communication. Otherwise, pods from the two clusters will steal shards and task queues from each other, causing lots of issues. The current solution is to use a different port for each cluster. See more here: Matching service high QPS on persistence - #11 by Bo_Gao
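For reference, the ringpop membership port for each service lives in that service's rpc block of the static config. A minimal sketch, assuming the standard YAML config layout (the second cluster's port numbers are just example values; the point is that the two clusters must not overlap):

services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933   # cluster A
  history:
    rpc:
      grpcPort: 7234
      membershipPort: 6934
  matching:
    rpc:
      grpcPort: 7235
      membershipPort: 6935
  worker:
    rpc:
      grpcPort: 7239
      membershipPort: 6939

# A second Temporal cluster in the same k8s cluster would use its own set,
# e.g. 6833/6834/6835/6839, so the two rings never see each other's members.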

@Yimin_Chen

I don't use Kubernetes for the deployment stack. Here is my setup (using Nomad) for a Temporal cluster:

  • There are 3 AWS EC2 instances (CentOS 7)
  • On each instance, there is a container running the temporalio/server:1.13.0 image. This way, all 4 components (frontend, matching, worker, history) run in the same container and discover each other via localhost (if I understand the membership communication correctly), and no ports are exposed across hosts.

I just checked the Temporal Helm charts, and the 4 components are designed to run separately in different pods. So now I am splitting the deployment into 4 containers, one per component, to see whether the issue is still there.

It's difficult to say without access, but if there is no cross-host communication, then I suspect you may actually have three independent clusters (one per host) all sharing a single database instance. The logs you have posted indicate that the clusters are busily stealing resource ownership from each other, as they are unaware of their sibling clusters' existence. This would explain the high QPS.
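If it helps, the usual way to make the three hosts form one ring is to bind the services on a reachable interface and advertise each host's routable IP instead of localhost. A minimal sketch of the relevant static-config pieces, assuming the standard YAML layout (the IP is a placeholder for the host's own address):

global:
  membership:
    maxJoinDuration: 30s
    broadcastAddress: "10.64.7.156"   # this host's routable IP, not 127.0.0.1
services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933
      bindOnIP: "0.0.0.0"             # listen on all interfaces, not just localhost
  # history, matching, and worker follow the same pattern on their own ports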

Did you get an improved outcome running in a configuration that mirrors the Helm chart?

Hi @Yimin_Chen @mpm

I have reviewed the current setup and made some changes:

  • Current setup: 3 containers using the temporalio/server image running on 3 different nodes, so as @mpm said, each of the 4 components (worker/history/matching/frontend) may not be aware of its counterparts on the other nodes. They actually discover each other via localhost only, if I understand correctly.
  • I have split this container into 4 separate containers, each running a single component only (same as the Temporal Helm charts; a rough sketch of the split is below the logs), and I can see that they are now aware of the others on all hosts. Please help me check the worker component's logs below:

{"level":"info","ts":"2021-11-03T12:03:44.005+0700","msg":"bootstrap hosts fetched","service":"worker","bootstrap-hostports":"10.64.7.211:6934,10.64.7.156:6933,10.64.7.156:6934,10.64.7.156:6935,10.64.7.172:6939,10.64.7.172:6933,10.64.7.172:6934,10.64.7.211:6935,10.64.7.156:6939,10.64.7.172:6935,10.64.7.211:6939,10.64.7.211:6933","logging-call-at":"rpMonitor.go:263"}
{"level":"info","ts":"2021-11-03T12:03:44.016+0700","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"worker","addresses":["10.64.7.172:7239","10.64.7.211:7239","10.64.7.156:7239"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-11-03T12:03:44.017+0700","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"frontend","addresses":["10.64.7.211:7233","10.64.7.156:7233","10.64.7.172:7233"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-11-03T12:03:44.017+0700","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"matching","addresses":["10.64.7.172:7235","10.64.7.211:7235","10.64.7.156:7235"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-11-03T12:03:44.018+0700","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"history","addresses":["10.64.7.156:7234","10.64.7.172:7234","10.64.7.211:7234"],"logging-call-at":"rpServiceResolver.go:266"}
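For context, this is roughly how the split looks. I actually run it under Nomad, but a hypothetical docker-compose-style sketch shows the idea; the SERVICES environment variable is an assumption about the temporalio/server image's entrypoint, and the static config mount and exposed membership ports are omitted for brevity:

version: "3"
services:
  temporal-frontend:
    image: temporalio/server:1.13.0
    environment:
      - SERVICES=frontend   # run only the frontend service in this container
  temporal-history:
    image: temporalio/server:1.13.0
    environment:
      - SERVICES=history
  temporal-matching:
    image: temporalio/server:1.13.0
    environment:
      - SERVICES=matching
  temporal-worker:
    image: temporalio/server:1.13.0
    environment:
      - SERVICES=worker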

After running for ~4 hours, I don't see the high CPU utilization problem on the AWS RDS (MySQL 5) anymore.

Btw, can you suggest whether I should use MySQL 5.7 or MySQL 8 as the persistence store for Temporal?

Thanks,

Lam.

Please use MySQL 5.7. We are in the process of validating MySQL 8 as a supported version.
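For reference, pointing Temporal at the RDS MySQL instance happens in the persistence section of the static config. A minimal sketch with placeholder endpoint, database names, and credentials (tuning options such as maxConns are omitted):

persistence:
  defaultStore: default
  visibilityStore: visibility
  numHistoryShards: 512
  datastores:
    default:
      sql:
        pluginName: "mysql"
        databaseName: "temporal"
        connectAddr: "my-temporal.xxxxxx.us-east-1.rds.amazonaws.com:3306"   # placeholder RDS endpoint
        connectProtocol: "tcp"
        user: "temporal"
        password: "change-me"
    visibility:
      sql:
        pluginName: "mysql"
        databaseName: "temporal_visibility"
        connectAddr: "my-temporal.xxxxxx.us-east-1.rds.amazonaws.com:3306"
        connectProtocol: "tcp"
        user: "temporal"
        password: "change-me"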

Thanks @Yimin_Chen @alex @mpm for your support!


Would you be willing to share your Nomad job?

We tried the same thing but ran into a weird delay problem. Running one process with all services against the same Cassandra/Elasticsearch database solved the problem, but we would like to run each service in a separate container.


I would also love to see that.

@lunne @jonsve

Here is my Nomad job: temporal.nomad · GitHub

Hope this can help you.

Thanks a lot! I'll look into it!

Quick follow-up question so I can understand the job a bit better: what version of Nomad are you running this on?

@lunne I'm using Nomad 0.11.3. There are 3 EC2 instances, and each of them runs a Nomad server/client.

I see the same problem on 1.15.x. Mine is a k8s setup with 2 workers, 2 frontend, 2 history, and 1 matching.
Not sure what's going wrong.