CPU Utilization with AWS RDS (MySQL)

@Lam_Tran , is the cluster with the problem a newly set up cluster, or does it already have data in it? Does this repro on a different cluster with MySQL as the DB? I'm trying to see whether this is caused by some specific data that triggered some kind of bug. Also, do you have some logs from the server, specifically from the matching service?

@Yimin_Chen

The cluster being used is completely new; there is no data in it except the schema generated by temporal-sql-tools. Also, MySQL/Postgres are separate instances built for Temporal only.

Here are some logs from the Temporal matching service: matching.logs · GitHub
and from all services: temporal.logs · GitHub

@Lam_Tran do you by any chance run multiple Temporal clusters in the same k8s cluster? If you do, you will need to use different ports for ringpop membership communication. Otherwise, pods from the two clusters will steal shards and task queues from each other, causing lots of issues. The current solution is to use a different port for each cluster. See more here: Matching service high QPS on persistence - #11 by Bo_Gao
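For reference, the ringpop membership port for each service lives in that service's rpc block of the static config. A minimal sketch, assuming the standard YAML config layout (the second cluster's port numbers are just example values; the point is that the two clusters must not overlap):

services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933   # cluster A
  history:
    rpc:
      grpcPort: 7234
      membershipPort: 6934
  matching:
    rpc:
      grpcPort: 7235
      membershipPort: 6935
  worker:
    rpc:
      grpcPort: 7239
      membershipPort: 6939

# A second Temporal cluster in the same k8s cluster would use its own set,
# e.g. 6833/6834/6835/6839, so the two rings never see each other's members.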

@Yimin_Chen

I don't use Kubernetes for the deployment stack. Here is my setup (using Nomad) for a Temporal cluster:

  • There are 3 AWS EC2 instances (CentOS 7)
  • On each instance, there is a container running the temporalio/server:1.13.0 image. This way, all 4 components (frontend, matching, worker, history) run in the same container and discover each other via localhost (if I understand the membership communication correctly), and no ports are exposed across hosts.

I just checked the Temporal Helm charts, and the 4 components are designed to run separately in different pods. So now I am splitting the deployment into 4 containers, one per component, to see whether the issue is still there.

It's difficult to say without access, but if there is no cross-host communication, then I suspect you may actually have three independent clusters (one per host) all sharing a single database instance. The logs you have posted indicate that the clusters are busily stealing resource ownership from each other, as they are unaware of their sibling clusters' existence. This would explain the high QPS.
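If it helps, the usual way to make the three hosts form one ring is to bind the services on a reachable interface and advertise each host's routable IP instead of localhost. A minimal sketch of the relevant static-config pieces, assuming the standard YAML layout (the IP is a placeholder for the host's own address):

global:
  membership:
    maxJoinDuration: 30s
    broadcastAddress: "10.64.7.156"   # this host's routable IP, not 127.0.0.1
services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933
      bindOnIP: "0.0.0.0"             # listen on all interfaces, not just localhost
  # history, matching, and worker follow the same pattern on their own ports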

Did you get an improved outcome running in a configuration that mirrors the Helm chart?

Hi @Yimin_Chen @mpm

I have reviewed the current setup and made some changes:

  • Current setup: 3 containers using the temporalio/server image running on 3 different nodes, so as @mpm said, each of the 4 components (worker/history/matching/frontend) may not be aware of its counterparts on the other nodes. They actually discover each other via localhost only, if I understand correctly.
  • I have split this container into 4 separate containers, each running a single component only (same as the Temporal Helm charts; a rough sketch of the split is below the logs), and I can see that they are now aware of the others on all hosts. Please help me check the worker component's logs below:

{"level":"info","ts":"2021-11-03T12:03:44.005+0700","msg":"bootstrap hosts fetched","service":"worker","bootstrap-hostports":"10.64.7.211:6934,10.64.7.156:6933,10.64.7.156:6934,10.64.7.156:6935,10.64.7.172:6939,10.64.7.172:6933,10.64.7.172:6934,10.64.7.211:6935,10.64.7.156:6939,10.64.7.172:6935,10.64.7.211:6939,10.64.7.211:6933","logging-call-at":"rpMonitor.go:263"}
{"level":"info","ts":"2021-11-03T12:03:44.016+0700","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"worker","addresses":["10.64.7.172:7239","10.64.7.211:7239","10.64.7.156:7239"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-11-03T12:03:44.017+0700","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"frontend","addresses":["10.64.7.211:7233","10.64.7.156:7233","10.64.7.172:7233"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-11-03T12:03:44.017+0700","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"matching","addresses":["10.64.7.172:7235","10.64.7.211:7235","10.64.7.156:7235"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-11-03T12:03:44.018+0700","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"history","addresses":["10.64.7.156:7234","10.64.7.172:7234","10.64.7.211:7234"],"logging-call-at":"rpServiceResolver.go:266"}
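For context, this is roughly how the split looks. I actually run it under Nomad, but a hypothetical docker-compose-style sketch shows the idea; the SERVICES environment variable is an assumption about the temporalio/server image's entrypoint, and the static config mount and exposed membership ports are omitted for brevity:

version: "3"
services:
  temporal-frontend:
    image: temporalio/server:1.13.0
    environment:
      - SERVICES=frontend   # run only the frontend service in this container
  temporal-history:
    image: temporalio/server:1.13.0
    environment:
      - SERVICES=history
  temporal-matching:
    image: temporalio/server:1.13.0
    environment:
      - SERVICES=matching
  temporal-worker:
    image: temporalio/server:1.13.0
    environment:
      - SERVICES=worker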

After running for ~4 hours, I don't see the high CPU utilization problem on the AWS RDS (MySQL 5) anymore.

Btw, can you suggest whether I should use MySQL 5.7 or MySQL 8 as the persistence store for Temporal?

Thanks,

Lam.

Please use MySQL 5.7. We are in the process of validating MySQL 8 as a supported version.
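For reference, pointing Temporal at the RDS MySQL instance happens in the persistence section of the static config. A minimal sketch with placeholder endpoint, database names, and credentials (tuning options such as maxConns are omitted):

persistence:
  defaultStore: default
  visibilityStore: visibility
  numHistoryShards: 512
  datastores:
    default:
      sql:
        pluginName: "mysql"
        databaseName: "temporal"
        connectAddr: "my-temporal.xxxxxx.us-east-1.rds.amazonaws.com:3306"   # placeholder RDS endpoint
        connectProtocol: "tcp"
        user: "temporal"
        password: "change-me"
    visibility:
      sql:
        pluginName: "mysql"
        databaseName: "temporal_visibility"
        connectAddr: "my-temporal.xxxxxx.us-east-1.rds.amazonaws.com:3306"
        connectProtocol: "tcp"
        user: "temporal"
        password: "change-me"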

Thanks @Yimin_Chen @alex @mpm for your support!


Would you be willing to share your Nomad job?

We tried the same thing but ran into a weird delay problem. Running one process with all services against the same Cassandra/Elasticsearch database solved the problem, but we would like to run each service in a separate container.


I would also love to see that.

@lunne @jonsve

Here is my Nomad job: temporal.nomad · GitHub

Hope this can help you.

Thanks a lot! I'll look into it!

Quick follow-up question so I can understand the job a bit better: what version of Nomad are you running this on?

@lunne I'm using Nomad 0.11.3. There are 3 EC2 instances, and each of them runs a Nomad server/client.

I see the same problem on 1.15.x. Mine is a k8s setup with 2 workers, 2 frontend, 2 history, and 1 matching.
Not sure what's going wrong.