Possible regression from v1.11.4 to v1.12.x: very high CPU usage in the matching service, without any workload

Hello

I am deploying Temporal using Nomad + Consul (HashiCorp). For context, Nomad is a container orchestrator that can be seen as an alternative to Kubernetes. Up to v1.11.4 everything seems to work fine, but on v1.12.0 or v1.12.1 I see extremely high CPU usage (4000%) in the matching service. From the debug logs, it looks like some kind of infinite-loop situation:

{"level":"debug","ts":"2021-09-27T04:09:02.367Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxOutstandingPolls","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.372Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxOutstandingPolls","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.376Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.380Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxOutstandingPolls","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.384Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.389Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.397Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.411Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.423Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.483Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.488Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxOutstandingPolls","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.492Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxOutstandingPolls","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.497Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxOutstandingPolls","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.501Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxOutstandingPolls","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.505Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.509Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxOutstandingPolls","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.517Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxOutstandingPolls","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.534Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.548Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.605Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.609Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.613Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.617Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.621Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.626Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}
{"level":"debug","ts":"2021-09-27T04:09:02.634Z","msg":"Failed to fetch key from dynamic config","key":"matching.forwarderMaxChildrenPerNode","error":"unable to find key","logging-call-at":"config.go:67"}

For the matching service Docker container, my Nomad config passes the following environment variables:

      env {
        KEYSPACE = var.keyspace
        VISIBILITY_KEYSPACE = var.visibility_keyspace
        SKIP_SCHEMA_SETUP = true
        NUM_HISTORY_SHARDS = var.history_shard_count
        SERVICES = "worker"
        LOG_LEVEL = var.log_level
        DYNAMIC_CONFIG_FILE_PATH = "config/dynamicconfig"
        TEMPORAL_BROADCAST_ADDRESS = "${attr.unique.network.ip-address}"
        BIND_ON_IP = "0.0.0.0"
        PUBLIC_FRONTEND_ADDRESS = "${NOMAD_UPSTREAM_ADDR_temporal_frontend_grpc}"
      }

The file config/dynamicconfig does not exist as of now. I had only kept the variable in case I need it in the future; so far the environment variables have been sufficient for configuration.
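As an aside, those repeated "Failed to fetch key" debug lines come from the two keys having no entry in the dynamic config file. A minimal sketch of a dynamic config file that sets them explicitly, in the list-of-values format Temporal's file-based dynamic config uses (the values shown are illustrative assumptions, not verified defaults):

```yaml
# dynamicconfig sketch: give the two keys the matching service keeps
# asking for an explicit value so the lookups stop missing.
matching.forwarderMaxOutstandingPolls:
  - value: 1
    constraints: {}
matching.forwarderMaxChildrenPerNode:
  - value: 20
    constraints: {}
```

Note that the debug lines themselves are likely a symptom of the hot loop rather than its cause: the config lookup happens on every iteration, so silencing them would not by itself fix the CPU usage.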

Hi @animesh, can you enable profiling by specifying a port, as described in the Configure the Temporal Server documentation,
and then attach a ten-second trace captured with:

curl http://<ip>:<port>/debug/pprof/trace?seconds=10 -o trace.out

This should help us analyze the issue further. Thanks!


Hello @tihomir

I’m sorry if I am missing something obvious here, but how do I get this pprof port configuration in if I am deploying with Docker? It seems docker.yaml gets generated from config_template.yaml, and I can’t find an option to point the server at a different YAML file that I could create myself and mount into the container.

@animesh sorry, I don’t have much experience with Nomad; if anyone does, please help. The idea is to mount a file into the container at deploy time. Is that possible?

I tried that, but I think the Docker entrypoint.sh overwrites it. It has this line:

dockerize -template ./config/config_template.yaml:./config/docker.yaml

So I can’t find a simple way to mount, for example, a custom_config.yaml into the Docker container and tell the Temporal server to use that file instead. It seems the only way would be to build a custom Docker image. Any other ideas?

PS: it would be so much easier if the Temporal team added pprof settings to the official image’s config_template.yaml and made them configurable through environment variables passed to the container.
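For anyone hitting the same wall: Nomad can render and mount a file into the task without a custom image via a template stanza. A hypothetical sketch (the task name, image tag, destination path, and the assumption that the Docker driver allows relative-path volumes are all mine, not from this deployment):

```hcl
task "temporal-matching" {
  driver = "docker"

  # Render an extra config file into the task's local/ directory.
  template {
    destination = "local/base.yaml"
    data        = <<-EOT
      # extra server config goes here
    EOT
  }

  config {
    image = "temporalio/server:1.12.1"
    # Bind the rendered file to the path the server reads config from
    # (requires the Docker driver's volumes option to be enabled).
    volumes = [
      "local/base.yaml:/etc/temporal/config/base.yaml",
    ]
  }
}
```

Whether the server actually picks the file up then depends on how its config loading works, which the next post covers.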

OK, I finally found a way to add custom config. Temporal looks for a base.yaml file and loads it first, after which it loads the file named after the configured environment. Thankfully there is no base.yaml in the Docker image, so I am mounting one to enable pprof.
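A minimal sketch of such a base.yaml, using the server's global.pprof section (the port number is an arbitrary choice; pick one reachable from wherever you run curl):

```yaml
# base.yaml mounted into the container purely to enable pprof.
global:
  pprof:
    port: 7936
```

Since base.yaml is merged first and the environment-specific file is loaded on top of it, keeping only the pprof section here avoids clashing with the generated docker.yaml.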

Here’s a link to the trace file -
https://c91-misc-bucket.s3.ap-south-1.amazonaws.com/local_temporal_config_trace.out

@animesh thanks! I’ll have the server team look at it and report back.

@animesh
we looked at the trace, and it seems to indicate that ringpop maintenance is taking an unusually large proportion of the CPU. Can you check for Tx/Rx drops or network partitions, to see whether there are any network health issues on your end?
Related to that: the traces show proportions of CPU time, not absolute time, so if something else running on the cluster were eating CPU, we would not be able to tell. We don’t suspect there is, but there are limits to what we can see from logs and traces alone, so any additional detail on your end about overall network traffic and CPU usage would be great to have.
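One quick way to check for the Tx/Rx drops mentioned above on a Linux VM is to read the kernel's per-interface counters directly; a sketch (interface names will of course vary per host):

```shell
# /proc/net/dev holds cumulative per-interface counters; the "drop"
# columns appear under both the Receive and Transmit sections.
cat /proc/net/dev

# Per-interface drop counters via sysfs, for every interface present:
for iface in /sys/class/net/*; do
  printf '%s rx_dropped=%s tx_dropped=%s\n' \
    "$(basename "$iface")" \
    "$(cat "$iface/statistics/rx_dropped")" \
    "$(cat "$iface/statistics/tx_dropped")"
done
```

Sampling these counters a few minutes apart shows whether drops are actively accumulating rather than just left over from boot.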

Still looking into the logs and should have more updates soon 🙂

Aah, OK. Well, the network config is exactly the same; the only thing I am changing is the version number. 1.11.4 works absolutely fine. I am literally just changing that to 1.12.x and redeploying with everything else the same, on the same VMs, etc. Reverting to 1.11.4 immediately restores things to normal. And I am seeing this without any workflows running. Let me think about whether there is any other information I can provide.

@animesh thanks for the info! From looking at your logs and some internal discussions, we opened "Remove task polling recursion" (temporalio/temporal Pull Request #2022 on GitHub). The team is still discussing details; follow that PR for more info (and feel free to add your further findings there as well).