Unable to bootstrap ringpop

I was wondering if anyone familiar with ringpop has run into this issue: I’ve set up separate Kubernetes deployments for each service (frontend, history, matching, worker) based on the Helm chart, and think I’m close to getting the kinks worked out… but I’m getting the following error in the logs for each service that I hope someone can help with:

{"level":"error","ts":"2021-02-26T16:23:50.091Z","msg":"unable to bootstrap ringpop. retrying","service":"matching","error":"join duration of 42.351297515s exceeded max 30s","logging-call-at":"ringpop.go:114","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/common/membership.(*RingPop).bootstrap\n\t/temporal/common/membership/ringpop.go:114\ngo.temporal.io/server/common/membership.(*RingPop).Start\n\t/temporal/common/membership/ringpop.go:83\ngo.temporal.io/server/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:120\ngo.temporal.io/server/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:371\ngo.temporal.io/server/service/matching.(*Service).Start\n\t/temporal/service/matching/service.go:100\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:187"}

On the 6th attempt it gives up and the service restarts. I suspect the ringpop timeout is caused by a connection issue, but I don't know how it tries to connect. I've verified that each service can resolve and reach the others by name (e.g. from temporal-history I can 'ping temporal-matching-headless' successfully). When I run netstat on temporal-frontend I see a lot of CLOSE_WAIT connections to the other services once the Recv-Q reaches 34:

tcp 34 0 temporal-frontend-7967c57655-ng7r4:7233 172-17-0-9.temporal-history-headless.default.svc.cluster.local:55062 CLOSE_WAIT
tcp 33 0 temporal-frontend-7967c57655-ng7r4:7233 172-17-0-14.temporal-matching-headless.default.svc.cluster.local:58796 ESTABLISHED
tcp 34 0 temporal-frontend-7967c57655-ng7r4:7233 172-17-0-14.temporal-matching-headless.default.svc.cluster.local:57544 CLOSE_WAIT

Because I don't know what ringpop does, I'm confused about how or why it's failing… I could start digging into the code, but thought I'd see if anyone here could point me in the right direction. Does ringpop need additional ports open on the containers? Is there a specific config that I'm missing?
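For context on the ports question: ringpop isn't a separate deployable service — it's the gossip-based membership library the Temporal services use to discover each other, and each service gossips over its own membership port, separate from the gRPC port. I believe the Helm chart defaults are 6933/6934/6935/6939 for frontend/history/matching/worker, but verify those against your own config. A minimal sketch to probe those ports from inside a pod, with the headless-service names and ports above as assumptions:

```python
import socket

# Assumed headless-service names and default membership ports from the
# Temporal helm chart (frontend 6933, history 6934, matching 6935,
# worker 6939) -- verify both against your own deployments/config.
MEMBERSHIP_TARGETS = {
    "temporal-frontend-headless": 6933,
    "temporal-history-headless": 6934,
    "temporal-matching-headless": 6935,
    "temporal-worker-headless": 6939,
}

def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused, timeouts, DNS failures
        return False

if __name__ == "__main__":
    for host, port in MEMBERSHIP_TARGETS.items():
        status = "reachable" if probe(host, port) else "UNREACHABLE"
        print(f"{host}:{port} {status}")
```

If the gRPC port answers but the membership port doesn't, ringpop can never complete its join, which matches the "join duration exceeded max" error above.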

I’m using docker image 'temporalio/auto-setup:1.7.0'

Some of the config properties I’m using:

# Services settings
# Frontend deployment settings
# Matching deployment settings
# History deployment settings
# Worker deployment settings
# To override the public client host port. (default is $BIND_ON_IP:$FRONTEND_GRPC_PORT)

Ah, I think I resolved it. For anyone who hits this issue, check the TEMPORAL_BROADCAST_ADDRESS env var (global.membership.broadcastAddress) for each deployment. I had set it to 'status.hostIP', which is the node's IP address, when I needed to set it to 'status.podIP'. Ringpop was attempting to reach each service on the same wrong IP…

So in each Deployment manifest I define the env:

      - name: TEMPORAL_BROADCAST_ADDRESS
        valueFrom:
          fieldRef:
            fieldPath: status.podIP

And the cluster seems happy.