Unable to bootstrap ringpop

I was wondering if anyone familiar with ringpop has run into this issue: I've set up separate Kubernetes Deployments for each service (frontend, history, matching, worker) based on the Helm chart, and I think I'm close to getting the kinks worked out, but each service logs the following error that I hope someone can help with:

{"level":"error","ts":"2021-02-26T16:23:50.091Z","msg":"unable to bootstrap ringpop. retrying","service":"matching","error":"join duration of 42.351297515s exceeded max 30s","logging-call-at":"ringpop.go:114","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/common/membership.(*RingPop).bootstrap\n\t/temporal/common/membership/ringpop.go:114\ngo.temporal.io/server/common/membership.(*RingPop).Start\n\t/temporal/common/membership/ringpop.go:83\ngo.temporal.io/server/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:120\ngo.temporal.io/server/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:371\ngo.temporal.io/server/service/matching.(*Service).Start\n\t/temporal/service/matching/service.go:100\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:187"}

On the sixth attempt it gives up and the service restarts. I suspect the ringpop timeout is caused by a connection issue, but I don't know how it tries to connect. I've verified that each service can reach the others by name (e.g. from temporal-history I can ping temporal-matching-headless successfully). When I run netstat on temporal-frontend, I see a lot of CLOSE_WAIT connections to the other services once the Recv-Q reaches 34:

tcp 34 0 temporal-frontend-7967c57655-ng7r4:7233 172-17-0-9.temporal-history-headless.default.svc.cluster.local:55062 CLOSE_WAIT
tcp 33 0 temporal-frontend-7967c57655-ng7r4:7233 172-17-0-14.temporal-matching-headless.default.svc.cluster.local:58796 ESTABLISHED
tcp 34 0 temporal-frontend-7967c57655-ng7r4:7233 172-17-0-14.temporal-matching-headless.default.svc.cluster.local:57544 CLOSE_WAIT

Because I don't know what ringpop does, I'm confused about how or why it's failing. I could start digging into the code, but I thought I'd see if anyone here could point me in the right direction. Does ringpop need additional ports open on the containers? Is there a specific config I'm missing?
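As a side note, ping only exercises DNS and ICMP, so to check whether the membership ports themselves are reachable, a plain TCP connect between pods also works (this assumes nc is available in the image and that the headless service names match the Helm chart defaults):

# run from the temporal-history pod
nc -zv temporal-frontend-headless 6933
nc -zv temporal-matching-headless 6935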

I'm using the Docker image temporalio/auto-setup:1.7.0.

Some of the config properties I’m using:

# Services settings
BIND_ON_IP=0.0.0.0
# Frontend deployment settings
FRONTEND_GRPC_PORT=7233
FRONTEND_MEMBERSHIP_PORT=6933
# Matching deployment settings
MATCHING_GRPC_PORT=7235
MATCHING_MEMBERSHIP_PORT=6935
# History deployment settings
HISTORY_GRPC_PORT=7234
HISTORY_MEMBERSHIP_PORT=6934
# Worker deployment settings
WORKER_GRPC_PORT=7239
WORKER_MEMBERSHIP_PORT=6939
# To override the public client host port. (default is $BIND_ON_IP:$FRONTEND_GRPC_PORT)
PUBLIC_FRONTEND_ADDRESS=temporal-frontend.default.svc.cluster.local:$FRONTEND_GRPC_PORT
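For completeness, the container spec for matching lists both of its ports, roughly like this (trimmed to the relevant fields; containerPort entries are informational in Kubernetes, so the real requirement is just that nothing, e.g. a NetworkPolicy, blocks pod-to-pod traffic on the gRPC and membership ports):

ports:
  - name: grpc
    containerPort: 7235
  - name: membership
    containerPort: 6935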

Ah, I think I resolved it. For anyone who hits this issue, check the TEMPORAL_BROADCAST_ADDRESS env var (global.membership.broadcastAddress) for each Deployment. I had set it to status.hostIP, which is the node's IP address, but I needed to set it to status.podIP. Ringpop was attempting to reach every service at the same wrong IP.

So in each Deployment manifest I define the env var:

- name: TEMPORAL_BROADCAST_ADDRESS
  valueFrom:
    fieldRef:
      fieldPath: status.podIP

And the cluster seems happy.
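To double-check the value at runtime, you can compare the env var inside a pod with the pod IP Kubernetes reports (the pod name below is just a placeholder):

kubectl exec <matching-pod> -- printenv TEMPORAL_BROADCAST_ADDRESS
kubectl get pod <matching-pod> -o jsonpath='{.status.podIP}'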
