Crash loop of history service in K8s cluster

We now have all the services running independently. Each service is configured with its pod IP as the broadcastAddress (following the earlier thread "Unable to bootstrap ringpop").

Logs for each service look like this:

	2021-04-28 13:54:22.043	
temporal-history
preprod
{"level":"info","ts":"2021-04-28T20:54:22.043Z","msg":"Membership heartbeat upserted successfully","service":"history","address":"100.99.199.16","port":6934,"hostId":"e8783778-a863-11eb-bffb-f611ef047b3b","logging-call-at":"rpMonitor.go:222"}
2021-04-28 13:54:22.045	
temporal-history
preprod
{"level":"info","ts":"2021-04-28T20:54:22.045Z","msg":"bootstrap hosts fetched","service":"history","bootstrap-hostports":"100.99.198.10:6933,100.99.199.6:6939,100.99.155.32:6933,100.99.199.16:6934","logging-call-at":"rpMonitor.go:263"}
2021-04-28 13:54:59.323	
temporal-history
preprod
{"level":"error","ts":"2021-04-28T20:54:59.323Z","msg":"unable to bootstrap ringpop. retrying","service":"history","error":"join duration of 37.278092306s exceeded max 30s","logging-call-at":"ringpop.go:114","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/membership.(*RingPop).bootstrap\n\t/temporal/common/membership/ringpop.go:114\ngo.temporal.io/server/common/membership.(*RingPop).Start\n\t/temporal/common/membership/ringpop.go:83\ngo.temporal.io/server/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:120\ngo.temporal.io/server/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:371\ngo.temporal.io/server/service/history.(*Service).Start\n\t/temporal/service/history/service.go:149\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:192"}
2021-04-28 13:40:21.544	
temporal-matching
preprod
{"level":"error","ts":"2021-04-28T20:40:21.544Z","msg":"unable to bootstrap ringpop. retrying","service":"matching","error":"join duration of 38.590264357s exceeded max 30s","logging-call-at":"ringpop.go:114","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/membership.(*RingPop).bootstrap\n\t/temporal/common/membership/ringpop.go:114\ngo.temporal.io/server/common/membership.(*RingPop).Start\n\t/temporal/common/membership/ringpop.go:83\ngo.temporal.io/server/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:120\ngo.temporal.io/server/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:371\ngo.temporal.io/server/service/matching.(*Service).Start\n\t/temporal/service/matching/service.go:102\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:192"}
2021-04-28 13:40:28.418	
temporal-matching
preprod
{"level":"error","ts":"2021-04-28T20:40:28.418Z","msg":"unable to bootstrap ringpop. retrying","service":"matching","error":"join duration of 42.19803248s exceeded max 30s","logging-call-at":"ringpop.go:114","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/membership.(*RingPop).bootstrap\n\t/temporal/common/membership/ringpop.go:114\ngo.temporal.io/server/common/membership.(*RingPop).Start\n\t/temporal/common/membership/ringpop.go:83\ngo.temporal.io/server/common/membership.(*ringpopMonitor).Start\n\t/temporal/common/membership/rpMonitor.go:120\ngo.temporal.io/server/common/resource.(*Impl).Start\n\t/temporal/common/resource/resourceImpl.go:371\ngo.temporal.io/server/service/matching.(*Service).Start\n\t/temporal/service/matching/service.go:102\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:192"}

etc.

We can telnet from one host to another using the IPs and the membership ports. Our outbound requests go through Istio, but since we're using IP addresses, we don't expect anything there to have an impact. We're not sure what to try next.