Unable to bootstrap Ringpop when STRICT mTLS is enabled in Istio

Hi Team,

We are currently trying to introduce Temporal workflows into our production application, and for that I am deploying a self-hosted Temporal server onto a GKE cluster using the Helm charts, with a PostgreSQL database.

The server deployment is successful and all the services talk to each other without any issues when Istio mTLS is set to PERMISSIVE mode.
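
For context, the mTLS mode is switched with a namespace-wide PeerAuthentication along these lines (a rough sketch; the resource name and namespace-wide scope are assumptions, not the exact manifest we apply):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default          # assumed name; policy applies to the whole namespace
  namespace: temp1
spec:
  mtls:
    mode: PERMISSIVE     # everything bootstraps fine like this; it fails once switched to STRICT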

But the Temporal services are unable to talk to each other when Istio mTLS is set to STRICT mode; they fail with an "unable to bootstrap ringpop" error.

{"level":"info","ts":"2023-08-23T16:39:04.597Z","msg":"Membership heartbeat upserted successfully","address":"10.68.0.151","port":6939,"hostId":"92680b58-41d3-11ee-b3b3-56731cb242a4","logging-call-at":"monitor.go:256"}
{"level":"info","ts":"2023-08-23T16:39:04.601Z","msg":"bootstrap hosts fetched","bootstrap-hostports":"10.68.0.148:6934,10.68.0.147:6933,10.68.0.149:6935,10.68.0.151:6939","logging-call-at":"monitor.go:298"}
{"level":"warn","ts":"2023-08-23T16:39:50.772Z","msg":"unable to bootstrap ringpop. retrying","error":"join duration of 46.17039065s exceeded max 30s","logging-call-at":"ringpop.go:110"}
{"level":"info","ts":"2023-08-23T16:39:59.947Z","msg":"bootstrap hosts fetched","bootstrap-hostports":"10.68.0.148:6934,10.68.0.149:6935,10.68.0.151:6939","logging-call-at":"monitor.go:298"}
{"level":"error","ts":"2023-08-23T16:40:04.587Z","msg":"start failed","component":"fx","error":"context deadline exceeded","logging-call-at":"fx.go:1120","stacktrace":"go.temporal.io/server/common/log

I looked at the other discussions about setting POD_IP as the broadcast address, and also the suggestion to set appProtocol: tcp on the service ports, but I am still seeing the same issue.

https://community.temporal.io/t/temporal-workload-unable-to-talk-to-each-other-when-strct-mtls-enabled-in-istio/6650
https://community.temporal.io/t/unable-to-bootstrap-ringpop/1597

Here are the config, service, and deployment manifests.

temporaltest-worker-headless service

apiVersion: v1
kind: Service
metadata:
  annotations:
    cloud.google.com/neg: '{"ingress":true}'
    meta.helm.sh/release-name: temporaltest
    meta.helm.sh/release-namespace: temp1
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  creationTimestamp: "2023-08-23T15:32:50Z"
  labels:
    app.kubernetes.io/component: worker
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: temporal
  name: temporaltest-worker-headless
  namespace: temp1
  resourceVersion: "685687"
  uid: 858fc87e-afff-440c-9e6c-69aad7a270ea
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: tcp
    name: grpc-rpc
    port: 7239
    protocol: TCP
    targetPort: rpc
  - appProtocol: http
    name: metrics
    port: 9090
    protocol: TCP
    targetPort: metrics
  publishNotReadyAddresses: true
  selector:
    app.kubernetes.io/component: worker
    app.kubernetes.io/name: temporal
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

ConfigMap

global:
  membership:
    name: temporal
    maxJoinDuration: 30s
    broadcastAddress: {{ default .Env.POD_IP "0.0.0.0" }}

  pprof:
    port: 7936

  metrics:
    tags:
      type: worker
    prometheus:
      timerType: histogram
      listenAddress: "0.0.0.0:9090"

services:
  frontend:
    rpc:
      grpcPort: 7233
      membershipPort: 6933
      bindOnIP: "0.0.0.0"

  history:
    rpc:
      grpcPort: 7234
      membershipPort: 6934
      bindOnIP: "0.0.0.0"

  matching:
    rpc:
      grpcPort: 7235
      membershipPort: 6935
      bindOnIP: "0.0.0.0"

  worker:
    rpc:
      grpcPort: 7239
      membershipPort: 6939
      bindOnIP: "0.0.0.0"

worker Deployment

 spec:
      containers:
      - args:
        - sleep 10
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: ENABLE_ES
        - name: ES_SEEDS
          value: elasticsearch-master-headless
        - name: ES_PORT
          value: "9200"
        - name: ES_VERSION
          value: v7
        - name: ES_SCHEME
          value: http
        - name: ES_VIS_INDEX
          value: temporal_visibility_v1_dev
        - name: ES_USER
        - name: ES_PWD
        - name: SERVICES
          value: worker
        - name: SQL_TLS
          value: "true"
        - name: SQL_TLS_DISABLE_HOST_VERIFICATION
          value: "true"
        - name: TEMPORAL_STORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: temporal-default-store
        - name: TEMPORAL_VISIBILITY_STORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: temporal-visibility-store
        image: temporalio/server:1.21.3
        imagePullPolicy: IfNotPresent
        name: temporal-worker
        ports:
        - containerPort: 7239
          name: rpc
          protocol: TCP
        - containerPort: 9090
          name: metrics
          protocol: TCP
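
For what it's worth, with the downward-API POD_IP above, the broadcastAddress template in the ConfigMap should render to the pod's own IP, e.g. (illustrative only, using the address from the heartbeat log):

global:
  membership:
    name: temporal
    maxJoinDuration: 30s
    broadcastAddress: 10.68.0.151   # example value; matches the "address" in the heartbeat log above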

Please check and advise if something needs to be fixed in the config. Let me know if any additional information is needed.

Thanks in advance.

I don’t see the membership port in your Service:

  - name: grpc-membership
    port: 6933
    appProtocol: tcp
    protocol: TCP
    targetPort: membership

From what I remember, this is not part of the Helm chart and I had to locate the membership port in the Temporal config and add it myself for each Temporal Service resource.
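
Going by the membershipPort values in your ConfigMap (frontend 6933, history 6934, matching 6935, worker 6939), each per-service Service needs its own entry. As a sketch, the frontend one would look something like this (assuming the pod template also exposes a containerPort named membership):

ports:
- appProtocol: tcp
  name: grpc-rpc
  port: 7233
  protocol: TCP
  targetPort: rpc
- appProtocol: tcp
  name: grpc-membership
  port: 6933               # frontend membershipPort from your ConfigMap
  protocol: TCP
  targetPort: membership   # needs a matching named containerPort in the Deployment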

@craigd I added the membership port to all the Temporal services, but I am still seeing the same "unable to bootstrap ringpop" issue. Here are the rendered ports for the worker headless Service:

  ports:
  - appProtocol: tcp
    name: grpc-rpc
    port: 7239
    protocol: TCP
    targetPort: rpc
  - appProtocol: tcp
    name: grpc-membership
    port: 6939
    protocol: TCP
    targetPort: membership
  - appProtocol: http
    name: metrics
    port: 9090
    protocol: TCP
    targetPort: metrics
  publishNotReadyAddresses: true
And here is the updated Service template in the Helm chart:

spec:
  type: ClusterIP
  clusterIP: None
  publishNotReadyAddresses: true
  ports:
    - port: {{ $serviceValues.service.port }}
      targetPort: rpc
      appProtocol: tcp
      protocol: TCP
      name: grpc-rpc
    - name: grpc-membership
      port: {{ $serviceValues.service.membershipPort }}
      appProtocol: tcp
      protocol: TCP
      targetPort: membership
    - port: 9090
      targetPort: metrics
      appProtocol: http
      protocol: TCP
      name: metrics
  selector:
    app.kubernetes.io/name: {{ include "temporal.name" $ }}
    # app.kubernetes.io/instance: {{ $.Release.Name }}
    app.kubernetes.io/component: {{ $service }}
{"level":"info","ts":"2023-08-24T12:57:07.834Z","msg":"Membership heartbeat upserted successfully","address":"10.68.0.172","port":6939,"hostId":"bb5fd046-427d-11ee-bc02-8eb23dc129af","logging-call-at":"monitor.go:256"}
{"level":"info","ts":"2023-08-24T12:57:07.843Z","msg":"bootstrap hosts fetched","bootstrap-hostports":"10.68.0.170:6935,10.68.0.172:6939","logging-call-at":"monitor.go:298"}
{"level":"warn","ts":"2023-08-24T12:57:42.907Z","msg":"unable to bootstrap ringpop. retrying","error":"join duration of 35.060116221s exceeded max 30s","logging-call-at":"ringpop.go:110"}
{"level":"info","ts":"2023-08-24T12:57:52.223Z","msg":"bootstrap hosts fetched","bootstrap-hostports":"10.68.0.170:6935,10.68.0.172:6939,10.68.0.173:6933,10.68.0.175:6934","logging-call-at":"monitor.go:298"}
{"level":"error","ts":"2023-08-24T12:58:07.817Z","msg":"start failed","component":"fx","error":"context deadline exceeded",

Do you see any other change that’s missing or anything that’s incorrectly configured?

Thanks

Afraid not. Behaviour will differ between Istio versions. Your next step might be to look at the Envoy logs on the sidecar to see what is happening with the traffic.
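
If it helps, one way to get more detail out of those sidecar logs is to raise the proxy log level on the affected pod template, something like this (a sketch; assumes your Istio version honours the standard sidecar annotation):

# Hypothetical addition to the affected Deployment's pod template while debugging
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/logLevel: debug   # makes the istio-proxy (Envoy) container log more verbosely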

Due to the overhead for our small team of supporting Istio in STRICT mode alongside many 3rd-party applications, we actually dropped it from our solution in favour of NetworkPolicies, so I am not running it in any of our environments.

Sorry I couldn't be of more help.

Thank you @craigd. Really appreciate the quick response.

I will dig more into the issue and see if I can find a root cause.

@rahulk Were you able to find a fix for this? We are in the same boat and are still facing the same issue, even after modifying the Helm chart to include the membershipPort in each Service.