EKS Deployment: Worker Crash Loop

Anyone have any ideas what could be causing this failure in the worker pod?

{"level":"info","ts":"2021-08-13T14:28:32.717Z","msg":"Service resources started","service":"worker","address":"10.220.68.135:7239","logging-call-at":"resourceImpl.go:406"}
{"level":"info","ts":"2021-08-13T14:28:32.723Z","msg":"Get dynamic config","name":"worker.executionsScannerEnabled","value":false,"default-value":false,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-08-13T14:28:32.723Z","msg":"Get dynamic config","name":"worker.historyScannerEnabled","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"fatal","ts":"2021-08-13T14:28:42.904Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:242","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:151\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:242\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:163\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:231"}

So, I think it has to do with my attempt to configure a multi-cluster setup. When I use the following clusterMetadata, the worker (in east1) fails to start, which I suspect is because the scanner can't connect to the temporaltest-east2 cluster. When I comment out the temporaltest-east2 clusterInformation entry, the worker starts successfully.

    clusterMetadata:
      enableGlobalDomain: true
      failoverVersionIncrement: 10
      masterClusterName: "temporaltest-east1"
      currentClusterName: "temporaltest-east1"
      clusterInformation:
        temporaltest-east1:
          enabled: true
          initialFailoverVersion: 1
          rpcName: "temporaltest-frontend-east1"
          ##
          # Use cluster-local host:port for the current cluster
          ##
          rpcAddress: "temporaltest-frontend:7233"
        temporaltest-east2:
          enabled: true
          initialFailoverVersion: 2
          rpcName: "temporaltest-frontend-east2"
          rpcAddress: "dns:///temporaltest-frontend-east2.[restofdsnnamehere]:433"
      replicationConsumer:
        type: rpc

My question is: should the rpcAddress point to the frontend RPC port or the membership port? It currently points at the RPC port via an ALB-fronted DNS entry that routes to the frontend service (ALB:443 > EKS SVC:7233 > frontend pod:7233).
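
For what it's worth, here's roughly how I've been sanity-checking whether each frontend is reachable from inside the worker pod. This is just a sketch: it assumes grpcurl is available in (or copied into) the pod, that the frontend has gRPC reflection enabled, and that the health service is registered under the WorkflowService name — none of which I've confirmed:

    # In-cluster frontend for the current cluster (plain gRPC on 7233)
    grpcurl -plaintext temporaltest-frontend:7233 grpc.health.v1.Health/Check

    # Remote cluster's frontend through the ALB (TLS on 443, per the routing above);
    # hostname is the same placeholder as in the config
    grpcurl -d '{"service": "temporal.api.workflowservice.v1.WorkflowService"}' \
      temporaltest-frontend-east2.[restofdsnnamehere]:443 \
      grpc.health.v1.Health/Check

If the second call hangs for about 10 seconds before failing, that would line up with the "context deadline exceeded" the scanner hits on startup.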

@arnesenfamily - I am also facing a similar issue with the worker service deployed as an individual pod. What is the solution for this?

{"level":"info","ts":"2021-09-06T12:42:05.131Z","msg":"Service resources started","service":"worker","address":"100.127.29.47:7239","logging-call-at":"resourceImpl.go:406"}
{"level":"info","ts":"2021-09-06T12:42:05.153Z","msg":"Get dynamic config","name":"worker.executionsScannerEnabled","value":false,"default-value":false,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-09-06T12:42:05.153Z","msg":"Get dynamic config","name":"worker.taskQueueScannerEnabled","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"debug","ts":"2021-09-06T12:42:05.209Z","msg":"Membership heartbeat upserted successfully","service":"worker","address":"100.127.29.47","port":6939,"hostId":"d72392a2-0f0f-11ec-bb3f-2e171ec82380","logging-call-at":"rpMonitor.go:163"}
{"level":"fatal","ts":"2021-09-06T12:42:15.289Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:242","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:151\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:242\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:163\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:231"}

I'm actually not sure what causes that specific issue; I haven't dug into the root cause yet. So far, I just run helm uninstall, wait a few minutes, and then helm install again. I suspect it's related to this issue (or this issue), which I'm seeing when pods start. Unfortunately, I don't have direct control over the aws-node DaemonSet or the ingress controller config beyond the ingress annotations that are exposed. I have a team looking into whether the IP address allocation to pods can be resolved.
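
For reference, the workaround plus the checks I've been running look roughly like this. The release name temporaltest, the chart path, and the aws-node label selector are assumptions on my part — adjust them to your install:

    # Workaround, not a fix: tear down, wait for IPs/ENIs to be released, then reinstall
    helm uninstall temporaltest
    sleep 300
    helm install temporaltest ./helm-charts   # path to your copy of the Temporal chart

    # When the pods come back up, look for sandbox/IP-allocation errors
    kubectl get events --sort-by=.lastTimestamp | grep -iE 'failedcreatepodsandbox|ip address'
    kubectl logs -n kube-system -l k8s-app=aws-node --tail=100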