EKS Deployment: Worker Crash Loop

Anyone have any ideas what could be causing this failure in the worker pod?

{"level":"info","ts":"2021-08-13T14:28:32.717Z","msg":"Service resources started","service":"worker","address":"10.220.68.135:7239","logging-call-at":"resourceImpl.go:406"}
{"level":"info","ts":"2021-08-13T14:28:32.723Z","msg":"Get dynamic config","name":"worker.executionsScannerEnabled","value":false,"default-value":false,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-08-13T14:28:32.723Z","msg":"Get dynamic config","name":"worker.historyScannerEnabled","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"fatal","ts":"2021-08-13T14:28:42.904Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:242","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:151\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:242\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:163\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:231"}

So, I think it has to do with my attempt to configure a multi-cluster setup. When I use the following clusterMetadata, the worker (in east1) fails to start, which I suspect is because the scanner can't connect to the temporaltest-east2 cluster. When I comment out the temporaltest-east2 clusterInformation entry, the worker starts successfully.

    clusterMetadata:
      enableGlobalDomain: true
      failoverVersionIncrement: 10
      masterClusterName: "temporaltest-east1"
      currentClusterName: "temporaltest-east1"
      clusterInformation:
        temporaltest-east1:
          enabled: true
          initialFailoverVersion: 1
          rpcName: "temporaltest-frontend-east1"
          ##
          # Use cluster-local host:port for the current cluster
          ##
          rpcAddress: "temporaltest-frontend:7233"
        temporaltest-east2:
          enabled: true
          initialFailoverVersion: 2
          rpcName: "temporaltest-frontend-east2"
          rpcAddress: "dns:///temporaltest-frontend-east2.[restofdsnnamehere]:433"
      replicationConsumer:
        type: rpc

My question is: should the rpcAddress point to the frontend RPC port or the membership port? It currently points at the RPC port via an ALB-fronted DNS entry that routes to the frontend service (ALB:443 > EKS SVC:7233 > frontend pod:7233).
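
For what it's worth, here's roughly how I've been sanity-checking whether each frontend is reachable from inside the worker pod. This is just a sketch: it assumes grpcurl is available in (or copied into) the pod, that the frontend has gRPC reflection enabled, and that the health service is registered under the WorkflowService name — none of which I've confirmed:

    # In-cluster frontend for the current cluster (plain gRPC on 7233)
    grpcurl -plaintext temporaltest-frontend:7233 grpc.health.v1.Health/Check

    # Remote cluster's frontend through the ALB (TLS on 443, per the routing above);
    # hostname is the same placeholder as in the config
    grpcurl -d '{"service": "temporal.api.workflowservice.v1.WorkflowService"}' \
      temporaltest-frontend-east2.[restofdsnnamehere]:443 \
      grpc.health.v1.Health/Check

If the second call hangs for about 10 seconds before failing, that would line up with the "context deadline exceeded" the scanner hits on startup.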

@arnesenfamily - I am also facing a similar issue with the worker service deployed as an individual pod. What is the solution for this?

{"level":"info","ts":"2021-09-06T12:42:05.131Z","msg":"Service resources started","service":"worker","address":"100.127.29.47:7239","logging-call-at":"resourceImpl.go:406"}
{"level":"info","ts":"2021-09-06T12:42:05.153Z","msg":"Get dynamic config","name":"worker.executionsScannerEnabled","value":false,"default-value":false,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-09-06T12:42:05.153Z","msg":"Get dynamic config","name":"worker.taskQueueScannerEnabled","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"debug","ts":"2021-09-06T12:42:05.209Z","msg":"Membership heartbeat upserted successfully","service":"worker","address":"100.127.29.47","port":6939,"hostId":"d72392a2-0f0f-11ec-bb3f-2e171ec82380","logging-call-at":"rpMonitor.go:163"}
{"level":"fatal","ts":"2021-09-06T12:42:15.289Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:242","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:151\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:242\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:163\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:231"}

I'm actually not sure what causes that specific issue; I haven't dug into the root cause yet. So far, I just run helm uninstall, wait a few minutes, and then helm install again. I suspect it's related to this issue (or this issue), which I'm seeing when pods start. Unfortunately, I don't have direct control over the aws-node DaemonSet or the ingress controller config beyond the ingress annotations that are exposed. I have a team looking into whether the IP address allocation to pods can be resolved.
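
For reference, the workaround plus the checks I've been running look roughly like this. The release name temporaltest, the chart path, and the aws-node label selector are assumptions on my part — adjust them to your install:

    # Workaround, not a fix: tear down, wait for IPs/ENIs to be released, then reinstall
    helm uninstall temporaltest
    sleep 300
    helm install temporaltest ./helm-charts   # path to your copy of the Temporal chart

    # When the pods come back up, look for sandbox/IP-allocation errors
    kubectl get events --sort-by=.lastTimestamp | grep -iE 'failedcreatepodsandbox|ip address'
    kubectl logs -n kube-system -l k8s-app=aws-node --tail=100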