Temporal worker pod liveness probe failure

In our Temporal deployment (not Helm, plain k8s manifests managed through Flux), the worker pod is being restarted regularly due to liveness probe failures.
Workflows run fine, so it's not causing any problems.

~ $ kubectl get po -n temporal
NAME                                    READY   STATUS    RESTARTS   AGE
temporal-admin-tools-5df46645cb-7fvzm   1/1     Running   0          33h
temporal-frontend-7787fd7d48-l4qzv      1/1     Running   0          20h
temporal-history-f6f948749-szx98        1/1     Running   0          20h
temporal-matching-d98cd66dd-nt2kk       1/1     Running   0          20h
temporal-web-65bc777746-f4tvb           1/1     Running   0          20h
temporal-worker-57b4557498-9gh26        1/1     Running   48         20h

~ $ kubectl describe po -n temporal temporal-worker-57b4557498-9gh26
Name:         temporal-worker-57b4557498-9gh26
Namespace:    temporal
[...]
Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  5m14s (x144 over 19h)  kubelet  Liveness probe failed: dial tcp 11.32.104.6:7239: i/o timeout
  Normal   Killing    5m14s (x48 over 19h)   kubelet  Container temporal-worker failed liveness probe, will be restarted
  Normal   Pulled     4m44s (x48 over 19h)   kubelet  Container image "remote-docker.artifactory.swisscom.com/temporalio/server:1.10.5" already present on machine
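
The liveness probe in our worker Deployment is a plain TCP check against the worker's gRPC port 7239, which matches the "dial tcp …:7239" in the events above. Roughly like this (the thresholds shown are illustrative, not the exact values from our manifest):

livenessProbe:
  tcpSocket:
    port: 7239            # worker gRPC port, see "Created gRPC listener" in the logs below
  initialDelaySeconds: 150
  periodSeconds: 10
  timeoutSeconds: 1       # a short dial timeout is what produces the "i/o timeout" message
  failureThreshold: 3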

Pod logs:

~ $ kubectl logs -n temporal temporal-worker-57b4557498-9gh26
2021/07/09 05:20:39 Loading config; env=docker,zone=,configDir=config
2021/07/09 05:20:39 Loading config files=[config/docker.yaml]
{"level":"info","ts":"2021-07-09T05:20:39.619Z","msg":"Updated dynamic config","logging-call-at":"file_based_client.go:235"}
{"level":"info","ts":"2021-07-09T05:20:39.619Z","msg":"Starting server for services","value":["worker"],"logging-call-at":"server.go:117"}
{"level":"info","ts":"2021-07-09T05:20:39.624Z","msg":"Get dynamic config","name":"system.advancedVisibilityWritingMode","value":"off","default-value":"off","logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T05:20:39.713Z","msg":"PProf listen on ","port":7936,"logging-call-at":"pprof.go:73"}
{"level":"info","ts":"2021-07-09T05:20:39.739Z","msg":"Get dynamic config","name":"frontend.validSearchAttributes","value":{},"default-value":{},"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T05:20:39.774Z","msg":"Get dynamic config","name":"worker.throttledLogRPS","value":20,"default-value":20,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T05:20:39.774Z","msg":"Created gRPC listener","service":"worker","address":"0.0.0.0:7239","logging-call-at":"rpc.go:135"}
{"level":"info","ts":"2021-07-09T05:20:39.774Z","msg":"Get dynamic config","name":"worker.persistenceGlobalMaxQPS","value":0,"default-value":0,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T05:20:39.774Z","msg":"Get dynamic config","name":"worker.persistenceMaxQPS","value":500,"default-value":500,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T05:20:39.803Z","msg":"worker starting","service":"worker","component":"worker","logging-call-at":"service.go:160"}
{"level":"info","ts":"2021-07-09T05:20:39.804Z","msg":"RuntimeMetricsReporter started","service":"worker","logging-call-at":"runtime.go:154"}
{"level":"info","ts":"2021-07-09T05:20:39.817Z","msg":"Membership heartbeat upserted successfully","service":"worker","address":"11.32.104.6","port":6939,"hostId":"665bf752-e075-11eb-a4de-c23e58f1f794","logging-call-at":"rpMonitor.go:222"}
{"level":"info","ts":"2021-07-09T05:20:39.826Z","msg":"bootstrap hosts fetched","service":"worker","bootstrap-hostports":"11.32.104.6:6939,11.32.104.5:6935,11.32.104.3:6934,11.32.104.4:6933","logging-call-at":"rpMonitor.go:263"}
{"level":"info","ts":"2021-07-09T05:20:39.832Z","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"worker","addresses":["11.32.104.6:7239"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-07-09T05:20:39.833Z","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"frontend","addresses":["11.32.104.4:7233"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-07-09T05:20:39.833Z","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"history","addresses":["11.32.104.3:7234"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-07-09T05:20:39.833Z","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"matching","addresses":["11.32.104.5:7235"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-07-09T05:20:39.849Z","msg":"Service resources started","service":"worker","address":"11.32.104.6:7239","logging-call-at":"resourceImpl.go:396"}
{"level":"info","ts":"2021-07-09T05:20:39.857Z","msg":"Get dynamic config","name":"worker.executionsScannerEnabled","value":false,"default-value":false,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T05:20:39.857Z","msg":"Get dynamic config","name":"worker.taskQueueScannerEnabled","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T05:20:40.173Z","msg":"Started Worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"12@temporal-worker-57b4557498-9gh26@","logging-call-at":"scanner.go:139"}
{"level":"info","ts":"2021-07-09T05:20:40.173Z","msg":"Get dynamic config","name":"worker.enableBatcher","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T05:20:40.189Z","msg":"Started Worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-batcher-taskqueue","WorkerID":"12@temporal-worker-57b4557498-9gh26@","logging-call-at":"batcher.go:94"}
{"level":"info","ts":"2021-07-09T05:20:40.189Z","msg":"Get dynamic config","name":"system.enableParentClosePolicyWorker","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T05:20:40.208Z","msg":"Started Worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-processor-parent-close-policy","WorkerID":"12@temporal-worker-57b4557498-9gh26@","logging-call-at":"processor.go:86"}
{"level":"info","ts":"2021-07-09T05:20:40.238Z","msg":"Started Worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-add-search-attributes-task-queue","WorkerID":"12@temporal-worker-57b4557498-9gh26@","logging-call-at":"addsearchattributes.go:85"}
{"level":"info","ts":"2021-07-09T05:20:40.239Z","msg":"worker started","service":"worker","component":"worker","logging-call-at":"service.go:182"}
{"level":"info","ts":"2021-07-09T05:20:43.908Z","msg":"temporal-sys-tq-scanner-workflow workflow successfully started","service":"worker","logging-call-at":"scanner.go:186"}

~ $ kubectl logs --previous -n temporal temporal-worker-57b4557498-9gh26
2021/07/09 04:55:39 Loading config; env=docker,zone=,configDir=config
2021/07/09 04:55:39 Loading config files=[config/docker.yaml]
{"level":"info","ts":"2021-07-09T04:55:39.621Z","msg":"Updated dynamic config","logging-call-at":"file_based_client.go:235"}
{"level":"info","ts":"2021-07-09T04:55:39.621Z","msg":"Starting server for services","value":["worker"],"logging-call-at":"server.go:117"}
{"level":"info","ts":"2021-07-09T04:55:39.626Z","msg":"Get dynamic config","name":"system.advancedVisibilityWritingMode","value":"off","default-value":"off","logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T04:55:39.715Z","msg":"PProf listen on ","port":7936,"logging-call-at":"pprof.go:73"}
{"level":"info","ts":"2021-07-09T04:55:39.743Z","msg":"Get dynamic config","name":"frontend.validSearchAttributes","value":{},"default-value":{},"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T04:55:39.780Z","msg":"Get dynamic config","name":"worker.throttledLogRPS","value":20,"default-value":20,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T04:55:39.781Z","msg":"Created gRPC listener","service":"worker","address":"0.0.0.0:7239","logging-call-at":"rpc.go:135"}
{"level":"info","ts":"2021-07-09T04:55:39.781Z","msg":"Get dynamic config","name":"worker.persistenceGlobalMaxQPS","value":0,"default-value":0,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T04:55:39.781Z","msg":"Get dynamic config","name":"worker.persistenceMaxQPS","value":500,"default-value":500,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T04:55:39.810Z","msg":"worker starting","service":"worker","component":"worker","logging-call-at":"service.go:160"}
{"level":"info","ts":"2021-07-09T04:55:39.810Z","msg":"RuntimeMetricsReporter started","service":"worker","logging-call-at":"runtime.go:154"}
{"level":"info","ts":"2021-07-09T04:55:39.826Z","msg":"Membership heartbeat upserted successfully","service":"worker","address":"11.32.104.6","port":6939,"hostId":"e84aff55-e071-11eb-a37e-c23e58f1f794","logging-call-at":"rpMonitor.go:222"}
{"level":"info","ts":"2021-07-09T04:55:39.836Z","msg":"bootstrap hosts fetched","service":"worker","bootstrap-hostports":"11.32.104.5:6935,11.32.104.3:6934,11.32.104.4:6933,11.32.104.6:6939","logging-call-at":"rpMonitor.go:263"}
{"level":"info","ts":"2021-07-09T04:55:39.842Z","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"worker","addresses":["11.32.104.6:7239"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-07-09T04:55:39.843Z","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"frontend","addresses":["11.32.104.4:7233"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-07-09T04:55:39.843Z","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"history","addresses":["11.32.104.3:7234"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-07-09T04:55:39.843Z","msg":"Current reachable members","service":"worker","component":"service-resolver","service":"matching","addresses":["11.32.104.5:7235"],"logging-call-at":"rpServiceResolver.go:266"}
{"level":"info","ts":"2021-07-09T04:55:39.858Z","msg":"Service resources started","service":"worker","address":"11.32.104.6:7239","logging-call-at":"resourceImpl.go:396"}
{"level":"info","ts":"2021-07-09T04:55:39.865Z","msg":"Get dynamic config","name":"worker.executionsScannerEnabled","value":false,"default-value":false,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T04:55:39.865Z","msg":"Get dynamic config","name":"worker.taskQueueScannerEnabled","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T04:55:40.162Z","msg":"Started Worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"13@temporal-worker-57b4557498-9gh26@","logging-call-at":"scanner.go:139"}
{"level":"info","ts":"2021-07-09T04:55:40.162Z","msg":"Get dynamic config","name":"worker.enableBatcher","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T04:55:40.184Z","msg":"Started Worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-batcher-taskqueue","WorkerID":"13@temporal-worker-57b4557498-9gh26@","logging-call-at":"batcher.go:94"}
{"level":"info","ts":"2021-07-09T04:55:40.184Z","msg":"Get dynamic config","name":"system.enableParentClosePolicyWorker","value":true,"default-value":true,"logging-call-at":"config.go:79"}
{"level":"info","ts":"2021-07-09T04:55:40.201Z","msg":"Started Worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-processor-parent-close-policy","WorkerID":"13@temporal-worker-57b4557498-9gh26@","logging-call-at":"processor.go:86"}
{"level":"info","ts":"2021-07-09T04:55:40.224Z","msg":"Started Worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-add-search-attributes-task-queue","WorkerID":"13@temporal-worker-57b4557498-9gh26@","logging-call-at":"addsearchattributes.go:85"}
{"level":"info","ts":"2021-07-09T04:55:40.224Z","msg":"worker started","service":"worker","component":"worker","logging-call-at":"service.go:182"}
{"level":"info","ts":"2021-07-09T04:55:43.910Z","msg":"temporal-sys-tq-scanner-workflow workflow successfully started","service":"worker","logging-call-at":"scanner.go:186"}

We did not see this behavior in Helm deployments.
I compared the manifests and noticed that the worker Deployment in the Helm chart does not contain a liveness probe.
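
(A quick way to double-check this on a Helm-based install; the deployment name depends on the release name, so adjust accordingly. No output means no liveness probe is defined:)

~ $ kubectl get deploy -n temporal temporal-worker -o yaml | grep -i -A6 livenessprobe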

Is it intended that the worker pod does not perform any liveness checks?

My guess is that the worker does not expose an HTTP liveness endpoint by default for Kubernetes to call.
In my deployment I create an HTTP endpoint specifically for the Kubernetes checks.
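
Something along these lines, for example, where the path and port are placeholders for whatever health handler you serve yourself (the stock Temporal server image does not expose one):

livenessProbe:
  httpGet:
    path: /health         # placeholder: an endpoint served by your own health handler
    port: 8080            # placeholder port, not part of the stock image
  initialDelaySeconds: 30
  periodSeconds: 10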