Worker Service Pod Crashed

Hi,

I am currently using temporal-server v1.12.0.

I have deployed the worker service in Kubernetes as an individual pod, and it keeps cycling between Running and Crashing. What is the issue here?

Here is the error:

{"level":"debug","ts":"2021-09-07T12:16:52.304Z","msg":"Membership heartbeat upserted successfully","service":"worker","address":"100.127.29.165","port":6939,"hostId":"7bc42600-0fd5-11ec-82ea-a230519320c5","logging-call-at":"rpMonitor.go:163"}
{"level":"fatal","ts":"2021-09-07T12:17:02.349Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:255","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:151\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:255\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:175\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:221"}

Are you able to see anything in your server logs?
Try looking for a fatal error log that starts with "error starting scanner" (a quick way to grep for it is sketched below).
Do you get this error after a longer period of worker inactivity? If so, then this thread might be helpful.
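To pull that out of the pod quickly, something like this should work (a sketch; the pod name is a placeholder, and --previous shows logs from the last crashed container):

kubectl logs <worker-pod> --previous | grep "error starting scanner"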

I am still getting the same issue after adding two fields in development.yml.

{"level":"fatal","ts":"2021-09-08T10:00:19.843Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:255","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:151\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:255\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:175\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:221"}

See the consolidated dynamicconfig/development.yml below.

What is missing here?

frontend.enableClientVersionCheck:
  - value: true
    constraints: {}
history.persistenceMaxQPS:
  - value: 3000
    constraints: {}
frontend.persistenceMaxQPS:
  - value: 3000
    constraints: {}
frontend.historyMgrNumConns:
  - value: 10
    constraints: {}
frontend.throttledLogRPS:
  - value: 20
    constraints: {}
frontend.keepAliveMaxConnectionAge:
  - value: 5m
frontend.keepAliveMaxConnectionAgeGrace:
  - value: 70s
history.historyMgrNumConns:
  - value: 50
    constraints: {}
history.defaultActivityRetryPolicy:
  - value:
      InitialIntervalInSeconds: 1
      MaximumIntervalCoefficient: 100.0
      BackoffCoefficient: 2.0
      MaximumAttempts: 0
history.defaultWorkflowRetryPolicy:
  - value:
      InitialIntervalInSeconds: 1
      MaximumIntervalCoefficient: 100.0
      BackoffCoefficient: 2.0
      MaximumAttempts: 0
system:
  minRetentionDays: 365
system.advancedVisibilityWritingMode:
  - value: "off"
    constraints: {}
system.enableReadVisibilityFromES:
  - value: true
    constraints: {}

Checked with the server team, and they asked if you can
confirm the connectivity between your worker pod and wherever you have the frontend service deployed.
One path to this error is the worker service being unable to validate the existence of its internal namespace with the frontend service (via gRPC); a connectivity check is sketched below.
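A quick way to test that connectivity from inside the worker pod (a sketch; the pod name and frontend address are placeholders, and it assumes getent and nc are available in the container image):

kubectl exec -it <worker-pod> -- sh
getent hosts <frontend-address>     # does the frontend name resolve?
nc -zv <frontend-address> 7233      # does the frontend gRPC port accept connections?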

Yes, the frontend service is Running successfully.

[screenshot: pod status showing the worker pod restarting]

You can see the worker service crashes every 5 minutes and then returns to Running.
During each crash I checked the logs below.

I have looked at service/worker/scanner/scanner.go.
I increased the value to 7 minutes, but the change doesn't take effect and the pod still crashes every 5 minutes. I am not sure what the root cause is. Please let me know.

I see at least two different error messages in your logs. context deadline exceeded comes from the "SDK worker", which runs inside the "worker service" (sorry for the overloaded "worker" term), and from what I see in the code, the only place where it can be returned from worker.Start() is when the SDK checks for the namespace. So clearly, the worker service doesn't have access to the frontend. I didn't follow the stack trace of the second error, but it seems that startWorkflow fails for the same reason.

To double-check this, you can shell into the worker service container and run tctl cluster health. If it gives you an error, then you need to check your k8s setup.

The worker service doesn't look healthy.

@alex - can you please let me know what the reason could be?

@alex, @tihomir
I went through other posts about configuring the publicClient. I have configured it, but it doesn't work.

publicClient:
  hostPort: "server-asyncworkflow-local.apps.mt-d2.carl.gkp.net7233"

Worker pod health

bash-4.2$ tctl cluster health
Error: Unable to get "temporal.api.workflowservice.v1.WorkflowService" health check status.
Error Details: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:7233: connect: connection refused"
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

What would be the right way to fix this issue?

tctl doesn't read the config file. Try:

tctl --address server-asyncworkflow-local.apps.mt-d2.carl.gkp.net:7233 cluster health

If it works, then you need to set the same value for publicClient.hostPort in the config file of your worker service (you missed a : in the code snippet above).
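That is, the corrected snippet based on your address above:

publicClient:
  hostPort: "server-asyncworkflow-local.apps.mt-d2.carl.gkp.net:7233"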

Sorry, that was a typo (the missing :).

tctl cluster health works only from the frontend service pod.
The rest of the service pods don't report healthy even though their status is Running (and the worker pod crashes as mentioned above).

Even the frontend pod doesn't report healthy when I pass the address.

$ kubectl exec -it frontend-84b9b86577-ztk6k -c frontend -- bash
bash-4.2$ tctl --address server-asyncworkflow-local.apps.mt-d2.carl.gkp.net:7233 cluster health
Error: Unable to get "temporal.api.workflowservice.v1.WorkflowService" health check status.
Error Details: rpc error: code = DeadlineExceeded desc = context deadline exceeded
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)
bash-4.2$ tctl cluster health
temporal.api.workflowservice.v1.WorkflowService: SERVING

The other services don't have to be directly accessible from the worker service. The worker "talks" only to the frontend. I am not a network pro, but apparently server-asyncworkflow-local.apps.mt-d2.carl.gkp.net is not accessible from either the worker or the frontend itself. I think this should work:

$ kubectl exec -it frontend-84b9b86577-ztk6k -c frontend -- bash
bash-4.2$ tctl cluster health

tctl talks to localhost by default, and if you run it on the frontend itself, it should be reachable. This is just to check that the health API on the frontend is working properly. To access it from the worker service, you need to figure out your network/deployment topology and the right DNS name that the worker should use. One way to find it is sketched below.
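For a typical in-cluster deployment, that DNS name is the frontend Kubernetes Service name, qualified with its namespace if the worker runs in a different one. A sketch with placeholder names:

kubectl get svc -n <namespace>    # find the frontend Service name
tctl --address <frontend-service>:7233 cluster health
tctl --address <frontend-service>.<namespace>.svc.cluster.local:7233 cluster health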

It seems that by default the Temporal health check is configured for the frontend service, is that true?

If yes, then we can't perform health checks for the other services.

So, how do we enable health checks for the other three services?

History and matching also expose health check endpoints, but tctl doesn't check them. You need some gRPC tool that can call them (see the sketch below). An AWS LB also supports gRPC health checks. The worker doesn't have one.
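For example, with a generic prober like grpc-health-probe (a sketch, not verified against your deployment: the ports are the usual defaults for history and matching, and the health-check service names are assumptions that can vary by server version):

grpc_health_probe -addr=<history-host>:7234 -service=temporal.server.api.historyservice.v1.HistoryService
grpc_health_probe -addr=<matching-host>:7235 -service=temporal.server.api.matchingservice.v1.MatchingService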

How do you plan to use health check endpoints?

The issue has been resolved.
My application is deployed in k8s. I was using the FQDN as hostPort instead of the frontend service name (DNS).

publicClient:
  hostPort: "frontend:7233"

Now I can get into each service pod and check its health, which reports SERVING.

tctl --address frontend:7233 cluster health
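To circle back to the earlier question about how to use the health check endpoints: one common pattern is wiring this same check into a Kubernetes readiness probe. A minimal sketch, assuming grpc_health_probe is baked into the frontend image (the timings are illustrative):

readinessProbe:
  exec:
    command:
      - /bin/grpc_health_probe
      - -addr=:7233
      - -service=temporal.api.workflowservice.v1.WorkflowService
  initialDelaySeconds: 10
  periodSeconds: 15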
