Worker Service Pod Crashed

Hi,

I am currently using temporal-server v1.12.0.

I have deployed the worker service in Kubernetes as an individual pod, and it keeps cycling between Running and Crashing. What is the issue here?

Here is the error:

{"level":"debug","ts":"2021-09-07T12:16:52.304Z","msg":"Membership heartbeat upserted successfully","service":"worker","address":"100.127.29.165","port":6939,"hostId":"7bc42600-0fd5-11ec-82ea-a230519320c5","logging-call-at":"rpMonitor.go:163"}
{"level":"fatal","ts":"2021-09-07T12:17:02.349Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:255","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:151\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:255\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:175\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:221"}

Are you able to see anything in your server logs?
Try looking for a fatal error log that starts with "error starting scanner" (a quick way to grep for it is sketched below).
Do you get this error after a longer period of worker inactivity? If so, then this thread might be helpful.
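To pull that out of the pod quickly, something like this should work (a sketch; the pod name is a placeholder, and --previous shows logs from the last crashed container):

kubectl logs <worker-pod> --previous | grep "error starting scanner"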

I am still getting the same issue after adding two fields in development.yml.

{"level":"fatal","ts":"2021-09-08T10:00:19.843Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:255","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:151\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:255\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:175\ngo.temporal.io/server/temporal.(*Server).Start.func1\n\t/temporal/temporal/server.go:221"}

See the consolidated dynamicconfig/development.yml below.

What is missing here?

frontend.enableClientVersionCheck:
  - value: true
    constraints: {}
history.persistenceMaxQPS:
  - value: 3000
    constraints: {}
frontend.persistenceMaxQPS:
  - value: 3000
    constraints: {}
frontend.historyMgrNumConns:
  - value: 10
    constraints: {}
frontend.throttledLogRPS:
  - value: 20
    constraints: {}
frontend.keepAliveMaxConnectionAge:
  - value: 5m
frontend.keepAliveMaxConnectionAgeGrace:
  - value: 70s
history.historyMgrNumConns:
  - value: 50
    constraints: {}
history.defaultActivityRetryPolicy:
  - value:
      InitialIntervalInSeconds: 1
      MaximumIntervalCoefficient: 100.0
      BackoffCoefficient: 2.0
      MaximumAttempts: 0
history.defaultWorkflowRetryPolicy:
  - value:
      InitialIntervalInSeconds: 1
      MaximumIntervalCoefficient: 100.0
      BackoffCoefficient: 2.0
      MaximumAttempts: 0
system:
  minRetentionDays: 365
system.advancedVisibilityWritingMode:
  - value: "off"
    constraints: {}
system.enableReadVisibilityFromES:
  - value: true
    constraints: {}

Checked with the server team, and they asked if you can
confirm the connectivity between your worker pod and wherever you have the frontend service deployed.
One path to this error is the worker service being unable to validate the existence of its internal namespace with the frontend service (via gRPC); a connectivity check is sketched below.
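A quick way to test that connectivity from inside the worker pod (a sketch; the pod name and frontend address are placeholders, and it assumes getent and nc are available in the container image):

kubectl exec -it <worker-pod> -- sh
getent hosts <frontend-address>     # does the frontend name resolve?
nc -zv <frontend-address> 7233      # does the frontend gRPC port accept connections?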

Yes, the frontend service is Running successfully.

[screenshot: pod status showing the worker pod restarting]

You can see the worker service crashes every 5 minutes and then returns to Running.
During each crash I checked the logs below.

I have looked at service/worker/scanner/scanner.go.
I increased the value to 7 minutes, but the change doesn't take effect and the pod still crashes every 5 minutes. I am not sure what the root cause is. Please let me know.

I see at least two different error messages in your logs. context deadline exceeded comes from the "SDK worker", which runs inside the "worker service" (sorry for the overloaded "worker" term), and from what I see in the code, the only place where it can be returned from worker.Start() is when the SDK checks for the namespace. So clearly, the worker service doesn't have access to the frontend. I didn't follow the stack trace of the second error, but it seems that startWorkflow fails for the same reason.

To double-check this, you can shell into the worker service container and run tctl cluster health. If it gives you an error, then you need to check your k8s setup.

The worker service doesn't look healthy.

@alex - can you please let me know what the reason could be?

@alex, @tihomir
I went through other posts about configuring the publicClient. I have configured it, but it doesn't work.

publicClient:
  hostPort: "server-asyncworkflow-local.apps.mt-d2.carl.gkp.net7233"

Worker pod health

bash-4.2$ tctl cluster health
Error: Unable to get "temporal.api.workflowservice.v1.WorkflowService" health check status.
Error Details: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:7233: connect: connection refused"
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)

What would be the right way to fix this issue?

tctl doesn't read the config file. Try:

tctl --address server-asyncworkflow-local.apps.mt-d2.carl.gkp.net:7233 cluster health

If it works, then you need to set the same value for publicClient.hostPort in the config file of your worker service (you missed a : in the code snippet above).
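That is, the corrected snippet based on your address above:

publicClient:
  hostPort: "server-asyncworkflow-local.apps.mt-d2.carl.gkp.net:7233"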

Sorry, that was a typo (the missing :).

tctl cluster health works only from the frontend service pod.
The rest of the service pods don't report healthy even though their status is Running (and the worker pod crashes as mentioned above).

Even the frontend pod doesn't report healthy when I pass the address.

$ kubectl exec -it frontend-84b9b86577-ztk6k -c frontend -- bash
bash-4.2$ tctl --address server-asyncworkflow-local.apps.mt-d2.carl.gkp.net:7233 cluster health
Error: Unable to get "temporal.api.workflowservice.v1.WorkflowService" health check status.
Error Details: rpc error: code = DeadlineExceeded desc = context deadline exceeded
('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)
bash-4.2$ tctl cluster health
temporal.api.workflowservice.v1.WorkflowService: SERVING

The other services don't have to be directly accessible from the worker service. The worker "talks" only to the frontend. I am not a network pro, but apparently server-asyncworkflow-local.apps.mt-d2.carl.gkp.net is not accessible from either the worker or the frontend itself. I think this should work:

$ kubectl exec -it frontend-84b9b86577-ztk6k -c frontend -- bash
bash-4.2$ tctl cluster health

tctl talks to localhost by default, and if you run it on the frontend itself, it should be reachable. This is just to check that the health API on the frontend is working properly. To access it from the worker service, you need to figure out your network/deployment topology and the right DNS name that the worker should use. One way to find it is sketched below.
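For a typical in-cluster deployment, that DNS name is the frontend Kubernetes Service name, qualified with its namespace if the worker runs in a different one. A sketch with placeholder names:

kubectl get svc -n <namespace>    # find the frontend Service name
tctl --address <frontend-service>:7233 cluster health
tctl --address <frontend-service>.<namespace>.svc.cluster.local:7233 cluster health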

It seems that by default the Temporal health check is configured for the frontend service, is that true?

If yes, then we can't perform health checks for the other services.

So, how do we enable health checks for the other three services?

History and matching also expose health check endpoints, but tctl doesn't check them. You need some gRPC tool that can call them (see the sketch below). An AWS LB also supports gRPC health checks. The worker doesn't have one.
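For example, with a generic prober like grpc-health-probe (a sketch, not verified against your deployment: the ports are the usual defaults for history and matching, and the health-check service names are assumptions that can vary by server version):

grpc_health_probe -addr=<history-host>:7234 -service=temporal.server.api.historyservice.v1.HistoryService
grpc_health_probe -addr=<matching-host>:7235 -service=temporal.server.api.matchingservice.v1.MatchingService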

How do you plan to use health check endpoints?

The issue has been resolved.
My application is deployed in k8s. I was using the FQDN as hostPort instead of the frontend service name (DNS).

publicClient:
  hostPort: "frontend:7233"

Now I can get into each service pod and check its health, which reports SERVING.

tctl --address frontend:7233 cluster health
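To circle back to the earlier question about how to use the health check endpoints: one common pattern is wiring this same check into a Kubernetes readiness probe. A minimal sketch, assuming grpc_health_probe is baked into the frontend image (the timings are illustrative):

readinessProbe:
  exec:
    command:
      - /bin/grpc_health_probe
      - -addr=:7233
      - -service=temporal.api.workflowservice.v1.WorkflowService
  initialDelaySeconds: 10
  periodSeconds: 15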
