Workflow or Activity polling fails due to a gRPC error - Not enough hosts to serve the request

Hi,

The worker service in our application fails to come up properly; it fails while polling the activity task queue and the workflow task queue with gRPC errors:

We are using Java temporal-sdk version 1.0.7 with Temporal server version 1.23.1. We had to upgrade the server version to fix a few security vulnerabilities and couldn't upgrade the SDK yet.

20-05-2024 20:57:46.271 UTC+0000 ERROR [temporalworker,] [Activity Poller taskQueue="ActivityQueue", namespace="default": 4] ERROR io.temporal.internal.worker.Poller.lambda$new$0 - Failure in thread Activity Poller taskQueue="ActivityQueue", namespace="default": 4
io.grpc.StatusRuntimeException: UNAVAILABLE: Not enough hosts to serve the request
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollActivityTaskQueue(WorkflowServiceGrpc.java:2683)
at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:105)
at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:39)
at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:265)
at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:241)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)

20-05-2024 20:57:46.320 UTC+0000 ERROR [temporalworker,] [Workflow Poller taskQueue="WorkflowQueue", namespace="default": 1] ERROR io.temporal.internal.worker.Poller.lambda$new$0 - Failure in thread Workflow Poller taskQueue="WorkflowQueue", namespace="default": 1
io.grpc.StatusRuntimeException: UNAVAILABLE: Not enough hosts to serve the request
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollWorkflowTaskQueue(WorkflowServiceGrpc.java:2639)
at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:81)
at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:37)
at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:265)
at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:241)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)

Is there a way to add this as a readiness probe? We currently check health using the command below.

tctl --address temporal-front-end-service-name-here:7233 cluster health
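
Programmatically, the equivalent check from our Java worker would look roughly like the sketch below (a sketch only: it assumes plaintext gRPC, reuses the placeholder frontend host from the command above, and relies on the standard gRPC health service that the Temporal frontend exposes under temporal.api.workflowservice.v1.WorkflowService):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthCheckResponse;
import io.grpc.health.v1.HealthGrpc;

public class FrontendHealthCheck {

  // Returns true when the Temporal frontend reports SERVING, false otherwise.
  static boolean isFrontendServing(String host, int port) {
    // Plain gRPC channel to the frontend; add TLS/auth here if your deployment uses it.
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress(host, port).usePlaintext().build();
    try {
      HealthCheckResponse response =
          HealthGrpc.newBlockingStub(channel)
              .check(
                  HealthCheckRequest.newBuilder()
                      .setService("temporal.api.workflowservice.v1.WorkflowService")
                      .build());
      return response.getStatus() == HealthCheckResponse.ServingStatus.SERVING;
    } catch (StatusRuntimeException e) {
      return false; // UNAVAILABLE etc. -> treat as not ready
    } finally {
      channel.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // Placeholder address taken from the tctl command above.
    System.out.println(isFrontendServing("temporal-front-end-service-name-here", 7233));
  }
}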

Note that this issue is not always seen; we see it when we try to install multiple applications on a Kubernetes cluster used as a test environment. One out of 8 applications fails to come up properly with this error.
We started noticing this error only recently, during a software stack version upgrade that included a gRPC library version upgrade and the Temporal server upgrade.

UNAVAILABLE: Not enough hosts to serve the request

Yes, this means at least one of the matching/history services is not available (not in the membership ring), i.e. cluster instability.

Your command just checks the frontend service's health; for matching/history you could use, for example:

grpc-health-probe -addr=<addrs>:7235 -service=temporal.api.workflowservice.v1.MatchingService

grpc-health-probe -addr=<addrs>:7234 -service=temporal.api.workflowservice.v1.HistoryService

(Note: 7235/7234 are the default matching/history ports; if you changed them in the config, use those instead.)
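
If you want to do the same gating from inside the Java worker rather than with grpc-health-probe, a rough sketch could look like the following (assumptions, not a definitive implementation: plaintext gRPC, the default ports above, the service names exactly as in the commands above, and hypothetical per-service host names that you'd replace with your own):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthCheckResponse;
import io.grpc.health.v1.HealthGrpc;

public class TemporalReadinessGate {

  // host, port, health-check service name per Temporal service.
  // Hosts are hypothetical placeholders; ports and service names are taken from
  // the commands above.
  private static final String[][] TARGETS = {
    {"temporal-front-end-service-name-here", "7233", "temporal.api.workflowservice.v1.WorkflowService"},
    {"temporal-matching-service-name-here", "7235", "temporal.api.workflowservice.v1.MatchingService"},
    {"temporal-history-service-name-here", "7234", "temporal.api.workflowservice.v1.HistoryService"},
  };

  // Blocks until every service reports SERVING; call this before starting workers.
  public static void awaitAllServing() throws InterruptedException {
    for (String[] target : TARGETS) {
      while (!isServing(target[0], Integer.parseInt(target[1]), target[2])) {
        Thread.sleep(5_000); // back off and retry until the service is reachable
      }
    }
  }

  private static boolean isServing(String host, int port, String service) {
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress(host, port).usePlaintext().build();
    try {
      HealthCheckResponse response =
          HealthGrpc.newBlockingStub(channel)
              .check(HealthCheckRequest.newBuilder().setService(service).build());
      return response.getStatus() == HealthCheckResponse.ServingStatus.SERVING;
    } catch (StatusRuntimeException e) {
      return false; // not reachable or not registered yet -> not ready
    } finally {
      channel.shutdownNow();
    }
  }
}

Calling awaitAllServing() before WorkerFactory.start() would keep the pollers from being created until all three services report healthy, which is roughly what the pod-level readiness probes do.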

I have added both health checks, and my application (pod) now waits until all 3 health checks pass - frontend, matching, and history. But I still see the same issue; the logs are below.
Application pod logs:

[2024-05-24T19:56:21.209Z] 24-05-2024 19:53:35.667 ERROR [temporalworker,] [Activity Poller taskQueue="WorkflowQueue", namespace="default": 4] ERROR io.temporal.internal.worker.Poller.lambda$new$0 - Failure in thread Activity Poller taskQueue="WorkflowQueue", namespace="default": 4
[2024-05-24T19:56:21.209Z] io.grpc.StatusRuntimeException: UNAVAILABLE: Not enough hosts to serve the request
[2024-05-24T19:56:21.209Z] at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
[2024-05-24T19:56:21.209Z] at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
[2024-05-24T19:56:21.209Z] at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
[2024-05-24T19:56:21.209Z] at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollActivityTaskQueue(WorkflowServiceGrpc.java:2683)
[2024-05-24T19:56:21.209Z] at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:105)
[2024-05-24T19:56:21.209Z] at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:39)
[2024-05-24T19:56:21.209Z] at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:265)
[2024-05-24T19:56:21.209Z] at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:241)
[2024-05-24T19:56:21.209Z] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2024-05-24T19:56:21.209Z] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2024-05-24T19:56:21.209Z] at java.base/java.lang.Thread.run(Thread.java:840)

Frontend pod logs:

{
“level”: “error”,
“ts”: “2024-05-24T19:54:52.831Z”,
“msg”: “Unable to call matching.PollWorkflowTaskQueue.”,
“service”: “frontend”,
“wf-task-queue-name”: “tnps-pr-277-9-temporal-worker-58fc674895-jxx84:f87dbf7e-ccab-4241-931d-ec25c306901f”,
“timeout”: “1m9.999443418s”,
“error”: “Not enough hosts to serve the request”,
“logging-call-at”: “workflow_handler.go:590”,
“stacktrace”: “go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue\n\t/home/runner/work/docker-builds/docker-builds/temporal/service/frontend/workflow_handler.go:590\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1\n\t/home/runner/go/pkg/mod/go.temporal.io/api@v1.29.2/workflowservice/v1/service_grpc.pb.go:1690\ngo.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept.func1\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/retry.go:63\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:143\ngo.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/retry.go:67\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc/interceptor.(*CallerInfoInterceptor).Intercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/caller_info.go:80\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc/interceptor.(*SDKVersionInterceptor).Intercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/sdk_version.go:69\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/rate_limit.go:88\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/namespace_rate_limit.go:93\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc/interceptor.(*ConcurrentRequestLimitInterceptor).Intercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/concurrent_request_limit.go:121\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).StateValidationIntercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/namespace_validator.go:196\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/telemetry.go:165\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/service/frontend.(*RedirectionInterceptor).handleRedirectAPIInvocation.func2\n\t/home/runner/work/docker-builds/docker-builds/temporal/service/frontend/redirection_interceptor.go:242\n
go.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/home/runner/work/docker-builds/docker-builds/temporal/service/frontend/dc_redirection_policy.go:125\ngo.temporal.io/server/service/frontend.(*RedirectionInterceptor).handleRedirectAPIInvocation\n\t/home/runner/work/docker-builds/docker-builds/temporal/service/frontend/redirection_interceptor.go:239\ngo.temporal.io/server/service/frontend.(*RedirectionInterceptor).Intercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/service/frontend/redirection_interceptor.go:199\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/authorization/interceptor.go:162\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/service/frontend.GrpcServerOptionsProvider.NewServerMetricsContextInjectorInterceptor.func1\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/home/runner/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.51.0/interceptor.go:315\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/namespace_logger.go:84\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).NamespaceValidateIntercept\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/namespace_validator.go:113\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1186\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/grpc.go:145\ngoogle.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1177\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler\n\t/home/runner/go/pkg/mod/go.temporal.io/api@v1.29.2/workflowservice/v1/service_grpc.pb.go:1692\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1369\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1780\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/server.go:1019”
}

History pod logs:

[2024-05-24T19:56:18.058Z] {"level":"info","ts":"2024-05-24T19:52:31.234Z","msg":"Current reachable members","component":"service-resolver","service":"frontend","addresses":["10.244.204.8:7233"],"logging-call-at":"service_resolver.go:275"}
[2024-05-24T19:56:18.058Z] {"level":"info","ts":"2024-05-24T19:52:34.658Z","msg":"Current reachable members","component":"service-resolver","service":"worker","addresses":["10.244.156.155:7239"],"logging-call-at":"service_resolver.go:275"}
[2024-05-24T19:56:18.058Z] {"level":"info","ts":"2024-05-24T19:54:00.591Z","msg":"matching client encountered error","service":"history","error":"Not enough hosts to serve the request","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:219"}
{
“level”: “error”,
“ts”: “2024-05-24T19:54:00.591Z”,
“msg”: “Fail to process task”,
“shard-id”: 403,
“address”: “10.244.204.29:7234”,
“component”: “transfer-queue-processor”,
“wf-namespace-id”: “49816662-1b07-4751-abdb-a498634eb3ba”,
“wf-id”: “dd46c429-8e7e-11ee-89c2-a864f1c26900”,
“wf-run-id”: “2b88a9a5-e79d-4ed2-be30-1e72d0b01fd7”,
“queue-task-id”: 1048583,
“queue-task-visibility-timestamp”: “2024-05-24T19:54:00.546Z”,
“queue-task-type”: “TransferWorkflowTask”,
“queue-task”: {
“NamespaceID”: “49816662-1b07-4751-abdb-a498634eb3ba”,
“WorkflowID”: “dd46c429-8e7e-11ee-89c2-a864f1c26900”,
“RunID”: “2b88a9a5-e79d-4ed2-be30-1e72d0b01fd7”,
“VisibilityTimestamp”: “2024-05-24T19:54:00.546537479Z”,
“TaskID”: 1048583,
“TaskQueue”: “WorkflowQueue”,
“ScheduledEventID”: 2,
“Version”: 0
},
“wf-history-event-id”: 2,
“error”: “Not enough hosts to serve the request”,
“unexpected-error-attempts”: 1,
“lifecycle”: “ProcessingFailed”,
“logging-call-at”: “lazy_logger.go:68”,
“stacktrace”: “go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/log.(*lazyLogger).Error\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/log/lazy_logger.go:68\ngo.temporal.io/server/service/history/queues.(*executableImpl).HandleErr\n\t/home/runner/work/docker-builds/docker-builds/temporal/service/history/queues/executable.go:421\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[…]).executeTask.func1\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:224\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:117\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:143\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:118\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[…]).executeTask\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:233\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[…]).processTask\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:211”
}

But I noticed one difference in the matching pod logs when things are fine vs. when they are not:

the bootstrap host ports are different.

Working scenario matching pod logs:

[2024-05-24T19:54:03.106Z] {"level":"info","ts":"2024-05-24T19:50:17.248Z","msg":"bootstrap hosts fetched","bootstrap-hostports":"10.244.204.4:6934,10.244.204.15:6935,10.244.156.132:6933","logging-call-at":"monitor.go:348"}
[2024-05-24T19:54:03.106Z] {"level":"info","ts":"2024-05-24T19:50:17.250Z","msg":"Current reachable members","component":"service-resolver","service":"history","addresses":["10.244.204.4:7234"],"logging-call-at":"service_resolver.go:275"}
[2024-05-24T19:54:03.106Z] {"level":"info","ts":"2024-05-24T19:50:17.250Z","msg":"Current reachable members","component":"service-resolver","service":"matching","addresses":["10.244.204.15:7235"],"logging-call-at":"service_resolver.go:275"}
[2024-05-24T19:54:03.106Z] {"level":"info","ts":"2024-05-24T19:50:17.250Z","msg":"Current reachable members","component":"service-resolver","service":"frontend","addresses":["10.244.156.132:7233"],"logging-call-at":"service_resolver.go:275"}
[2024-05-24T19:54:03.106Z] {"level":"info","ts":"2024-05-24T19:50:23.177Z","msg":"Current reachable members","component":"service-resolver","service":"worker","addresses":["10.244.204.51:7239"],"logging-call-at":"service_resolver.go:275"}

Failing scenario matching pod logs:

[2024-05-24T19:56:18.623Z] {"level":"info","ts":"2024-05-24T19:54:01.385Z","msg":"bootstrap hosts fetched","bootstrap-hostports":"10.244.125.19:6935","logging-call-at":"monitor.go:348"}
[2024-05-24T19:56:18.623Z] {"level":"info","ts":"2024-05-24T19:54:01.387Z","msg":"Current reachable members","component":"service-resolver","service":"matching","addresses":["10.244.125.19:7235"],"logging-call-at":"service_resolver.go:275"}

[Update]
Please ignore this difference. I have noticed the same when it is failing as well.

In one case, the application pod and the frontend pod had errors but matching and history didn't have any. Restarting the frontend pod seemed to resolve the issue.

In another case, restarting the matching pod resolved the issue.

Also note that our Kubernetes cluster can host up to 40 applications, and each application has its own Temporal deployment packaged along with it. All pods and services have their own unique names.

We hadn't seen these errors until a few weeks ago. We recently upgraded the Temporal server from 1.9.2 to 1.23.1, and we have been seeing these errors since then.

Most of the time, restarting the matching, frontend, or history pod seemed to solve the problem.

We are unable to determine whether resource load is causing the Temporal pods to come up prematurely and fail all requests.

We would like to know if there are any readiness, liveness, or health checks we should introduce inside the Temporal deployments.