Temporal cluster always seems to be out of resources, yet always reports healthy

Background to my question:

Once in a while, when I try to run a workflow, Temporal gives me this error: “no hosts are available to serve the request”.

I’m really confused by this because my cluster seems quite healthy. If I run:

tctl --address temporal.b8s.biz:7233 cluster health 
> temporal.api.workflowservice.v1.WorkflowService: SERVING

If I look at my logs I see this:

temporal-worker

2022/06/06 00:01:42 INFO Fetching price quote Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ ActivityID 11 ActivityType Quote Attempt 1 WorkflowType ExecuteTrade WorkflowID 930f6df-52ad-42f1-bd6e-6a1d29671d3 RunID 5437c43-2686-456-9493-34110565ee6
2022/06/06 00:01:52 INFO Task processing failed with error Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ WorkerType ActivityWorker Error context deadline exceeded
2022/06/06 00:02:52 WARN Failed to poll for task. Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ WorkerType ActivityWorker Error context deadline exceeded
2022/06/06 00:04:02 WARN Failed to poll for task. Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ WorkerType ActivityWorker Error context deadline exceeded
2022/06/06 00:07:12 WARN Failed to poll for task. Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ WorkerType ActivityWorker Error context deadline exceeded
2022/06/06 00:09:21 WARN Failed to poll for task. Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ WorkerType ActivityWorker Error context deadline exceeded
2022/06/06 00:12:40 WARN Failed to poll for task. Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ WorkerType ActivityWorker Error context deadline exceeded
2022/06/06 00:13:51 WARN Failed to poll for task. Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ WorkerType ActivityWorker Error context deadline exceeded
2022/06/06 00:15:01 WARN Failed to poll for task. Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ WorkerType ActivityWorker Error context deadline exceeded
2022/06/06 00:16:12 WARN Failed to poll for task. Namespace default TaskQueue internal-executor WorkerID 1@internal-executor-d4695c6d9-vd2x7@ WorkerType ActivityWorker Error context deadline exceeded

temporal frontend

{"level":"error","ts":"2022-06-06T00:20:50.4237","msg":"Unable to call matching.PollWorkflowTaskQueue.","service":"frontend","wf-task-queue-name":"client-reporting","timeout":"9.999685407s","error":"context deadline exceeded","logging-call-at":"workflowHandler.go:808","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue\n\t/temporal/service/frontend/workflowHandler.go:808\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue.func2\n\t/temporal/service/frontend/dcRedirectionHandler.go:540\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/temporal/service/frontend/dcRedirectionPolicy.go:118\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue\n\t/temporal/service/frontend/dcRedirectionHandler.go:536\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.7.0/workflowservice/v1/service.pb.go:1140\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/temporal/common/authorization/interceptor.go:152\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1116\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_count_limit.go:98\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_rate_limit.go:88\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/rate_limit.go:83\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_validator.go:113\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:108\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1119\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1119\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:131\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1119\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/namespace_logger.go:84\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1119\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1121\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.7.0/workflowservice/v1/service.pb.go:1142\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1282\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:1616\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.42.0/server.go:921"}

I can clearly see something’s wrong: the temporal worker can’t poll for tasks, and the frontend has these “context deadline exceeded” errors. I have 2 workers and 2 frontends (Temporal installed via Helm). I don’t really have many tasks, and I can’t quite figure out where the ‘jam’ is coming from.

Is there a way to figure out why it’s doing this? Additionally, is there a way to find out how resource-constrained the workers/frontend are? And is there any reason it could be behaving this way even though I don’t have many tasks?

“no hosts are available to serve the request”.

Do you mean “Not enough hosts to serve the request”? If so, this comes from ringpop and means there is no host available for the requested server role. This can happen during server shutdown, but it can also mean deployment issues, for example all of your service X pods being down or restarting.
Also see this forum post, which reports a similar issue on a Kubernetes setup.
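To check for that kind of deployment issue, you could filter `kubectl get pods` output for pods that are not Running or have restarted. This is a minimal sketch; `check_pods` is a hypothetical helper, and the label selector in the usage comment is an assumption that depends on your Helm release:

```shell
# Hypothetical helper: read `kubectl get pods` style output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and print the names of
# pods that are not Running or that have restarted at least once.
check_pods() {
  awk 'NR > 1 && ($3 != "Running" || $4 > 0) { print $1 }'
}

# Typical use (the label selector is an assumption; adjust to your deployment):
# kubectl get pods -l app.kubernetes.io/name=temporal | check_pods
```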

tctl --address temporal.b8s.biz:7233 cluster health

tctl cluster health only health-checks the Temporal frontend service.
If you are deploying on Kubernetes, you can use grpc-health-probe to health check the frontend, matching, and history services, for example:

matching:
./grpc-health-probe -addr=localhost:7235 -service=temporal.api.workflowservice.v1.MatchingService

history:
./grpc-health-probe -addr=localhost:7234 -service=temporal.api.workflowservice.v1.HistoryService

frontend:
./grpc-health-probe -addr=localhost:7233 -service=temporal.api.workflowservice.v1.WorkflowService

(change the host:port to whatever you need to set it to)
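If you want Kubernetes to run this check for you continuously, the same probe can be wired into a readinessProbe. A minimal sketch for the matching service, assuming the grpc-health-probe binary is available in the container image at /bin/grpc-health-probe (the path, timings, and port are assumptions to adjust for your deployment):

```yaml
readinessProbe:
  exec:
    command:
      - /bin/grpc-health-probe
      - -addr=localhost:7235
      - -service=temporal.api.workflowservice.v1.MatchingService
  initialDelaySeconds: 10
  periodSeconds: 10
```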

Unable to call matching.PollWorkflowTaskQueue.

This is typically ignorable: it can happen when there are no tasks on the task queues your workers poll, and the frontend service times out the long-poll request.

I would start by looking at your cluster health to see whether pods are restarting or running into memory issues.
Another possibility is a configuration issue; are you using the Temporal Helm charts?
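To get a quick read on how resource-constrained the pods are (assuming metrics-server is installed so `kubectl top pods` works), you could filter its output for pods above a memory threshold. `high_memory_pods` is a hypothetical helper and the 512Mi default threshold is an arbitrary assumption:

```shell
# Hypothetical helper: read `kubectl top pods` style output on stdin
# (columns: NAME CPU(cores) MEMORY(bytes)) and print the names of pods
# whose memory usage exceeds the threshold in Mi (default 512).
high_memory_pods() {
  threshold="${1:-512}"
  awk -v t="$threshold" 'NR > 1 { gsub(/Mi/, "", $3); if ($3 + 0 > t + 0) print $1 }'
}

# Typical use:
# kubectl top pods | high_memory_pods 512
```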