Temporal worker frontend connection fails

Hi all,

Our Temporal worker deployed in Kubernetes has been humming along, but last night it suddenly started receiving errors like this continuously:

18:56:10.136 [Activity Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 5] ERROR io.temporal.internal.worker.Poller - Failure in thread Activity Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 5
io.grpc.StatusRuntimeException: UNAVAILABLE: Unable to resolve host temporalio-frontend-headless.temporalio-dev.svc
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollActivityTaskQueue(WorkflowServiceGrpc.java:2683)
    at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:105)
    at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:39)
    at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:265)
    at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:241)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.net.UnknownHostException: temporalio-frontend-headless.temporalio-dev.svc: Try again
    at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
    at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
    at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
    ... 3 common frames omitted
Caused by: java.net.UnknownHostException: temporalio-frontend-headless.temporalio-dev.svc: Try again
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:631)
    at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:219)
    ... 5 common frames omitted
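
For context, the worker's connection to the frontend is set up roughly like this. This is a simplified sketch of our setup, not the exact code; the target, namespace, and task queue match the logs above, the rest is illustrative:

// Simplified sketch of how the worker is wired to the frontend service.
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class WorkerMain {
  public static void main(String[] args) {
    // The gRPC target whose DNS name fails to resolve in the error above.
    WorkflowServiceStubsOptions stubsOptions =
        WorkflowServiceStubsOptions.newBuilder()
            .setTarget("temporalio-frontend-headless.temporalio-dev.svc:7233")
            .build();
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(stubsOptions);

    WorkflowClient client =
        WorkflowClient.newInstance(
            service,
            WorkflowClientOptions.newBuilder()
                .setNamespace("singer-activity-namespace")
                .build());

    WorkerFactory factory = WorkerFactory.newInstance(client);
    Worker worker = factory.newWorker("singer-activity-task-list");
    // worker.registerWorkflowImplementationTypes(...);
    // worker.registerActivitiesImplementations(...);
    factory.start();
  }
}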

However, despite these errors, the worker is still able to create workflows and communicate with them. I can't understand why this suddenly started happening, as no pods went down before the errors appeared.
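
To see whether the frontend name actually resolves from inside the worker pod, I have been running a quick throwaway diagnostic along these lines (the host is the one from the logs):

// Throwaway DNS probe run inside the worker pod to catch intermittent lookup failures.
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

public class DnsProbe {
  public static void main(String[] args) throws InterruptedException {
    String host = "temporalio-frontend-headless.temporalio-dev.svc";
    for (int i = 0; i < 10; i++) {
      try {
        InetAddress[] addresses = InetAddress.getAllByName(host);
        System.out.println("resolved: " + Arrays.toString(addresses));
      } catch (UnknownHostException e) {
        System.out.println("lookup failed: " + e.getMessage());
      }
      Thread.sleep(5_000); // repeat every 5s to see whether failures are intermittent
    }
  }
}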

Here are the logs from the frontend service, which was timing out:
{"level":"error","ts":"2021-04-26T12:18:18.017Z","msg":"Operation failed with internal error.","service":"frontend","error":"GetMetadata operation failed. Error: dial tcp: i/o timeout","metric-scope":52,"logging-call-at":"persistenceMetricClients.go:932","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:932\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).GetMetadata\n\t/temporal/common/persistence/persistenceMetricClients.go:910\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshNamespacesLocked\n\t/temporal/common/cache/namespaceCache.go:435\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshNamespaces\n\t/temporal/common/cache/namespaceCache.go:425\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:409"} {"level":"error","ts":"2021-04-26T12:18:18.017Z","msg":"Error refreshing namespace cache","service":"frontend","error":"GetMetadata operation failed. Error: dial tcp: i/o timeout","logging-call-at":"namespaceCache.go:414","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:414"} {"level":"error","ts":"2021-04-26T12:18:20.914Z","msg":"Membership upsert failed.","service":"frontend","error":"UpsertClusterMembership operation failed. Error: dial tcp: i/o timeout","logging-call-at":"rpMonitor.go:276","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/membership.(*ringpopMonitor).startHeartbeatUpsertLoop.func1\n\t/temporal/common/membership/rpMonitor.go:276"}

And the history service:
{"level":"error","ts":"2021-04-26T10:50:08.172Z","msg":"Operation failed with internal error.","service":"history","error":"GetTransferTasks operation failed. Select failed. Error: dial tcp: i/o timeout","metric-scope":15,"shard-id":144,"logging-call-at":"persistenceMetricClients.go:676","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:676\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).GetTransferTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:391\ngo.temporal.io/server/service/history.(*transferQueueProcessorBase).readTasks\n\t/temporal/service/history/transferQueueProcessorBase.go:81\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:106\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:110\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:262\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:233"} {"level":"error","ts":"2021-04-26T10:50:08.346Z","msg":"Operation failed with internal error.","service":"history","error":"GetVisibilityTasks operation failed. Select failed. Error: dial tcp: i/o timeout","metric-scope":19,"shard-id":242,"logging-call-at":"persistenceMetricClients.go:676","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:676\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).GetVisibilityTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:419\ngo.temporal.io/server/service/history.(*visibilityQueueProcessorImpl).readTasks\n\t/temporal/service/history/visibilityQueueProcessor.go:320\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:106\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:110\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:262\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:233"}

We are using version 1.7.1 of the Java SDK and could really use some help understanding how to fix these issues. Thanks!

Which DB backend are you using?
Can you check that the DB backend is still accessible?

Hello, we are using MySQL, and the database is still accessible. Even while these DNS errors keep happening, we are still able to run workflows, ping the services, and receive heartbeats, but naturally we're wondering how we can resolve this issue.
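
For what it's worth, here is the kind of round-trip check we used to confirm the frontend is still reachable from the worker while the DNS errors are flapping. This is just an illustrative check using describeNamespace as a convenient read-only call; the target and namespace are from our setup:

// Simple reachability check: one blocking call against the frontend service.
import io.temporal.api.workflowservice.v1.DescribeNamespaceRequest;
import io.temporal.api.workflowservice.v1.DescribeNamespaceResponse;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class FrontendCheck {
  public static void main(String[] args) {
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newInstance(
            WorkflowServiceStubsOptions.newBuilder()
                .setTarget("temporalio-frontend-headless.temporalio-dev.svc:7233")
                .build());
    DescribeNamespaceResponse response =
        service.blockingStub()
            .describeNamespace(
                DescribeNamespaceRequest.newBuilder()
                    .setNamespace("singer-activity-namespace")
                    .build());
    System.out.println(
        "frontend reachable, namespace state: " + response.getNamespaceInfo().getState());
  }
}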

UNAVAILABLE: Unable to resolve host temporalio-frontend-headless.temporalio-dev.svc
Error: dial tcp: i/o timeout

Both errors are network related: the worker cannot resolve the frontend's DNS name, and the frontend and history services are timing out when dialing the database.
Are these errors still happening?
Can you try to improve the stability of your network?
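
If the DNS failures turn out to be transient blips rather than a hard outage, one thing you could try on the worker side is handing the SDK a custom gRPC channel with keep-alives and retries enabled, so short hiccups are more likely to be ridden out instead of surfacing as poller errors. This is an untested sketch, assuming plaintext and the target from your logs; adjust the values to your setup:

// Untested sketch: service stubs backed by a custom channel with keep-alive and retries.
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import java.util.concurrent.TimeUnit;

public class ResilientStubs {
  public static WorkflowServiceStubs create() {
    ManagedChannel channel =
        ManagedChannelBuilder
            .forTarget("temporalio-frontend-headless.temporalio-dev.svc:7233")
            .usePlaintext()
            .enableRetry()                        // retry transient gRPC failures
            .keepAliveTime(30, TimeUnit.SECONDS)  // keep established connections warm
            .keepAliveTimeout(10, TimeUnit.SECONDS)
            .build();
    return WorkflowServiceStubs.newInstance(
        WorkflowServiceStubsOptions.newBuilder()
            .setChannel(channel)
            .build());
  }
}

Note that this only helps with brief blips; if cluster DNS or the network itself is genuinely unstable, that needs to be fixed at the cluster level.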