Temporal worker frontend connection fails

Hi all,

Our Temporal worker deployed in Kubernetes has been humming along, but last night it suddenly started receiving errors like this continuously:

18:56:10.136 [Activity Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 5] ERROR io.temporal.internal.worker.Poller - Failure in thread Activity Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 5
io.grpc.StatusRuntimeException: UNAVAILABLE: Unable to resolve host temporalio-frontend-headless.temporalio-dev.svc
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollActivityTaskQueue(WorkflowServiceGrpc.java:2683)
    at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:105)
    at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:39)
    at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:265)
    at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:241)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.net.UnknownHostException: temporalio-frontend-headless.temporalio-dev.svc: Try again
    at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
    at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
    at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
    ... 3 common frames omitted
Caused by: java.net.UnknownHostException: temporalio-frontend-headless.temporalio-dev.svc: Try again
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:631)
    at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:219)
    ... 5 common frames omitted
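
For context, the worker's connection to the frontend is set up roughly like this. This is a simplified sketch of our setup, not the exact code; the target, namespace, and task queue match the logs above, the rest is illustrative:

// Simplified sketch of how the worker is wired to the frontend service.
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class WorkerMain {
  public static void main(String[] args) {
    // The gRPC target whose DNS name fails to resolve in the error above.
    WorkflowServiceStubsOptions stubsOptions =
        WorkflowServiceStubsOptions.newBuilder()
            .setTarget("temporalio-frontend-headless.temporalio-dev.svc:7233")
            .build();
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(stubsOptions);

    WorkflowClient client =
        WorkflowClient.newInstance(
            service,
            WorkflowClientOptions.newBuilder()
                .setNamespace("singer-activity-namespace")
                .build());

    WorkerFactory factory = WorkerFactory.newInstance(client);
    Worker worker = factory.newWorker("singer-activity-task-list");
    // worker.registerWorkflowImplementationTypes(...);
    // worker.registerActivitiesImplementations(...);
    factory.start();
  }
}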

However, despite these errors, the worker is still able to create workflows and communicate with them. I can't understand why this suddenly started happening, as no pods went down before the errors appeared.
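
To see whether the frontend name actually resolves from inside the worker pod, I have been running a quick throwaway diagnostic along these lines (the host is the one from the logs):

// Throwaway DNS probe run inside the worker pod to catch intermittent lookup failures.
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

public class DnsProbe {
  public static void main(String[] args) throws InterruptedException {
    String host = "temporalio-frontend-headless.temporalio-dev.svc";
    for (int i = 0; i < 10; i++) {
      try {
        InetAddress[] addresses = InetAddress.getAllByName(host);
        System.out.println("resolved: " + Arrays.toString(addresses));
      } catch (UnknownHostException e) {
        System.out.println("lookup failed: " + e.getMessage());
      }
      Thread.sleep(5_000); // repeat every 5s to see whether failures are intermittent
    }
  }
}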

Here are the logs from the frontend service, which was timing out:
{"level":"error","ts":"2021-04-26T12:18:18.017Z","msg":"Operation failed with internal error.","service":"frontend","error":"GetMetadata operation failed. Error: dial tcp: i/o timeout","metric-scope":52,"logging-call-at":"persistenceMetricClients.go:932","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:932\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).GetMetadata\n\t/temporal/common/persistence/persistenceMetricClients.go:910\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshNamespacesLocked\n\t/temporal/common/cache/namespaceCache.go:435\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshNamespaces\n\t/temporal/common/cache/namespaceCache.go:425\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:409"} {"level":"error","ts":"2021-04-26T12:18:18.017Z","msg":"Error refreshing namespace cache","service":"frontend","error":"GetMetadata operation failed. Error: dial tcp: i/o timeout","logging-call-at":"namespaceCache.go:414","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:414"} {"level":"error","ts":"2021-04-26T12:18:20.914Z","msg":"Membership upsert failed.","service":"frontend","error":"UpsertClusterMembership operation failed. Error: dial tcp: i/o timeout","logging-call-at":"rpMonitor.go:276","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/membership.(*ringpopMonitor).startHeartbeatUpsertLoop.func1\n\t/temporal/common/membership/rpMonitor.go:276"}

And the history service:
{"level":"error","ts":"2021-04-26T10:50:08.172Z","msg":"Operation failed with internal error.","service":"history","error":"GetTransferTasks operation failed. Select failed. Error: dial tcp: i/o timeout","metric-scope":15,"shard-id":144,"logging-call-at":"persistenceMetricClients.go:676","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:676\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).GetTransferTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:391\ngo.temporal.io/server/service/history.(*transferQueueProcessorBase).readTasks\n\t/temporal/service/history/transferQueueProcessorBase.go:81\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:106\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:110\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:262\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:233"} {"level":"error","ts":"2021-04-26T10:50:08.346Z","msg":"Operation failed with internal error.","service":"history","error":"GetVisibilityTasks operation failed. Select failed. Error: dial tcp: i/o timeout","metric-scope":19,"shard-id":242,"logging-call-at":"persistenceMetricClients.go:676","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:676\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).GetVisibilityTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:419\ngo.temporal.io/server/service/history.(*visibilityQueueProcessorImpl).readTasks\n\t/temporal/service/history/visibilityQueueProcessor.go:320\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:106\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:110\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:262\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:233"}

We are using version 1.7.1 of the Java SDK and could really use some help understanding how to fix these issues. Thanks!

Which DB backend are you using?
Can you check that the DB backend is still accessible?

Hello, we are using MySQL, and the database is still accessible. Even while these DNS errors keep happening, we are still able to run workflows, ping the services, and receive heartbeats, but naturally we're wondering how we can resolve this issue.
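
For what it's worth, here is the kind of round-trip check we used to confirm the frontend is still reachable from the worker while the DNS errors are flapping. This is just an illustrative check using describeNamespace as a convenient read-only call; the target and namespace are from our setup:

// Simple reachability check: one blocking call against the frontend service.
import io.temporal.api.workflowservice.v1.DescribeNamespaceRequest;
import io.temporal.api.workflowservice.v1.DescribeNamespaceResponse;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class FrontendCheck {
  public static void main(String[] args) {
    WorkflowServiceStubs service =
        WorkflowServiceStubs.newInstance(
            WorkflowServiceStubsOptions.newBuilder()
                .setTarget("temporalio-frontend-headless.temporalio-dev.svc:7233")
                .build());
    DescribeNamespaceResponse response =
        service.blockingStub()
            .describeNamespace(
                DescribeNamespaceRequest.newBuilder()
                    .setNamespace("singer-activity-namespace")
                    .build());
    System.out.println(
        "frontend reachable, namespace state: " + response.getNamespaceInfo().getState());
  }
}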

UNAVAILABLE: Unable to resolve host temporalio-frontend-headless.temporalio-dev.svc
Error: dial tcp: i/o timeout

Both errors are network related: the worker cannot resolve the frontend's DNS name, and the frontend and history services are timing out when dialing the database.
Are these errors still happening?
Can you try to improve the stability of your network?
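
If the DNS failures turn out to be transient blips rather than a hard outage, one thing you could try on the worker side is handing the SDK a custom gRPC channel with keep-alives and retries enabled, so short hiccups are more likely to be ridden out instead of surfacing as poller errors. This is an untested sketch, assuming plaintext and the target from your logs; adjust the values to your setup:

// Untested sketch: service stubs backed by a custom channel with keep-alive and retries.
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import java.util.concurrent.TimeUnit;

public class ResilientStubs {
  public static WorkflowServiceStubs create() {
    ManagedChannel channel =
        ManagedChannelBuilder
            .forTarget("temporalio-frontend-headless.temporalio-dev.svc:7233")
            .usePlaintext()
            .enableRetry()                        // retry transient gRPC failures
            .keepAliveTime(30, TimeUnit.SECONDS)  // keep established connections warm
            .keepAliveTimeout(10, TimeUnit.SECONDS)
            .build();
    return WorkflowServiceStubs.newInstance(
        WorkflowServiceStubsOptions.newBuilder()
            .setChannel(channel)
            .build());
  }
}

Note that this only helps with brief blips; if cluster DNS or the network itself is genuinely unstable, that needs to be fixed at the cluster level.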