Hi all,
Our Temporal worker, deployed in Kubernetes, has been humming along, but for some reason last night it started continuously receiving errors like this:
```
18:56:10.136 [Activity Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 5] ERROR io.temporal.internal.worker.Poller - Failure in thread Activity Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 5
io.grpc.StatusRuntimeException: UNAVAILABLE: Unable to resolve host temporalio-frontend-headless.temporalio-dev.svc
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollActivityTaskQueue(WorkflowServiceGrpc.java:2683)
	at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:105)
	at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:39)
	at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:265)
	at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:241)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.net.UnknownHostException: temporalio-frontend-headless.temporalio-dev.svc: Try again
	at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
	at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
	at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
	... 3 common frames omitted
Caused by: java.net.UnknownHostException: temporalio-frontend-headless.temporalio-dev.svc: Try again
	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
	at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:631)
	at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:219)
	... 5 common frames omitted
```
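The innermost cause is the JVM's own hostname lookup (`InetAddress.getAllByName`) failing, not anything Temporal-specific. A minimal, standalone way to exercise just that lookup from the worker's image would be something like the following (this `DnsCheck` class is only an illustration, not part of our code):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsCheck {
    public static void main(String[] args) {
        // Same hostname the gRPC DnsNameResolver is trying to resolve in the trace above.
        String host = "temporalio-frontend-headless.temporalio-dev.svc";
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                System.out.println("Resolved: " + address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // This is the same UnknownHostException the poller threads are reporting.
            System.out.println("Lookup failed: " + e.getMessage());
        }
    }
}
```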
However, despite these errors, the worker is still able to create workflows and communicate with them. I can't understand why this suddenly started happening, since no pods went down before these errors appeared.
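For context, the worker's connection to the frontend is set up along these lines (a simplified sketch, not our exact code: the hostname, namespace, and task queue are the ones from the logs above, and 7233 is just the default frontend port, shown for illustration):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class WorkerMain {
    public static void main(String[] args) {
        // Hostname matches the one in the error; 7233 is assumed here
        // (the default Temporal frontend port).
        WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(
                WorkflowServiceStubsOptions.newBuilder()
                        .setTarget("temporalio-frontend-headless.temporalio-dev.svc:7233")
                        .build());

        WorkflowClient client = WorkflowClient.newInstance(
                service,
                WorkflowClientOptions.newBuilder()
                        .setNamespace("singer-activity-namespace")
                        .build());

        WorkerFactory factory = WorkerFactory.newInstance(client);
        Worker worker = factory.newWorker("singer-activity-task-list");
        // Workflow and activity implementations are registered here in the real code:
        // worker.registerWorkflowImplementationTypes(...);
        // worker.registerActivitiesImplementations(...);
        factory.start();
    }
}
```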
Here are logs from the frontend service, which was timing out:
{"level":"error","ts":"2021-04-26T12:18:18.017Z","msg":"Operation failed with internal error.","service":"frontend","error":"GetMetadata operation failed. Error: dial tcp: i/o timeout","metric-scope":52,"logging-call-at":"persistenceMetricClients.go:932","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:932\ngo.temporal.io/server/common/persistence.(*metadataPersistenceClient).GetMetadata\n\t/temporal/common/persistence/persistenceMetricClients.go:910\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshNamespacesLocked\n\t/temporal/common/cache/namespaceCache.go:435\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshNamespaces\n\t/temporal/common/cache/namespaceCache.go:425\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:409"} {"level":"error","ts":"2021-04-26T12:18:18.017Z","msg":"Error refreshing namespace cache","service":"frontend","error":"GetMetadata operation failed. Error: dial tcp: i/o timeout","logging-call-at":"namespaceCache.go:414","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:414"} {"level":"error","ts":"2021-04-26T12:18:20.914Z","msg":"Membership upsert failed.","service":"frontend","error":"UpsertClusterMembership operation failed. Error: dial tcp: i/o timeout","logging-call-at":"rpMonitor.go:276","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/membership.(*ringpopMonitor).startHeartbeatUpsertLoop.func1\n\t/temporal/common/membership/rpMonitor.go:276"}
And the history service:
{"level":"error","ts":"2021-04-26T10:50:08.172Z","msg":"Operation failed with internal error.","service":"history","error":"GetTransferTasks operation failed. Select failed. Error: dial tcp: i/o timeout","metric-scope":15,"shard-id":144,"logging-call-at":"persistenceMetricClients.go:676","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:676\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).GetTransferTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:391\ngo.temporal.io/server/service/history.(*transferQueueProcessorBase).readTasks\n\t/temporal/service/history/transferQueueProcessorBase.go:81\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:106\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:110\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:262\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:233"} {"level":"error","ts":"2021-04-26T10:50:08.346Z","msg":"Operation failed with internal error.","service":"history","error":"GetVisibilityTasks operation failed. Select failed. Error: dial tcp: i/o timeout","metric-scope":19,"shard-id":242,"logging-call-at":"persistenceMetricClients.go:676","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:676\ngo.temporal.io/server/common/persistence.(*workflowExecutionPersistenceClient).GetVisibilityTasks\n\t/temporal/common/persistence/persistenceMetricClients.go:419\ngo.temporal.io/server/service/history.(*visibilityQueueProcessorImpl).readTasks\n\t/temporal/service/history/visibilityQueueProcessor.go:320\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/temporal/service/history/queueAckMgr.go:106\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/temporal/service/history/queueAckMgr.go:110\ngo.temporal.io/server/service/history.(*queueProcessorBase).processBatch\n\t/temporal/service/history/queueProcessor.go:262\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/temporal/service/history/queueProcessor.go:233"}
We are using version 1.7.1 of the Java SDK and could really use some help understanding how to fix these issues. Thanks!