Continuous frontend connection timeout

Hi all,

We recently deployed a new version of our Temporal worker and started a couple of workflows to run overnight. This morning we found that they had not executed, and the worker was hitting endless errors like this:

22:31:48.639 [Workflow Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 2] ERROR io.temporal.internal.worker.Poller - Failure in thread Workflow Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 2
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollWorkflowTaskQueue(WorkflowServiceGrpc.java:2639)
	at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:81)
	at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:37)
	at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:265)
	at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:241)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.grpc.netty.shaded.io.netty.channel.ConnectTimeoutException: connection timed out: temporalio-frontend-headless.temporalio-prod.svc/100.96.52.10:7233
	at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:575)
	at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 common frames omitted
22:31:48.639 [Activity Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 4] ERROR io.temporal.internal.worker.Poller - Failure in thread Activity Poller taskQueue="singer-activity-task-list", namespace="singer-activity-namespace": 4
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollActivityTaskQueue(WorkflowServiceGrpc.java:2683)
	at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:105)
	at io.temporal.internal.worker.ActivityPollTask.poll(ActivityPollTask.java:39)
	at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:265)
	at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:241)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.grpc.netty.shaded.io.netty.channel.ConnectTimeoutException: connection timed out: temporalio-frontend-headless.temporalio-prod.svc/100.96.52.10:7233
	at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:575)
	at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 common frames omitted
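
For context, the worker's connection to the frontend is configured roughly like the sketch below (simplified; the class name and the commented-out registrations are illustrative, but the target, namespace, and task queue match the stack traces above):

import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class SingerWorker {
  public static void main(String[] args) {
    // gRPC target is the frontend headless service seen in the ConnectTimeoutException above
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance(
        WorkflowServiceStubsOptions.newBuilder()
            .setTarget("temporalio-frontend-headless.temporalio-prod.svc:7233")
            .build());

    WorkflowClient client = WorkflowClient.newInstance(
        service,
        WorkflowClientOptions.newBuilder()
            .setNamespace("singer-activity-namespace")
            .build());

    WorkerFactory factory = WorkerFactory.newInstance(client);
    Worker worker = factory.newWorker("singer-activity-task-list");
    // worker.registerWorkflowImplementationTypes(...);   // registrations omitted here
    // worker.registerActivitiesImplementations(...);
    factory.start();
  }
}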

The closest thing we could find to a culprit is this error in our MySQL pod:

{"level":"error","ts":"2021-03-23T00:00:00.736Z","msg":"Operation failed with internal error.","service":"worker","error":"Error 1040: Too many connections","metric-scope":39,"logging-call-at":"persistenceMetricClients.go:804","stacktrace":"go.temporal.io/server/common/log/loggerimpl.(*loggerImpl).Error\n\t/temporal/common/log/loggerimpl/logger.go:138\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/persistenceMetricClients.go:804\ngo.temporal.io/server/common/persistence.(*taskPersistenceClient).ListTaskQueue\n\t/temporal/common/persistence/persistenceMetricClients.go:763\ngo.temporal.io/server/service/worker/scanner/taskqueue.(*Scavenger).listTaskQueue.func1\n\t/temporal/service/worker/scanner/taskqueue/db.go:74\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/worker/scanner/taskqueue.(*Scavenger).retryForever\n\t/temporal/service/worker/scanner/taskqueue/db.go:101\ngo.temporal.io/server/service/worker/scanner/taskqueue.(*Scavenger).listTaskQueue\n\t/temporal/service/worker/scanner/taskqueue/db.go:73\ngo.temporal.io/server/service/worker/scanner/taskqueue.(*Scavenger).run\n\t/temporal/service/worker/scanner/taskqueue/scavenger.go:153"}

I noticed in another thread that a similar issue turned out to be caused by unconfigured connection settings in the MySQL deployment YAML, but ours is set to 30 max connections in production. Thanks for taking a look, and I hope you can shed some light on this :slightly_smiling_face:

Try decreasing the number of connections on the server side; 30 is pretty large.