Failure to start workflow if WorkflowClient left idle

We have an endpoint that can be called to trigger Temporal workflows via the WorkflowClient. We have made sure that all requests use the same instance of the WorkflowClient.
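For reference, the client is created once and shared across requests, roughly like the sketch below (a minimal illustration only; MyWorkflow, my-task-queue and the trigger method are placeholder names rather than our actual code):

import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

@WorkflowInterface
interface MyWorkflow {
    @WorkflowMethod
    void run();
}

class WorkflowTrigger {

    // Single shared service stub and client, created once at startup and reused by every request.
    private static final WorkflowServiceStubs SERVICE = WorkflowServiceStubs.newInstance();
    private static final WorkflowClient CLIENT = WorkflowClient.newInstance(SERVICE);

    void trigger(String workflowId) {
        MyWorkflow workflow = CLIENT.newWorkflowStub(
                MyWorkflow.class,
                WorkflowOptions.newBuilder()
                        .setTaskQueue("my-task-queue")
                        .setWorkflowId(workflowId)
                        .build());
        // This call fails with the exception below once the client has been idle overnight.
        WorkflowClient.start(workflow::run);
    }
}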

We have found that if the endpoint has not received a request for a period of time (e.g. overnight), the next call fails with the following error:

Caused by: io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.startWorkflowExecution(WorkflowServiceGrpc.java:2627)
	at io.temporal.internal.external.GenericWorkflowClientExternalImpl.lambda$start$0(GenericWorkflowClientExternalImpl.java:88)
	at io.temporal.internal.retryer.GrpcSyncRetryer.retry(GrpcSyncRetryer.java:61)
	at io.temporal.internal.retryer.GrpcRetryer.retryWithResult(GrpcRetryer.java:51)
	at io.temporal.internal.external.GenericWorkflowClientExternalImpl.start(GenericWorkflowClientExternalImpl.java:81)
	at io.temporal.internal.client.RootWorkflowClientInvoker.start(RootWorkflowClientInvoker.java:55)
	at io.temporal.internal.sync.WorkflowStubImpl.startWithOptions(WorkflowStubImpl.java:113)
	... 39 more
Caused by: java.lang.IllegalStateException: nameResolver is not started
	at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
	at io.grpc.internal.ManagedChannelImpl.shutdownNameResolverAndLoadBalancer(ManagedChannelImpl.java:360)
	at io.grpc.internal.ManagedChannelImpl.enterIdleMode(ManagedChannelImpl.java:422)
	at io.grpc.internal.ManagedChannelImpl.access$900(ManagedChannelImpl.java:118)
	at io.grpc.internal.ManagedChannelImpl$IdleModeTimer.run(ManagedChannelImpl.java:352)
	at io.grpc.internal.Rescheduler$ChannelFutureRunnable.run(Rescheduler.java:103)
	at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
	at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
	at io.grpc.internal.Rescheduler$FutureRunnable.run(Rescheduler.java:80)
	at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 more

Based on the stack trace, it looks as though the underlying gRPC channel is entering an idle mode. The issue is non-recoverable, and the pod in Kubernetes must be restarted to resolve it.

Is this a known issue, and are there config settings that can be updated to prevent it from occurring?

There was an issue reported for this in the Java SDK here.
The original cause is a grpc-java issue here.

What is the Java SDK version you are using?
The temp fix was included in 1.6.0.

The JDK details are:

openjdk version "1.8.0_332"
OpenJDK Runtime Environment (build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (build 25.332-b09, mixed mode)

Thanks, I was asking for the Temporal Java SDK version you are using.

Sorry @tihomir, I realized that when I re-read your response. We are using 1.5.0 of the temporal-sdk. Sounds like we just need to upgrade.
