io.grpc.StatusRuntimeException: UNKNOWN: shard closed

Whenever we run a performance test, we observe the error below.

Questions

  • Could you please help us understand the cause of this error?

  • What should we scale to fix this?

Unable to start the workflow:\nio.temporal.client.WorkflowServiceException: workflowId=TEST', runId='', workflowType='InstanceWorkflow'}\n\tat io.temporal.internal.sync.WorkflowStubImpl.wrapStartException(WorkflowStubImpl.java:184)\n\tat io.temporal.internal.sync.WorkflowStubImpl.startWithOptions(WorkflowStubImpl.java:120)\n\tat io.temporal.internal.sync.WorkflowStubImpl.start(WorkflowStubImpl.java:138)\n\tat io.temporal.internal.sync.WorkflowInvocationHandler$StartWorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:242)\n\tat io.temporal.internal.sync.WorkflowInvocationHandler.invoke(WorkflowInvocationHandler.java:178)\n\tat com.sun.proxy.$Proxy176.execute(Unknown Source)\n\tat io.temporal.internal.sync.WorkflowClientInternal.lambda$start$4ed02937$1(WorkflowClientInternal.java:308)\n\tat io.temporal.internal.sync.WorkflowClientInternal.start(WorkflowClientInternal.java:256)\n\tat io.temporal.internal.sync.WorkflowClientInternal.start(WorkflowClientInternal.java:299)\n\tat io.temporal.internal.sync.WorkflowClientInternal.start(WorkflowClientInternal.java:308)\n\tat io.temporal.client.WorkflowClient.start(WorkflowClient.java:382)\n\tat com..coordinator.workflow.IAPVWorkflowExecutor.executeWorkflow(IAPVWorkflowExecutor.java:38)\n\tat com.coordinator.service.workflow.WorkflowStartServiceImpl.start(WorkflowStartServiceImpl.java:30)\n\tat com..ECWorkflowServiceGrpc$MethodHandlers.invoke(ECWorkflowServiceGrpc.java:217)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)\n\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\n\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\n\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\n\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\n\tat 
io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:340)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\nCaused by: **io.grpc.StatusRuntimeException: UNKNOWN: shard closed** \n\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)\n\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)\n\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)\n\tat io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.startWorkflowExecution(WorkflowServiceGrpc.java:2631)\n\tat io.temporal.internal.external.GenericWorkflowClientExternalImpl.lambda$start$0(GenericWorkflowClientExternalImpl.java:88)\n\tat io.temporal.internal.retryer.GrpcSyncRetryer.retry(GrpcSyncRetryer.java:61)\n\tat io.temporal.internal.retryer.GrpcRetryer.retryWithResult(GrpcRetryer.java:51)\n\tat io.temporal.internal.external.GenericWorkflowClientExternalImpl.start(GenericWorkflowClientExternalImpl.java:81)\n\tat io.temporal.internal.client.RootWorkflowClientInvoker.start(RootWorkflowClientInvoker.java:55)\n\tat io.temporal.internal.sync.WorkflowStubImpl.startWithOptions(WorkflowStubImpl.java:113)\n\t... 24 common frames omitted","Kubernetes.namespace":"test” timestamp":"2022-01-14T21:49:32.546Z","@version":"1","s_sourcetype":"bifrost"}

@tihomir @maxim

Any help here?

This error is returned to the client if a request reaches a history node while that node is shutting down (during a deployment or redeployment, for instance). This error should not be exposed to the client; the request should simply be retried.

What server version are you using? I believe this was fixed in 1.14.0 via this commit.


We are on 1.12.0

Makes sense; we have seen our History node being recycled frequently during load testing.

Is the intent of retrying to route the request to a live history node?

Yes, or to keep retrying until the deployed/redeployed node is up and running.
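For anyone still on a server version older than 1.14.0, a client-side retry around the start call is one possible stopgap. The sketch below is only an illustration of that idea, not something taken from the Temporal SDK: the class `ShardClosedRetry`, its method names, the backoff values, and the choice to match on the "shard closed" message text are all assumptions. It uses only the Java standard library so it can be tried in isolation.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of the Temporal SDK): retries a call when the
// failure looks like the transient "shard closed" error from this thread.
final class ShardClosedRetry {
    private ShardClosedRetry() {}

    // Walks the cause chain and reports whether any message mentions "shard closed".
    // Matching on message text is an assumption made for this sketch.
    static boolean isShardClosed(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (msg != null && msg.contains("shard closed")) {
                return true;
            }
        }
        return false;
    }

    // Invokes the call up to maxAttempts times, doubling the wait between attempts.
    // Any failure that is not "shard closed" is rethrown immediately.
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (!isShardClosed(e) || attempt >= maxAttempts) {
                    throw e;
                }
            }
            try {
                Thread.sleep(backoff);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to retry", ie);
            }
            // Cap the wait at 10 seconds (an arbitrary value chosen for this sketch).
            backoff = Math.min(backoff * 2, 10_000L);
        }
    }
}
```

In the application from the stack trace, the call to `WorkflowClient.start(...)` could be wrapped in the supplier passed to `callWithRetry`. It may be necessary to create a fresh workflow stub for each attempt; that has not been verified here. Upgrading to 1.14.0 or later remains the suggested fix.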