I’m working on some code that manages connections to multiple Temporal clusters, and I only see this issue when I’m connecting to two or more clusters. The health check works fine when connecting to a single cluster.
Unexpected test behavior, using local Temporalite clusters:
- Start cluster, call health check, returns healthy
- Turn off cluster, call health check, returns unhealthy
- Turn on cluster, wait 5-90 seconds (same result no matter how long I wait), call health check, returns unhealthy
- Leave cluster on, call health check a second (or any later) time, returns healthy
Test code with comments/instructions, run against 2 local Temporalite clusters (ports 7233, 7234):
WorkflowFacade primary = supplier.getHealthyFacade().orElseThrow();
// Turn off primary cluster
WorkflowFacade fallback = supplier.getHealthyFacade().orElseThrow();
assertNotEquals(primary, fallback);
// Turn the primary cluster on again
// Expected: another == primary. Actual: another == fallback, because the health check on primary failed.
WorkflowFacade another = supplier.getHealthyFacade().orElseThrow();
// This assertion fails
assertEquals(primary, another);
Error from the health check that ran on the line WorkflowFacade another = ..., which is step 3:
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.46.0.jar:1.46.0]
...
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:7233
Caused by: java.net.ConnectException: Connection refused
Here’s my health check code:
return ListenableFuturesExtra.toCompletableFuture(
        getService().futureStub()
                .getClusterInfo(GetClusterInfoRequest.newBuilder().build()))
        .thenApplyAsync(GetClusterInfoResponse::isInitialized)
        .orTimeout(2000, TimeUnit.MILLISECONDS);
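For anyone trying to reproduce this without a cluster, here is a minimal JDK-only sketch of how a future like the one above could be reduced to a healthy/unhealthy flag. It is not the actual caller code: `HealthCheckSketch`, `isHealthy`, and the simulated futures are hypothetical stand-ins, and it assumes that any failure or timeout is treated as unhealthy.

```java
import java.net.ConnectException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class HealthCheckSketch {

    // Hypothetical helper: any exception (e.g. connection refused) or a
    // timeout is mapped to "unhealthy" (false); otherwise the future's own
    // result is returned.
    static boolean isHealthy(CompletableFuture<Boolean> check, long timeoutMillis) {
        return check
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(t -> false)
                .join();
    }

    public static void main(String[] args) {
        // Simulated outcomes standing in for the real gRPC call (assumption).
        CompletableFuture<Boolean> up = CompletableFuture.completedFuture(true);
        CompletableFuture<Boolean> refused =
                CompletableFuture.failedFuture(new ConnectException("Connection refused"));
        CompletableFuture<Boolean> hung = new CompletableFuture<>(); // never completes

        System.out.println("up: " + isHealthy(up, 2000));
        System.out.println("refused: " + isHealthy(refused, 2000));
        System.out.println("hung: " + isHealthy(hung, 100));
    }
}
```

Under this assumption, a single failed call such as the UNAVAILABLE error in the stack trace above is enough for one check to report unhealthy, even if the next call succeeds.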
Any ideas why the first health check call returns unhealthy after the cluster is turned off and back on?