Unexpected health check behavior when connecting to multiple clusters

I’m working on code that manages connections to multiple Temporal clusters, and I only see this issue when connecting to two or more clusters. The health check works fine when connecting to a single cluster.
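
For context, the supplier hands out facades in priority order and returns the first one whose health check passes, roughly like this (a simplified sketch with made-up class and method bodies, not my actual code; WorkflowFacade.healthCheck() is the getClusterInfo check shown further down):

    import java.util.List;
    import java.util.Optional;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    // Assumed shape of the facade: healthCheck() is the getClusterInfo-based check shown below.
    interface WorkflowFacade {
        CompletableFuture<Boolean> healthCheck();
        // ... plus the actual workflow client methods
    }

    // Simplified sketch of the supplier: facades are ordered by priority (primary first)
    // and the first one whose health check succeeds is returned.
    class WorkflowFacadeSupplier {
        private final List<WorkflowFacade> facades;

        WorkflowFacadeSupplier(List<WorkflowFacade> facades) {
            this.facades = facades;
        }

        Optional<WorkflowFacade> getHealthyFacade() {
            for (WorkflowFacade facade : facades) {
                try {
                    if (facade.healthCheck().get(2, TimeUnit.SECONDS)) {
                        return Optional.of(facade);
                    }
                } catch (Exception e) {
                    // any failure or timeout counts as unhealthy; try the next cluster
                }
            }
            return Optional.empty();
        }
    }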

Unexpected test behavior, using local temporalite clusters:

  1. Start cluster, call health check, returns healthy :white_check_mark:
  2. Turn off cluster, call health check, returns unhealthy :white_check_mark:
  3. Turn the cluster back on, wait 5-90 seconds (the result is the same no matter how long I wait), call health check, returns unhealthy :x:
  4. Leave the cluster on, call health check a second (or subsequent) time, returns healthy :white_check_mark:

Test code + comments/instructions, run against 2 local temporalite clusters (ports 7233, 7234):

    WorkflowFacade primary = supplier.getHealthyFacade().orElseThrow();
    // Turn off the primary cluster
    WorkflowFacade fallback = supplier.getHealthyFacade().orElseThrow();
    assertNotEquals(primary, fallback);
    // Turn the primary cluster on again
    // another ends up == fallback, but it should be primary, because the health check on primary failed
    WorkflowFacade another = supplier.getHealthyFacade().orElseThrow();
    // This assertion fails
    assertEquals(primary, another);

Error from the health check that ran on the WorkflowFacade another = ... line (step 3 above):

    Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
        at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.46.0.jar:1.46.0]
    ...
    Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:7233
    Caused by: java.net.ConnectException: Connection refused

Here’s my health check code:

    return ListenableFuturesExtra.toCompletableFuture(
                getService().futureStub()
                    .getClusterInfo(GetClusterInfoRequest.newBuilder().build()))
            .thenApplyAsync(GetClusterInfoResponse::isInitialized)
            .orTimeout(2000, TimeUnit.MILLISECONDS);
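
Since a second call always succeeds (step 4 above), an obvious stopgap is to retry the check once before declaring a cluster unhealthy. A minimal sketch of that idea, with healthCheck() standing for the snippet above (illustrative only, not something I am actually running):

    import java.util.concurrent.CompletableFuture;
    import java.util.function.Supplier;

    // Illustrative only: run the health check, and if it fails or reports unhealthy,
    // run it one more time before treating the cluster as unhealthy.
    static CompletableFuture<Boolean> withOneRetry(Supplier<CompletableFuture<Boolean>> healthCheck) {
        return healthCheck.get()
                .exceptionally(t -> false) // first attempt failed, fall through to the retry
                .thenCompose(healthy -> healthy
                        ? CompletableFuture.completedFuture(true)
                        : healthCheck.get().exceptionally(t -> false));
    }

This would be wired up as withOneRetry(this::healthCheck), but I'd rather understand why the first call fails at all.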

Any ideas why after the cluster is turned off and back on, the first health check call returns unhealthy?

> Unexpected test behavior, using local temporalite clusters

Are you using temporalite from the temporal repo?

> Any ideas why after the cluster is turned off and back on, the first health check call returns unhealthy?

Tried to reproduce and was getting

    DEADLINE_EXCEEDED: deadline exceeded after 4.960510536s.

for about 10 seconds before the health check returned success.
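
For reference, one way to watch that window is to poll the health check in a loop along these lines (sketch only; the facade and healthCheck() names are carried over from the snippets above):

    import java.time.Duration;
    import java.util.concurrent.TimeUnit;

    // Sketch: poll the health check until the restarted cluster reports healthy,
    // logging roughly how long that takes.
    static void awaitHealthy(WorkflowFacade facade) throws InterruptedException {
        long start = System.nanoTime();
        while (true) {
            try {
                if (facade.healthCheck().get(6, TimeUnit.SECONDS)) {
                    System.out.println("healthy after "
                            + Duration.ofNanos(System.nanoTime() - start).toSeconds() + "s");
                    return;
                }
            } catch (Exception e) {
                System.out.println("still unhealthy: " + e.getMessage());
            }
            Thread.sleep(1000);
        }
    }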

Temporalite is experimental and this could be a bug. Do you mind opening an issue and posting your full error trace? Thanks.

I had been using the DataDog repo, which I cannot find anymore. So I moved over to the temporal repo you linked and reproduced the issue.

Done! The first getClusterInfo fails after restarting a temporalite cluster in a multi cluster environment · Issue #1347 · temporalio/sdk-java · GitHub