Unexpected health check behavior when connecting to multiple clusters

I’m working on code that manages connections to multiple Temporal clusters, and I only see this issue when connecting to two or more clusters. The health check works fine when connecting to a single cluster.
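
For context, the supplier hands out facades in priority order and returns the first one whose health check passes, roughly like this (a simplified sketch with made-up class and method bodies, not my actual code; WorkflowFacade.healthCheck() is the getClusterInfo check shown further down):

    import java.util.List;
    import java.util.Optional;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    // Assumed shape of the facade: healthCheck() is the getClusterInfo-based check shown below.
    interface WorkflowFacade {
        CompletableFuture<Boolean> healthCheck();
        // ... plus the actual workflow client methods
    }

    // Simplified sketch of the supplier: facades are ordered by priority (primary first)
    // and the first one whose health check succeeds is returned.
    class WorkflowFacadeSupplier {
        private final List<WorkflowFacade> facades;

        WorkflowFacadeSupplier(List<WorkflowFacade> facades) {
            this.facades = facades;
        }

        Optional<WorkflowFacade> getHealthyFacade() {
            for (WorkflowFacade facade : facades) {
                try {
                    if (facade.healthCheck().get(2, TimeUnit.SECONDS)) {
                        return Optional.of(facade);
                    }
                } catch (Exception e) {
                    // any failure or timeout counts as unhealthy; try the next cluster
                }
            }
            return Optional.empty();
        }
    }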

Unexpected test behavior, using local temporalite clusters:

  1. Start cluster, call health check, returns healthy :white_check_mark:
  2. Turn off cluster, call health check, returns unhealthy :white_check_mark:
  3. Turn the cluster back on, wait 5-90 seconds (the result is the same no matter how long I wait), call health check, returns unhealthy :x:
  4. Leave the cluster on, call health check a second (or subsequent) time, returns healthy :white_check_mark:

Test code + comments/instructions, run against 2 local temporalite clusters (ports 7233, 7234):

    WorkflowFacade primary = supplier.getHealthyFacade().orElseThrow();
    // Turn off the primary cluster
    WorkflowFacade fallback = supplier.getHealthyFacade().orElseThrow();
    assertNotEquals(primary, fallback);
    // Turn the primary cluster on again
    // another ends up == fallback, but it should be primary, because the health check on primary failed
    WorkflowFacade another = supplier.getHealthyFacade().orElseThrow();
    // This assertion fails
    assertEquals(primary, another);

Error from the health check that ran on the WorkflowFacade another = ... line (step 3 above):

    Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
        at io.grpc.Status.asRuntimeException(Status.java:535) ~[grpc-api-1.46.0.jar:1.46.0]
    ...
    Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:7233
    Caused by: java.net.ConnectException: Connection refused

Here’s my health check code:

    return ListenableFuturesExtra.toCompletableFuture(
                getService().futureStub()
                    .getClusterInfo(GetClusterInfoRequest.newBuilder().build()))
            .thenApplyAsync(GetClusterInfoResponse::isInitialized)
            .orTimeout(2000, TimeUnit.MILLISECONDS);
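
Since a second call always succeeds (step 4 above), an obvious stopgap is to retry the check once before declaring a cluster unhealthy. A minimal sketch of that idea, with healthCheck() standing for the snippet above (illustrative only, not something I am actually running):

    import java.util.concurrent.CompletableFuture;
    import java.util.function.Supplier;

    // Illustrative only: run the health check, and if it fails or reports unhealthy,
    // run it one more time before treating the cluster as unhealthy.
    static CompletableFuture<Boolean> withOneRetry(Supplier<CompletableFuture<Boolean>> healthCheck) {
        return healthCheck.get()
                .exceptionally(t -> false) // first attempt failed, fall through to the retry
                .thenCompose(healthy -> healthy
                        ? CompletableFuture.completedFuture(true)
                        : healthCheck.get().exceptionally(t -> false));
    }

This would be wired up as withOneRetry(this::healthCheck), but I'd rather understand why the first call fails at all.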

Any ideas why after the cluster is turned off and back on, the first health check call returns unhealthy?

> Unexpected test behavior, using local temporalite clusters

Are you using temporalite from the temporal repo?

> Any ideas why after the cluster is turned off and back on, the first health check call returns unhealthy?

Tried to reproduce and was getting

    DEADLINE_EXCEEDED: deadline exceeded after 4.960510536s.

for about 10 seconds before the health check returned success.
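
For reference, one way to watch that window is to poll the health check in a loop along these lines (sketch only; the facade and healthCheck() names are carried over from the snippets above):

    import java.time.Duration;
    import java.util.concurrent.TimeUnit;

    // Sketch: poll the health check until the restarted cluster reports healthy,
    // logging roughly how long that takes.
    static void awaitHealthy(WorkflowFacade facade) throws InterruptedException {
        long start = System.nanoTime();
        while (true) {
            try {
                if (facade.healthCheck().get(6, TimeUnit.SECONDS)) {
                    System.out.println("healthy after "
                            + Duration.ofNanos(System.nanoTime() - start).toSeconds() + "s");
                    return;
                }
            } catch (Exception e) {
                System.out.println("still unhealthy: " + e.getMessage());
            }
            Thread.sleep(1000);
        }
    }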

Temporalite is experimental and this could be a bug. Do you mind opening an issue and posting your full error trace? Thanks.

I had been using the DataDog repo, which I cannot find anymore. So I moved over to the temporal repo you linked and reproduced the issue.

Done! The first getClusterInfo fails after restarting a temporalite cluster in a multi cluster environment · Issue #1347 · temporalio/sdk-java · GitHub