AWS EKS deployment, gRPC health check failing

Hi - I have a new deployment in AWS (EKS), with an ingress proxying gRPC traffic with a cert. I can reach the frontend service using grpcurl and grpc-health-probe, but the health check fails when I start a simple client connection.

c, err := client.NewClient(client.Options{
	HostPort: "fe.temporal.domain.com:443",
})
2021/10/27 22:35:37 http2: Framer 0xc0004e6380: wrote SETTINGS len=0
2021-10-27T22:35:37.291+0100	INFO	zap/logger.go:54	[core] Subchannel Connectivity change to TRANSIENT_FAILURE
2021-10-27T22:35:37.291+0100	INFO	zap/logger.go:54	[transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2021-10-27T22:35:37.291+0100	INFO	zap/logger.go:54	[balancer] base.baseBalancer: handle SubConn state change: 0xc00070e8a0, TRANSIENT_FAILURE	
2021-10-27T22:35:38.394+0100	INFO	zap/logger.go:54	[core] Channel Connectivity change to SHUTDOWN
2021-10-27T22:35:38.394+0100	INFO	zap/logger.go:54	[core] Subchannel Connectivity change to SHUTDOWN
2021/10/27 22:35:38 unable to create Temporal client health check error: last connection error: connection closed
exit status 1

The grpc-health-probe can reach the Temporal service…

grpc-health-probe -addr fe.temporal.domain.com:443 -service temporal.api.workflowservice.v1.WorkflowService -v -tls
parsed options:
> addr=fe.temporal.domain.com:443 conn_timeout=1s rpc_timeout=1s
> tls=true
  > no-verify=false
  > ca-cert=
  > client-cert=
  > client-key=
  > server-name=
> spiffe=false
establishing connection
connection established (took 281.374542ms)
time elapsed: connect=281.374542ms rpc=23.106315ms
status: SERVING

… but the probe fails when checking grpc.health.v1.Health:

grpc-health-probe -addr fe.temporal.domain.com:443 -service grpc.health.v1.Health -v -tls
parsed options:
> addr=fe.temporal.domain.com:443 conn_timeout=1s rpc_timeout=1s
> tls=true
  > no-verify=false
  > ca-cert=
  > client-cert=
  > client-key=
  > server-name=
> spiffe=false
establishing connection
connection established (took 258.169442ms)
service unhealthy (responded with "SERVICE_UNKNOWN")

Has anyone seen this before?

Any help appreciated.

You appear to be setting -tls on grpc-health-probe but not setting client.Options.ConnectionOptions.TLS in your client. Set the latter to at least an empty &tls.Config{} (the equivalent of -tls for grpc-health-probe) and TLS connections should work.


Yep, that was it - many thanks.

For anyone who has the same issue and wants to use the Temporal CLI tctl, set --tls_server_name to the hostname of your ingress, e.g.:

tctl --address fe.temporal.domain.com:443 --tls_server_name fe.temporal.domain.com --namespace default namespace register

HTH
