Deploying in AWS - healthchecks

Does the Temporal server have any health check endpoint? How can I set up the target groups to hit Temporal’s gRPC endpoints?

I see this in the AWS docs (Health checks for your target groups - Elastic Load Balancing):

The destination for health checks on the targets.

If the protocol version is HTTP/1.1 or HTTP/2, specify a valid URI (`/path?query`). The default is `/`.

If the protocol version is gRPC, specify the path of a custom health check method with the format `/Package.Class/method` . The default is `/AWS.ALB/healthcheck` .

Probably the easiest way to go here is to pass a health check by opening a TCP connection to the gRPC service endpoint - if you can connect, it’s “healthy”.
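
Roughly, in Terraform, that kind of TCP check on an NLB target group could look like the sketch below (the resource name, VPC reference, and target type are placeholders, not from this thread):

resource "aws_lb_target_group" "temporal_frontend_tcp" {
  name        = "temporal-frontend-tcp"  # hypothetical name
  port        = 7233                     # default frontend gRPC port
  protocol    = "TCP"
  target_type = "ip"
  vpc_id      = var.vpc_id               # assumption: your VPC

  health_check {
    protocol = "TCP"           # target counts as healthy if the TCP connection succeeds
    port     = "traffic-port"
  }
}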

I’m not so familiar with gRPC - I’m coming mostly from HTTP RESTful endpoints. What’s the “grpc service endpoint” you are referring to?

I was expecting something equivalent to /healthcheck for http endpoints.

we don’t have a real grpc health check right now - so best recommendation is just to try and open a connection. the grpc service endpoint i’m talking about is just the temporal endpoint you’d point your client sdk to.

doh - so i was wrong here. temporal itself does have grpc health checks but we’re not using them in our helm charts and instead do a tcp check.

for health checks using target groups you can set the path to: /temporal.api.workflowservice.v1.WorkflowService/Check

My apologies for steering you wrong initially. I’ve added an issue to the helm chart repo to track this also - [Feature Request] Add gRPC heath check via grpc_health_probe · Issue #203 · temporalio/helm-charts · GitHub

Hey @derek, thanks for the follow-up. I tried the path you mentioned, but it appears that the target group health check is failing. Not sure if that path exists.

I tried to look up api/service.proto at master · temporalio/api · GitHub to see if such a health check endpoint exists, but found nothing.

Here are the Terraform options I passed to https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lb_target_group:

path = "/temporal.api.workflowservice.v1.WorkflowService/Check"
matcher = "0"

After some investigation I think the right path would be:

/grpc.health.v1.Health/Check

If you set the `service` field in the request message to `temporal.api.workflowservice.v1.WorkflowService`, it should reply with:

{
  "status": "SERVING"
}

but even if you don’t, it looks like the health check responds with 200 even if the response message is:

{
  "status": "SERVICE_UNKNOWN"
}
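
For anyone doing this in Terraform with an ALB target group, a minimal sketch along these lines might be (the resource name, VPC reference, and target type below are assumptions, not from this thread):

resource "aws_lb_target_group" "temporal_frontend_grpc" {
  name             = "temporal-frontend-grpc"  # hypothetical name
  port             = 7233                      # default frontend gRPC port
  protocol         = "HTTP"
  protocol_version = "GRPC"                    # note: ALBs only support gRPC behind HTTPS listeners
  target_type      = "ip"
  vpc_id           = var.vpc_id                # assumption: your VPC

  health_check {
    enabled = true
    path    = "/grpc.health.v1.Health/Check"
    matcher = "0"  # gRPC status code 0 (OK)
  }
}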

thanks this works now!

Did you get it to work just over HTTP/2?
How did you get it to work - can you provide a curl call?

Just to confirm, with these health checks, should I be able to hit any of the services on “service.name.com:7233/grpc.health.v1.Health/Check” ?

Struggling to find the correct combination to make AWS Application Load Balancers happy. Looks like they expect an HTTP response though… hmm

curl 172.17.0.3:7233/grpc.health.v1.Health/Check -v
*   Trying 172.17.0.3:7233...
* Connected to 172.17.0.3 (172.17.0.3) port 7233 (#0)
> GET /grpc.health.v1.Health/Check HTTP/1.1
> Host: 172.17.0.3:7233
> User-Agent: curl/7.83.1
> Accept: */*
> 
* Received HTTP/0.9 when not allowed
* Closing connection 0
curl: (1) Received HTTP/0.9 when not allowed

I’m not sure if it’s going to be possible to make this play nice with an application load balancer. A network one may work…

Just to confirm, with these health checks, should I be able to hit any of the services on “service.name.com:7233/grpc.health.v1.Health/Check” ?

Each Temporal server service type has a default gRPC port (see here), which you can change if needed. Note that the worker service does not expose a health check currently.

The health check request needs to be gRPC, so I’m not sure curl would work. For the frontend service you could use grpcurl, for example:

grpcurl -plaintext -d '{"service": "temporal.api.workflowservice.v1.WorkflowService"}' 127.0.0.1:7233 grpc.health.v1.Health/Check

I don’t think this is possible with the matching / history services as they don’t expose the reflection API currently. You can, however, use grpc-health-probe for all of frontend / matching / history:

./grpc-health-probe -addr=127.0.0.1:7233 -service=temporal.api.workflowservice.v1.WorkflowService

./grpc-health-probe -addr=127.0.0.1:7235 -service=temporal.api.workflowservice.v1.MatchingService

./grpc-health-probe -addr=127.0.0.1:7234 -service=temporal.api.workflowservice.v1.HistoryService

(change the address part to fit your setup)
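
If the services run as ECS tasks, one option (a sketch, not something from this thread) is a container-level health check that runs grpc-health-probe inside the container. This assumes the probe binary is available in your image, which the stock server images may not provide - that is part of what the helm chart issue linked above is about. The task family, image, and sizing below are placeholders:

resource "aws_ecs_task_definition" "temporal_frontend" {
  family                   = "temporal-frontend"  # hypothetical
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024
  memory                   = 2048

  container_definitions = jsonencode([
    {
      name         = "temporal-frontend"
      image        = "your-registry/temporal-server:tag"  # assumption: image bundles grpc-health-probe
      portMappings = [{ containerPort = 7233, protocol = "tcp" }]
      healthCheck = {
        command     = ["CMD", "/usr/local/bin/grpc-health-probe", "-addr=localhost:7233", "-service=temporal.api.workflowservice.v1.WorkflowService"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }
    }
  ])
}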

OK, so if I make the gRPC ports available via a network load balancer, that part will work. And I can easily give them a consistent name via DNS.

What about the membership ports? I’m unclear on what the local networking requirements are / how the broadcast address comes into play when running these as separate containers on ECS.

I’m assuming that if I gave the membership ports their own network load balancers, that might work, but then how does each service find the others? I’m not sure how local broadcasting will work on ECS.

Can I pass an env var to each service that tells it how to find the other services based on their hostname (and the associated ports exposed via the load balancer)?

Would setting TEMPORAL_BROADCAST_ADDRESS to the DNS name work?

Did you use a Network Load Balancer for all the temporal components?

Or is the frontend component the only one behind a load balancer?