We have upgraded to the 1.18.2 server image and, as per the instructions, removed the publicClient section from the config. We use TLS and point directly at the frontend service rather than a load balancer. Since this change the worker pod fails to connect to the frontend; below is the message from the worker pod.
We route internode traffic via a host override because we use JWT for user auth, so the default and internode paths both do mTLS.
Here is the TLS config:
tls:
  internode:
    server:
      requireClientAuth: true
      certFile: {{ include "temporal.internode-certificates.certFile" $ }}
      keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
      clientCaFiles:
        - {{ include "temporal.internode-certificates.caFiles" $ }}
    client:
      serverName: temporal-internode.cluster
      rootCaFiles:
        - {{ include "temporal.internode-certificates.caFiles" $ }}
  frontend:
    server:
      requireClientAuth: true
      certFile: {{ include "temporal.internode-certificates.certFile" $ }}
      keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
      clientCaFiles:
        - {{ include "temporal.internode-certificates.caFiles" $ }}
    hostOverrides:
      temporal-bench.{{ $.Release.Namespace }}.svc:
        requireClientAuth: true
        certFile: {{ include "temporal.internode-certificates.certFile" $ }}
        keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
        clientCaFiles:
          - {{ include "temporal.internode-certificates.caFiles" $ }}
      temporal-relay.{{ $.Release.Namespace }}.svc:
        requireClientAuth: true
        certFile: {{ include "temporal.internode-certificates.certFile" $ }}
        keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
        clientCaFiles:
          - {{ include "temporal.internode-certificates.caFiles" $ }}
      temporal-internode.cluster:
        requireClientAuth: true
        certFile: {{ include "temporal.internode-certificates.certFile" $ }}
        keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
        clientCaFiles:
          - {{ include "temporal.internode-certificates.caFiles" $ }}
      {{ $.Values.server.tls.hostOverride }}:
        requireClientAuth: false
        certFile: {{ include "temporal.frontend-certificates.certFile" $ }}
        keyFile: {{ include "temporal.frontend-certificates.keyFile" $ }}
        clientCaFiles:
          - {{ include "temporal.frontend-certificates.caFiles" $ }}
      {{ $.Values.server.tls.internodeHostOverride }}:
        requireClientAuth: true
        certFile: {{ include "temporal.internode-certificates.certFile" $ }}
        keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
        clientCaFiles:
          - {{ include "temporal.internode-certificates.caFiles" $ }}
      {{ include "temporal.fullname" $ }}-frontend.{{ $.Release.Namespace }}.svc:
        requireClientAuth: false
        certFile: {{ include "temporal.frontend-certificates.certFile" $ }}
        keyFile: {{ include "temporal.frontend-certificates.keyFile" $ }}
        clientCaFiles:
          - {{ include "temporal.frontend-certificates.caFiles" $ }}
  systemWorker:
    certFile: {{ include "temporal.internode-certificates.certFile" $ }}
    keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
    client:
      serverName: temporal-internode.cluster
      rootCaFiles:
        - {{ include "temporal.internode-certificates.caFiles" $ }}
In case it is relevant: we had to update the cluster metadata earlier because the Schedules API was looking up the remote address via rpcAddress, and that attribute was set to a non-resolvable hostPort.
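For context, this is roughly where that attribute lives in the static config; the cluster name, port, and templating below are illustrative rather than our exact values, and it was the persisted cluster metadata record that actually had to be corrected to a resolvable address:

  clusterMetadata:
    currentClusterName: active
    clusterInformation:
      active:
        enabled: true
        initialFailoverVersion: 1
        rpcName: frontend
        # must point at an address the other services can actually resolve
        rpcAddress: {{ include "temporal.fullname" $ }}-frontend.{{ $.Release.Namespace }}.svc:7233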
In theory, the cause might be that the worker does not have the same client-side TLS handling as the other server components; we have seen a similar message when the FQDN/IP of the host being accessed did not match the DNS names in the returned certificate's SAN.
For example, if the resolver returns a list of IP addresses and the worker connects to one of those IPs, the client will not proceed because the frontend certificate does not include that IP in its SAN.
@Yimin_Chen, @alex - can you please take a look and see whether the following makes sense?
Here is a theory which may or may not be valid, but I have no alternative explanation for what we are seeing.
The system worker is attempting to connect to the frontend: we can see the connection attempt in the frontend logs if we set serverName to something not listed in hostOverrides. The session then terminates during the handshake with a hostname validation failure, because the certificate the client receives back from the frontend does not (for obvious reasons) contain the host IP that the new service discovery resolves to.
If I got it right (GH reference below), the client TLS config is built with DisableHostVerification set to false, which results in a TLS handshake failure on the worker SDK connection. To handle the IP:PORT addresses that the new discovery provides, DisableHostVerification may need to be set to true, combined with a VerifyConnection override that validates against the pre-defined DNS name (or some alternative solution).
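For reference, the knob that already exists in the static config is the per-client disableHostVerification flag; a minimal sketch of flipping it for the system worker is below (values illustrative). On its own it only removes the hostname check rather than re-pinning the expected DNS name, which is why something like a VerifyConnection override would still be needed:

  systemWorker:
    certFile: {{ include "temporal.internode-certificates.certFile" $ }}
    keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
    client:
      serverName: temporal-internode.cluster
      # assumption: skips matching the dialed address against the certificate SANs
      disableHostVerification: true
      rootCaFiles:
        - {{ include "temporal.internode-certificates.caFiles" $ }}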
@Kishore_Gunda please put the publicClient section back as a workaround for now. We are looking into this issue, and we won't remove support for publicClient before we clear up all known issues.
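For anyone following along, restoring it means adding a block like the one below back to the server config; the templated frontend service name and port here are illustrative, your exact values may differ:

  publicClient:
    # address the system worker uses to reach the frontend service
    hostPort: {{ include "temporal.fullname" $ }}-frontend.{{ $.Release.Namespace }}.svc:7233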
It does look like 1.18.3 has this sorted: at least on my local cluster, which mimics the real setup, removing publicClient no longer leaves the worker crashing.
We will do another sandbox deployment to confirm whether this is still an issue after 1.18.3.