Temporal worker failing to connect to frontend in 1.18.2 after removing publicClient from config

Hi,

We have upgraded to the 1.18.2 server image and, as per the instructions, removed the publicClient section from the config. We use TLS and point directly to the frontend rather than to a load balancer. Since this change, the worker pod fails to connect to the frontend. Below is the message in the worker pod:

{"level":"warn","ts":"2022-10-20T08:51:30.636Z","msg":"error creating sdk client","service":"worker","error":"failed reaching server: context deadline exceeded","logging-call-at":"factory.go:116"}

{"level":"warn","ts":"2022-10-20T08:51:40.097Z","msg":"error creating sdk client","service":"worker","error":"failed reaching server: context deadline exceeded","logging-call-at":"factory.go:116"}

@Andrey_Dubnik - fyi

We have internode routing to the host override because we use JWT for user auth, so default + internode will do mTLS.

Here is the TLS config:

     tls:
        internode:
          server:
            requireClientAuth: true
            certFile: {{ include "temporal.internode-certificates.certFile" $ }}
            keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
            clientCaFiles:
              - {{ include "temporal.internode-certificates.caFiles" $ }}
          client:
            serverName: temporal-internode.cluster
            rootCaFiles:
              - {{ include "temporal.internode-certificates.caFiles" $ }}
        frontend:
          server:
            requireClientAuth: true
            certFile: {{ include "temporal.internode-certificates.certFile" $ }}
            keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
            clientCaFiles:
              - {{ include "temporal.internode-certificates.caFiles" $ }}
          hostOverrides:
            temporal-bench.{{ $.Release.Namespace }}.svc:
              requireClientAuth: true
              certFile: {{ include "temporal.internode-certificates.certFile" $ }}
              keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
              clientCaFiles:
                - {{ include "temporal.internode-certificates.caFiles" $ }}
            temporal-relay.{{ $.Release.Namespace }}.svc:
              requireClientAuth: true
              certFile: {{ include "temporal.internode-certificates.certFile" $ }}
              keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
              clientCaFiles:
                - {{ include "temporal.internode-certificates.caFiles" $ }}    
            temporal-internode.cluster:
              requireClientAuth: true
              certFile: {{ include "temporal.internode-certificates.certFile" $ }}
              keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
              clientCaFiles:
                - {{ include "temporal.internode-certificates.caFiles" $ }}
            {{ $.Values.server.tls.hostOverride }}:
              requireClientAuth: false
              certFile: {{ include "temporal.frontend-certificates.certFile" $ }}
              keyFile: {{ include "temporal.frontend-certificates.keyFile" $ }}
              clientCaFiles:
                - {{ include "temporal.frontend-certificates.caFiles" $ }}
            {{ $.Values.server.tls.internodeHostOverride }}:
              requireClientAuth: true
              certFile: {{ include "temporal.internode-certificates.certFile" $ }}
              keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
              clientCaFiles:
                - {{ include "temporal.internode-certificates.caFiles" $ }}
            {{ include "temporal.fullname" $ }}-frontend.{{ $.Release.Namespace }}.svc:
              requireClientAuth: false
              certFile: {{ include "temporal.frontend-certificates.certFile" $ }}
              keyFile: {{ include "temporal.frontend-certificates.keyFile" $ }}
              clientCaFiles:
                - {{ include "temporal.frontend-certificates.caFiles" $ }}    
        systemWorker:
          certFile: {{ include "temporal.internode-certificates.certFile" $ }}
          keyFile: {{ include "temporal.internode-certificates.keyFile" $ }}
          client:
            serverName: temporal-internode.cluster
            rootCaFiles:
              - {{ include "temporal.internode-certificates.caFiles" $ }}

We had to update the cluster metadata earlier because the schedules API was looking for the remote address in rpcAddress, and that attribute was set to a non-resolvable hostPort. Including it here in case it is relevant:

{{- if $.Values.server.config.clusterMetadata }}
    clusterMetadata:
    {{- with $.Values.server.config.clusterMetadata }}
    {{- toYaml . | nindent 8 }}
    {{- end }}
    {{- else }}
    clusterMetadata:
      enableGlobalNamespace: true
      failoverVersionIncrement: 10
      masterClusterName: {{ $.Values.server.masterClusterName }}
      currentClusterName: {{ $.Values.server.currentClusterName }}
      clusterInformation:
        {{ $.Values.server.currentClusterName }}:
          enabled: true
          initialFailoverVersion: {{ $.Values.server.initialFailoverVersion }}
          rpcName: "temporal-frontend"
          rpcAddress: "temporal-relay.temporal.svc:7233"
    {{- end }}

In theory, this might be caused by the worker's client SDK not handling the TLS response the same way the other server components do. We have seen a similar message when the FQDN/IP of the host being accessed did not match the AltDNS (Subject Alternative Name) entries of the response certificate.

E.g. if the resolver resolves to a list of IP addresses and the worker connects by IP, the client will not proceed, because the frontend certificate does not have the IP in its AltDNS entries.
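To illustrate the mismatch described above, here is a minimal Go sketch. It is not taken from Temporal's code: the helper `selfSignedCert` and the address `10.0.0.12` are made up for the example. It builds a throwaway certificate whose SAN list holds only a DNS name and then checks it against that name and against an IP address, which is roughly what a TLS client does when it dials by IP.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSignedCert (hypothetical helper) returns a parsed self-signed
// certificate whose SAN list contains only the given DNS names and no
// IP addresses.
func selfSignedCert(dnsNames ...string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := selfSignedCert("temporal-internode.cluster")
	if err != nil {
		panic(err)
	}
	// Matching DNS name: VerifyHostname returns nil.
	fmt.Println("by name:", cert.VerifyHostname("temporal-internode.cluster"))
	// IP address (made-up value): the certificate has no IP SANs, so
	// VerifyHostname returns a non-nil error.
	fmt.Println("by IP:  ", cert.VerifyHostname("10.0.0.12"))
}
```

If the worker's connection really is validated against the resolved IP, this is the kind of error that would end the handshake, although the sketch says nothing about what the SDK client actually passes as the server name.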

@Yimin_Chen, @alex - can you please take a look and tell me whether the following makes sense?

Here is a theory I have, which may or may not be valid, but I have no alternative explanation for the event.

The system worker does attempt to connect to the Frontend: we can see the connection attempt in the Frontend logs if we set serverName to something not listed in hostOverrides. The session terminates during the handshake due to a hostname validation failure, because the certificate the client receives back from the Frontend does not contain the host IP (for obvious reasons) that the new service discovery resolves to.

If I got it right (GH reference below), the client TLS config is built with DisableHostVerification set to false, which results in a TLS handshake failure on the worker SDK connection. To handle the IP:PORT scenario that the new discovery produces, DisableHostVerification may need to be set to true, together with a VerifyConnection override that validates the certificate against the pre-defined DNS name (or some alternative solution).

I have also submitted a bug, as there does not seem to be any workaround available: "System worker fails to connect to a Frontend in the TLS enabled cluster after removing the publicClient settings", Issue #3527, temporalio/temporal (github.com)

@Kishore_Gunda please put back the publicClient section as a workaround for now. We are looking into this issue, and we won’t remove support for publicClient before we clear up all known issues.


It does look like 1.18.3 has it sorted. At least on my local cluster, which mimics the real thing, removing the publicClient section no longer leaves the worker crashing.

We will do another sandbox deployment to confirm whether this is still an issue on 1.18.3.

Thanks,
A
