Potential Error in /docker/config_template.yaml

I was having an issue in my environment where the worker service could not connect to the frontend service to dispatch its scanning workflow. I found this was because the yaml file lacked the correct setting for publicClient->hostPort.
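
A minimal sketch of that setting (the hostname below is just a placeholder; in my case it points at our ALB):

    publicClient:
        hostPort: my-frontend-address.example.com:443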

In my environment we use TLS on an ALB in front of the frontend, so I went looking for the TLS settings. This line in the dockerized yaml:

indicates that the TLS settings should be under:
global->tls->frontend->client->rootCaFiles

but looking at the config.go code, I see global->tls->systemWorker->client->rootCaFiles, which seems more correct.

Is there a good reason to use one over the other?
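
For reference, these are the two shapes I am comparing (a trimmed sketch; the CA path is a placeholder):

    global:
        tls:
            frontend:
                client:
                    rootCaFiles:
                        - /path/to/alb-root-ca.pem
            systemWorker:
                client:
                    rootCaFiles:
                        - /path/to/alb-root-ca.pem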

Moreover, looking at it, I am not sure the global->tls->frontend->client->rootCaFiles path in the yaml actually works for the worker. I was not able to get the worker to leverage those TLS settings.

global->tls->systemWorker is the right way to configure system workers. It was added in temporalio/temporal PR #1059 (Add explicit TLS settings for system workers while also supporting legacy config) to be more explicit than using frontend's client settings for that purpose. We left the old functionality intact to keep the code backward compatible. We need to update config_template.yaml to promote usage of SystemWorker.

Thank you for the fast reply!

That’s what I gathered from the change, and I’m very glad it was made, or I would have had difficulty with our setup. However, when I update the config accordingly, I am seeing strange behavior (still on version 1.8.1, if that matters).
I continue to get connection errors with connection closed as the explanation. I believe this is because the worker is not speaking TLS to the frontend service’s ALB. I tried various settings I believe to be correct (the same certs, hostname verification, etc. that I use with tctl and the web service), yet the connection closed errors persist. I have now tried giving it complete garbage everywhere, including providing both rootCaFiles and rootCaData, and it still only complains about connection closed; it is not even hitting the validation that rejects providing both. I also would have expected it to barf at non-existent files. I am wondering if there is some setting I am missing that is causing it to ignore the TLS config.

Here is the full config, full of garbage, that I am using. I extracted it directly from the logs in debug mode to confirm the service is using this data:

archival:
    history:
        enableRead: true
        provider:
            filestore:
                dirMode: "0766"
                fileMode: "0666"
            gstorage: null
            s3store: null
        state: enabled
    visibility:
        enableRead: true
        provider:
            filestore:
                dirMode: "0766"
                fileMode: "0666"
            gstorage: null
            s3store: null
        state: enabled
clusterMetadata:
    clusterInformation:
        active:
            enabled: true
            initialFailoverVersion: 1
            rpcAddress: 127.0.0.1:7233
            rpcName: frontend
    currentClusterName: active
    enableGlobalNamespace: false
    failoverVersionIncrement: 10
    masterClusterName: active
dcRedirectionPolicy:
    policy: noop
    toDC: ""
dynamicConfigClient:
    filepath: config/dynamicconfig/development.yaml
    pollInterval: 1m0s
global:
    authorization:
        authorizer: ""
        claimMapper: ""
        jwtKeyProvider:
            keySourceURIs:
                - ""
                - ""
            refreshInterval: 1m0s
        permissionsClaimName: permissions
    membership:
        broadcastAddress: ""
        maxJoinDuration: 30s
    metrics:
        m3: null
        otprometheus: null
        prefix: ""
        prometheus: null
        statsd:
            flushBytes: 0
            flushInterval: 0s
            hostPort: 127.0.0.1:8125
            prefix: temporal
        tags: {}
    pprof:
        port: 0
    tls:
        expirationChecks:
            checkInterval: 0s
            errorWindow: 0s
            warningWindow: 0s
        frontend:
            client:
                disableHostVerification: true
                rootCaData:
                    - fooo
                rootCaFiles:
                    - foo
                serverName: bananans
            hostOverrides: {}
            server:
                certData: ""
                certFile: ""
                clientCaData:
                    - ""
                    - ""
                clientCaFiles:
                    - ""
                    - ""
                keyData: '******'
                keyFile: ""
                requireClientAuth: false
        internode:
            client:
                disableHostVerification: false
                rootCaData:
                    - fooo
                rootCaFiles:
                    - foo
                serverName: ""
            hostOverrides: {}
            server:
                certData: ""
                certFile: ""
                clientCaData:
                    - fooo
                clientCaFiles:
                    - foo
                keyData: '******'
                keyFile: ""
                requireClientAuth: false
        systemWorker:
            certData: ""
            certFile: ""
            client:
                disableHostVerification: true
                rootCaData:
                    - 643$
                rootCaFiles:
                    - /foo
                serverName: bananans
            keyData: '******'
            keyFile: ""
log:
    level: debug
    outputFile: ""
    stdout: true
namespaceDefaults:
    archival:
        history:
            URI: file:///tmp/temporal_archival/development
            state: disabled
        visibility:
            URI: file:///tmp/temporal_vis_archival/development
            state: disabled
persistence:
    advancedVisibilityStore: ""
    datastores:
        default:
            cassandra:
                connectTimeout: 0s
                consistency: null
                datacenter: ""
                hosts: 10.22.2.75,10.22.7.73,10.22.9.193
                keyspace: temporal
                maxConns: 20
                password: '******'
                port: 9042
                tls:
                    caData: ""
                    caFile: /conf/instaclustr-ca-cert.pem
                    certData: ""
                    certFile: ""
                    enableHostVerification: false
                    enabled: true
                    keyData: '******'
                    keyFile: ""
                    serverName: ""
                user: iccassandra
            customDatastore: null
            elasticsearch: null
            sql: null
        visibility:
            cassandra:
                connectTimeout: 0s
                consistency: null
                datacenter: ""
                hosts: 10.22.2.75,10.22.7.73,10.22.9.193
                keyspace: temporal_visibility
                maxConns: 10
                password: '******'
                port: 9042
                tls:
                    caData: ""
                    caFile: /conf/instaclustr-ca-cert.pem
                    certData: ""
                    certFile: ""
                    enableHostVerification: false
                    enabled: true
                    keyData: '******'
                    keyFile: ""
                    serverName: ""
                user: iccassandra
            customDatastore: null
            elasticsearch: null
            sql: null
    defaultStore: default
    numHistoryShards: 1024
    visibilityStore: visibility
publicClient:
    RefreshInterval: 0s
    hostPort: temporal-frontend.dev.icprivate.com:443
services:
    frontend:
        metrics:
            m3: null
            otprometheus: null
            prefix: ""
            prometheus: null
            statsd: null
            tags: {}
        rpc:
            bindOnIP: 127.0.0.1
            bindOnLocalHost: false
            grpcPort: 7233
            membershipPort: 6933
    history:
        metrics:
            m3: null
            otprometheus: null
            prefix: ""
            prometheus: null
            statsd: null
            tags: {}
        rpc:
            bindOnIP: 127.0.0.1
            bindOnLocalHost: false
            grpcPort: 7234
            membershipPort: 6934
    matching:
        metrics:
            m3: null
            otprometheus: null
            prefix: ""
            prometheus: null
            statsd: null
            tags: {}
        rpc:
            bindOnIP: 127.0.0.1
            bindOnLocalHost: false
            grpcPort: 7235
            membershipPort: 6935
    worker:
        metrics:
            m3: null
            otprometheus: null
            prefix: ""
            prometheus: null
            statsd: null
            tags: {}
        rpc:
            bindOnIP: 127.0.0.1
            bindOnLocalHost: false
            grpcPort: 7239
            membershipPort: 6939

For reference this is the error message I receive:

{"level":"warn","ts":"2021-06-11T05:27:27.087Z","msg":"unable to verify if namespace exist","Namespace":"temporal-system","TaskQueue":"temporal-sys-history-scanner-taskqueue-0","WorkerID":"2032@ip-10-210-122-193.ec2.internal@","WorkerType":"ActivityWorker","Namespace":"temporal-system","Error":"last connection error: connection closed","logging-call-at":"internal_worker.go:258"}

I should also mention that I can telnet from the worker service container to the ALB directly, so I do not believe there is a separate network issue in the way.

@SergeyBykov, I was able to overcome the issue by adding a custom TLS provider that sets the tls.Config directly to the same tls.Config I use with the SDK.

I would much rather use the configuration expected by the default provider, but I couldn’t make it work when given the same root CA cert I use in my custom provider. I also could not make it error when I gave it paths to non-existent files or clearly erroneous base64-encoded data, which was concerning.

I believe the root of my problem is that I run the worker as a separate service and it has to communicate with the frontend via the ALB over TLS, while behind the ALB the frontend is running without TLS enabled.
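
To make that concrete, this is roughly how the topology maps onto my config (a trimmed sketch; the CA path is a placeholder):

    publicClient:
        hostPort: temporal-frontend.dev.icprivate.com:443   # the ALB; TLS terminates here
    global:
        tls:
            systemWorker:
                client:
                    rootCaFiles:
                        - /path/to/amazon-root-ca.pem        # CA for validating the ALB's server cert
            # no frontend server certs configured, since the frontend itself runs plaintext behind the ALB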

One thing that struck me is that it feels odd that the worker service would run leveraging the PublicClient configuration and not derive its knowledge of where Frontend is over Ringpop like the other services appear to do. Is there a particular reason the worker service gets to the Frontend via PublicClient config, unlike the other services?

I’m trying to wrap my head around your setup. Do I understand it correctly that you run Temporal Server services (history, matching, frontend, and system workers) separately with a load balancer in front of the frontend service? And there’s no TLS configured on the frontend, only on the load balancer? If there’s no TLS behind the load balancer, can’t the worker service also run there and connect to the frontend without TLS?

From the worker service side, can you do openssl s_client -connect <frontend's address>:7233 -showcerts to see what server cert it is receiving? The connection gets dropped because either the worker service is unable to validate the server cert or the cert it passes to the frontend doesn’t match the client CA cert configured there.

One thing that struck me is that it feels odd that the worker service would run leveraging the PublicClient configuration and not derive its knowledge of where Frontend is over Ringpop like the other services appear to do. Is there a particular reason the worker service gets to the Frontend via PublicClient config, unlike the other services?

The worker service acts as a true client. I think it’s actually good that it doesn’t have a ‘backdoor’ into the system. In fact, the legacy TLS configuration before global->tls->systemWorker is an example of tight coupling that only creates problems.

Yes, your understanding is correct. This is a high level diagram of how the system is deployed.

I was under the impression from the docs that the frontend service may also reach out directly to the worker, which is why it participates in the service discovery, but as far as I can tell that communication is working fine and only one way.

I ran the openssl command from the worker, and got the Amazon-issued SSL cert I expected. This makes sense, because with my custom TLS factory everything works fine. All my custom factory does is return the Amazon CA cert, same exact CA cert I was trying to pass on the path.

package main

import (
	"crypto/tls"
	"crypto/x509"
	_ "embed"
)

// StaticTLSConfigProvider returns a fixed tls.Config for the frontend client
// connection and no TLS for everything else.
type StaticTLSConfigProvider struct {
}

func (s *StaticTLSConfigProvider) GetInternodeClientConfig() (*tls.Config, error) {
	return getEmptyTLSConfig(), nil
}

func (s *StaticTLSConfigProvider) GetFrontendClientConfig() (*tls.Config, error) {
	return GetTlsConfig(), nil
}

func (s *StaticTLSConfigProvider) GetFrontendServerConfig() (*tls.Config, error) {
	return getEmptyTLSConfig(), nil
}

func (s *StaticTLSConfigProvider) GetInternodeServerConfig() (*tls.Config, error) {
	return getEmptyTLSConfig(), nil
}

// getEmptyTLSConfig returns nil, which leaves the corresponding connection without TLS.
func getEmptyTLSConfig() *tls.Config {
	return nil
}

// GetTlsConfig builds a tls.Config that trusts the embedded Amazon root CA cert
// for validating the ALB's server certificate.
func GetTlsConfig() *tls.Config {
	rootPEM := awsCACert
	roots := x509.NewCertPool()
	ok := roots.AppendCertsFromPEM([]byte(rootPEM))
	if !ok {
		panic("Failed to parse certificates")
	}
	return &tls.Config{
		RootCAs: roots,
	}
}

//go:embed public_aws_ca_cert.pem
var awsCACert string

Agreed on it being a good thing the worker acts just like any other worker out there. The only part I was confused by is why it participates in service discovery at all, but re-reading the docs it sounds like other services will command it which is why it still participates.

My system is working fine under this custom provider configuration, so I am doing okay, but it is a little clunky. The TLS factory needs to return a non-TLS config for the other services, so I made some inelegant code that only applies the factory to the worker service, since the factory interface didn’t have the “isWorker” boolean I saw leveraged inside the codebase.
The part of this that still concerns me is that I couldn’t make it crash. I saw the code that checks whether both data and a file path are provided, as well as the I/O reads on the file path. When I fed it misconfiguration, I would have expected to see logs saying it couldn’t parse the certs, or an I/O error because the file doesn’t exist, etc.

I was under the impression from the docs that the frontend service may also reach out directly to the worker, which is why it participates in the service discovery, but as far as I can tell that communication is working fine and only one way.

Yes, it’s only one way - worker service calling frontend. Hence, there shouldn’t be an arrow from frontend to worker service on the diagram above.

All my custom factory does is return the Amazon CA cert, same exact CA cert I was trying to pass on the path.

I don’t understand why this is a CA cert, and not a leaf one. Or do you mean a leaf cert signed by Amazon’s CA?

The only part I was confused by is why it participates in service discovery at all, but re-reading the docs it sounds like other services will command it which is why it still participates.

I believe this is more for scaling it out - to run multiple instances (pods in k8s) of the worker service. Not for sending any commands directly to them.

The TLS factory needs to return a non-TLS config for the other services, so I made some inelegant code that only applies the factory to the worker service, since the factory interface didn’t have the “isWorker” boolean I saw leveraged inside the codebase.

Were you not able to configure the worker service with a static config file with only TLS.SystemWorker piece filled and a separate no-TLS config file for the history/matching/frontend services?
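
I.e., the worker deployment would get a config with roughly only this TLS piece filled in (a sketch; the path and name are placeholders), and the history/matching/frontend deployments would get a config with no tls section at all:

    global:
        tls:
            systemWorker:
                client:
                    rootCaFiles:
                        - /path/to/alb-root-ca.pem
                    serverName: your-alb-dns-name
                    disableHostVerification: false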

The part of this that still concerns me is that I couldn’t make it crash. I saw the code that checks whether both data and a file path are provided, as well as the I/O reads on the file path. When I fed it misconfiguration, I would have expected to see logs saying it couldn’t parse the certs, or an I/O error because the file doesn’t exist, etc.

I need to check that. The verification code should be there.

I don’t understand why this is a CA cert, and not a leaf one. Or do you mean a leaf cert signed by Amazon’s CA?

Perhaps this is where I am going wrong. All I am trying to pass is the root CA cert for validating the server cert on the frontend service’s ALB. I do not use any mutual TLS auth or anything of that nature.

I believe this is more for scaling it out - to run multiple instances (pods in k8s) of the worker service.
Not for sending any commands directly to them.

Oh, so the worker coordinates with the other workers, effectively talking to itself? Thank you, that’s good to know. I noticed all services seem to listen to the ring change events of all other services, which led me to think they had more communication than they evidently do.

Were you not able to configure the worker service with a static config file with only TLS.SystemWorker piece filled and a separate no-TLS config file for the history/matching/frontend services?

No, I was not. I set only the hostname validation and root CA portion of the system worker config. What I saw in the log was a “last connection error: connection closed”, which I have typically seen as a result of trying to talk HTTP when the server expects HTTPS.
Here is the dockerize template I was using (I set the env vars to the same root CA cert I use in the static config):

        systemWorker:
            certFile:
            keyFile:
            certData:
            keyData:
            client:
                serverName: {{ default .Env.TEMPORAL_TLS_SERVER_NAME "" }}
                disableHostVerification: {{ default .Env.TEMPORAL_TLS_FRONTEND_DISABLE_HOST_VERIFICATION "false"}}
                rootCaFiles:
                    - {{ default .Env.TEMPORAL_TLS_CA_PATH "" }}
                rootCaData:
                    - {{ default .Env.TEMPORAL_TLS_CA_CERT_DATA "" }}

One thing to note: I am still on 1.8.1. I am working on a process for no-downtime upgrades, and have been holding off on upgrading so that I can test that process against the most recent updates.

Perhaps this is where I am going wrong. All I am trying to pass is the root CA cert for validating the server cert on the frontend service’s ALB. I do not use any mutual TLS auth or anything of that nature.

If you are not using mTLS, then never mind. I wrongly assumed mTLS here.

Oh, so the worker coordinates with the other workers, effectively talking to itself? Thank you, that’s good to know. I noticed all services seem to listen to the ring change events of all other services, which led me to think they had more communication than they evidently do.

I don’t believe they even coordinate. I think that’s just an artifact of reusing the same service wrapper code. But I can double-check.

No, I was not. I set only the hostname validation and root CA portion of the system worker config. What I saw in the log was a “last connection error: connection closed”, which I have typically seen as a result of trying to talk HTTP when the server expects HTTPS.

This is very strange. As if it’s failing to validate the ALB cert/server name. Does the ALB name include that server name in it? What happens if you disable host verification? Does it connect successfully?
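
E.g., a quick test with something like this (just a sketch) would rule server name validation in or out:

    global:
        tls:
            systemWorker:
                client:
                    disableHostVerification: true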

One thing to note: I am still on 1.8.1. I am working on a process for no-downtime upgrades, and have been holding off on upgrading so that I can test that process against the most recent updates.

That’s fine. This code hasn’t changed in a while. 1.9.0 added a couple of relevant features:

  • Periodic refreshing of TLS certificates from files (#1415)
  • Ability to inject TLS certificate provider (#1391)

I tried both with and without hostname verification in the config-file case, without any luck.
I use the ALB’s name for the hostname verification. That same configuration works fine with tctl and the node webserver (Temporal UI). I haven’t enabled hostname verification on my Go/Ruby SDK-based workers yet, but otherwise the config is the same there.
My ultimate solution was just to use the same TLS config I give to my Go/Ruby SDK workers via a ServerOption TLS factory. With that config provided by the plugin, everything started working fine.
I intend to add hostname verification to that config soonish; it just keeps getting pushed down the backlog.

There was a relevant PR submitted. Are you involved in it by chance?

As I was digging through the code, I realized that my memory had failed me. Worker Service is supposed to connect to the internode endpoint, not to the frontend. Hence, it shouldn’t be going through the ALB in your case.

Sadly not! I wish I could have traced the issue properly but I couldn’t devote the time.

Interesting that it is supposed to connect to internode. It appeared to me that it always communicated via the PublicClient-configured endpoint, which is why I put my TLS-enabled ALB under PublicClient.

Is there a way to specify the internode address in PublicClient? I was worried I’d need a special DNS config that returned all the addresses in order to use it that way. While that is understandably very common for places that focus on gRPC, I don’t have access to that kind of DNS infra/lookaside LB at my location.

I submitted a modified version of that PR - Fix enabled flag for TLS-FrontendClientConfig by sergeybykov · Pull Request #1678 · temporalio/temporal · GitHub.

Very cool. I don’t have a lab environment that’s quite the same as my dev/prod, so I will validate that I can remove my workaround after this fix is released, as part of my upgrade cycle.

Thank you for the continuous followup this has been really amazing!
