Potential Error in /docker/config_template.yaml

I was having an issue in my environment where the worker service could not connect to the frontend service to dispatch its scanning workflow. I found this was because the YAML file lacked the correct setting for publicClient->hostPort.

In my environment we use TLS on an ALB in front of the frontend service, so I went looking for the TLS settings. This line in the dockerized YAML:

indicates that the TLS settings should be under:
global->tls->frontend->client->rootCaFiles

but looking at the config.go code, I see a global->tls->systemWorker->client->rootCaFiles path, which seems more correct.

Is there a good reason to use one over the other?

Moreover, I am not sure the global->tls->frontend->client->rootCaFiles path in the YAML actually works for the worker. I was not able to get the worker to pick up those TLS settings.

global->tls->systemWorker is the right way to configure system workers. It was added in PR #1059 (temporalio/temporal: "Add explicit TLS settings for system workers while also supporting legacy config") to be more explicit than reusing the frontend's client settings for that purpose. We left the old functionality intact to keep the code backward compatible. We need to update config_template.yaml to promote usage of systemWorker.
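For example, a minimal sketch of that shape (the hostname and CA path are placeholders for your environment; the key layout mirrors the structs in config.go):

```yaml
# Placeholders only — substitute your ALB hostname and CA bundle path.
publicClient:
    hostPort: temporal-frontend.example.com:443
global:
    tls:
        systemWorker:
            client:
                serverName: temporal-frontend.example.com
                disableHostVerification: false
                rootCaFiles:
                    - /path/to/alb-root-ca.pem
```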

Thank you for the fast reply!

That’s what I gathered from the change, and I’m very glad it was made, or I would have had difficulty with our setup. However, when I update the config accordingly, I see strange behavior (still on version 1.8.1, if that matters).

I continue to get connection errors with “connection closed” as the explanation. I believe this is because the worker is not speaking TLS to the frontend service’s ALB. I tried various settings I believe to be correct: the same certs, hostname verification, etc. that I use with tctl and the web service, yet the “connection closed” errors persist. I have now tried feeding it complete garbage everywhere, including providing both rootCaFiles and rootCaData, and it still only complains about “connection closed”; it never hits the validation that rejects both being set. I also would have expected it to fail on non-existent files. I am wondering if there is some setting I am missing that causes it to ignore the TLS config entirely.

Here is the full config (deliberately full of garbage) I am using, which I extracted directly from debug-mode logs to confirm the server is actually using this data:

archival:
    history:
        enableRead: true
        provider:
            filestore:
                dirMode: "0766"
                fileMode: "0666"
            gstorage: null
            s3store: null
        state: enabled
    visibility:
        enableRead: true
        provider:
            filestore:
                dirMode: "0766"
                fileMode: "0666"
            gstorage: null
            s3store: null
        state: enabled
clusterMetadata:
    clusterInformation:
        active:
            enabled: true
            initialFailoverVersion: 1
            rpcAddress: 127.0.0.1:7233
            rpcName: frontend
    currentClusterName: active
    enableGlobalNamespace: false
    failoverVersionIncrement: 10
    masterClusterName: active
dcRedirectionPolicy:
    policy: noop
    toDC: ""
dynamicConfigClient:
    filepath: config/dynamicconfig/development.yaml
    pollInterval: 1m0s
global:
    authorization:
        authorizer: ""
        claimMapper: ""
        jwtKeyProvider:
            keySourceURIs:
                - ""
                - ""
            refreshInterval: 1m0s
        permissionsClaimName: permissions
    membership:
        broadcastAddress: ""
        maxJoinDuration: 30s
    metrics:
        m3: null
        otprometheus: null
        prefix: ""
        prometheus: null
        statsd:
            flushBytes: 0
            flushInterval: 0s
            hostPort: 127.0.0.1:8125
            prefix: temporal
        tags: {}
    pprof:
        port: 0
    tls:
        expirationChecks:
            checkInterval: 0s
            errorWindow: 0s
            warningWindow: 0s
        frontend:
            client:
                disableHostVerification: true
                rootCaData:
                    - fooo
                rootCaFiles:
                    - foo
                serverName: bananans
            hostOverrides: {}
            server:
                certData: ""
                certFile: ""
                clientCaData:
                    - ""
                    - ""
                clientCaFiles:
                    - ""
                    - ""
                keyData: '******'
                keyFile: ""
                requireClientAuth: false
        internode:
            client:
                disableHostVerification: false
                rootCaData:
                    - fooo
                rootCaFiles:
                    - foo
                serverName: ""
            hostOverrides: {}
            server:
                certData: ""
                certFile: ""
                clientCaData:
                    - fooo
                clientCaFiles:
                    - foo
                keyData: '******'
                keyFile: ""
                requireClientAuth: false
        systemWorker:
            certData: ""
            certFile: ""
            client:
                disableHostVerification: true
                rootCaData:
                    - 643$
                rootCaFiles:
                    - /foo
                serverName: bananans
            keyData: '******'
            keyFile: ""
log:
    level: debug
    outputFile: ""
    stdout: true
namespaceDefaults:
    archival:
        history:
            URI: file:///tmp/temporal_archival/development
            state: disabled
        visibility:
            URI: file:///tmp/temporal_vis_archival/development
            state: disabled
persistence:
    advancedVisibilityStore: ""
    datastores:
        default:
            cassandra:
                connectTimeout: 0s
                consistency: null
                datacenter: ""
                hosts: 10.22.2.75,10.22.7.73,10.22.9.193
                keyspace: temporal
                maxConns: 20
                password: '******'
                port: 9042
                tls:
                    caData: ""
                    caFile: /conf/instaclustr-ca-cert.pem
                    certData: ""
                    certFile: ""
                    enableHostVerification: false
                    enabled: true
                    keyData: '******'
                    keyFile: ""
                    serverName: ""
                user: iccassandra
            customDatastore: null
            elasticsearch: null
            sql: null
        visibility:
            cassandra:
                connectTimeout: 0s
                consistency: null
                datacenter: ""
                hosts: 10.22.2.75,10.22.7.73,10.22.9.193
                keyspace: temporal_visibility
                maxConns: 10
                password: '******'
                port: 9042
                tls:
                    caData: ""
                    caFile: /conf/instaclustr-ca-cert.pem
                    certData: ""
                    certFile: ""
                    enableHostVerification: false
                    enabled: true
                    keyData: '******'
                    keyFile: ""
                    serverName: ""
                user: iccassandra
            customDatastore: null
            elasticsearch: null
            sql: null
    defaultStore: default
    numHistoryShards: 1024
    visibilityStore: visibility
publicClient:
    RefreshInterval: 0s
    hostPort: temporal-frontend.dev.icprivate.com:443
services:
    frontend:
        metrics:
            m3: null
            otprometheus: null
            prefix: ""
            prometheus: null
            statsd: null
            tags: {}
        rpc:
            bindOnIP: 127.0.0.1
            bindOnLocalHost: false
            grpcPort: 7233
            membershipPort: 6933
    history:
        metrics:
            m3: null
            otprometheus: null
            prefix: ""
            prometheus: null
            statsd: null
            tags: {}
        rpc:
            bindOnIP: 127.0.0.1
            bindOnLocalHost: false
            grpcPort: 7234
            membershipPort: 6934
    matching:
        metrics:
            m3: null
            otprometheus: null
            prefix: ""
            prometheus: null
            statsd: null
            tags: {}
        rpc:
            bindOnIP: 127.0.0.1
            bindOnLocalHost: false
            grpcPort: 7235
            membershipPort: 6935
    worker:
        metrics:
            m3: null
            otprometheus: null
            prefix: ""
            prometheus: null
            statsd: null
            tags: {}
        rpc:
            bindOnIP: 127.0.0.1
            bindOnLocalHost: false
            grpcPort: 7239
            membershipPort: 6939

For reference this is the error message I receive:

{"level":"warn","ts":"2021-06-11T05:27:27.087Z","msg":"unable to verify if namespace exist","Namespace":"temporal-system","TaskQueue":"temporal-sys-history-scanner-taskqueue-0","WorkerID":"2032@ip-10-210-122-193.ec2.internal@","WorkerType":"ActivityWorker","Namespace":"temporal-system","Error":"last connection error: connection closed","logging-call-at":"internal_worker.go:258"}

I should also mention that I can telnet from the worker service container to the ALB directly, so I do not believe a separate network issue is in the way.

@SergeyBykov, I was able to overcome the issue by adding a custom TLS provider that returns the same tls.Config I use with the SDK.

I would much rather use the configuration expected by the default provider, but I couldn’t make it work when given the same root CA cert I use in my custom provider. I also could not make it error when I gave it paths to non-existent files or clearly erroneous base64-encoded data, which was concerning.

I believe the root of my problem is that I run the worker as a separate service, and it has to communicate with the frontend via the ALB over TLS, while behind the ALB the frontend runs without TLS enabled.

One thing that struck me as odd is that the worker service uses the PublicClient configuration rather than deriving its knowledge of where the frontend is via Ringpop, like the other services appear to do. Is there a particular reason the worker service reaches the frontend via PublicClient config, unlike the other services?

I’m trying to wrap my head around your setup. Do I understand it correctly that you run Temporal Server services (history, matching, frontend, and system workers) separately with a load balancer in front of the frontend service? And there’s no TLS configured on the frontend, only on the load balancer? If there’s no TLS behind the load balancer, can’t the worker service also run there and connect to the frontend without TLS?

From the worker service side, can you do openssl s_client -connect <frontend's address>:7233 -showcerts to see what server cert it is receiving? The connection gets dropped because either the worker service is unable to validate the server cert or the cert it passes to the frontend doesn’t match the client CA cert configured there.

One thing that struck me as odd is that the worker service uses the PublicClient configuration rather than deriving its knowledge of where the frontend is via Ringpop, like the other services appear to do. Is there a particular reason the worker service reaches the frontend via PublicClient config, unlike the other services?

The worker service acts as a true client. I think it’s actually good that it doesn’t have a ‘backdoor’ into the system. In fact, the legacy TLS configuration before global->tls->systemWorker is an example of tight coupling that only creates problems.

Yes, your understanding is correct. This is a high level diagram of how the system is deployed.

I was under the impression from the docs that the frontend service may also reach out directly to the worker, which is why it participates in service discovery, but as far as I can tell that communication is working fine and is only one-way.

I ran the openssl command from the worker and got the Amazon-issued SSL cert I expected. This makes sense, because everything works fine with my custom TLS factory, and all that factory does is return the Amazon CA cert — the exact same CA cert I was trying to pass via the file path.

package main

import (
	"crypto/tls"
	"crypto/x509"
	_ "embed" // enables the //go:embed directive below
)

// StaticTLSConfigProvider supplies a TLS client config (with the embedded
// AWS root CA) for frontend connections only, and nil (no TLS) for
// everything else.
type StaticTLSConfigProvider struct {
}

func (s *StaticTLSConfigProvider) GetInternodeClientConfig() (*tls.Config, error) {
	return getEmptyTLSConfig(), nil
}

// GetFrontendClientConfig is the only method returning real TLS settings;
// it is what the worker service uses to reach the frontend through the ALB.
func (s *StaticTLSConfigProvider) GetFrontendClientConfig() (*tls.Config, error) {
	return GetTlsConfig(), nil
}

func (s *StaticTLSConfigProvider) GetFrontendServerConfig() (*tls.Config, error) {
	return getEmptyTLSConfig(), nil
}

func (s *StaticTLSConfigProvider) GetInternodeServerConfig() (*tls.Config, error) {
	return getEmptyTLSConfig(), nil
}

// getEmptyTLSConfig returns nil, which disables TLS for that connection.
func getEmptyTLSConfig() *tls.Config {
	return nil
}

// GetTlsConfig builds a client tls.Config whose root CA pool contains only
// the embedded Amazon CA certificate.
func GetTlsConfig() *tls.Config {
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM([]byte(awsCACert)) {
		panic("Failed to parse certificates")
	}
	return &tls.Config{
		RootCAs: roots,
	}
}

//go:embed public_aws_ca_cert.pem
var awsCACert string

Agreed that it is a good thing the worker acts just like any other worker out there. The only part I was confused by is why it participates in service discovery at all, but re-reading the docs, it sounds like other services may command it, which is why it still participates.

My system is working fine under this custom provider configuration, so I am doing okay, though it is a little clunky. The TLS factory needs to return no TLS for the other services, so I wrote some inelegant code that only applies the factory in the worker service, since the factory interface doesn’t expose the “isWorker” boolean I saw leveraged inside the codebase.
The part that still concerns me is that I couldn’t make it crash. I saw the code that checks whether both data and file path are provided, as well as the I/O reads on the file path. When I fed it misconfiguration, I would have expected logs saying it couldn’t parse the certs, or an I/O panic because the file doesn’t exist, etc.