`Unable to execute workflow context deadline exceeded` after setting up mTLS

I have enabled mTLS for both internode and front-end, but now when I try to start a workflow, I get an `Unable to execute workflow context deadline exceeded` error. This error is not very informative, and I’m not sure how to troubleshoot from here. Is there a way to get more verbose errors, or to diagnose the problem further?

  • I’m using the Go SDK, running on my local workstation, connecting to the front-end over a kubectl port-forwarded connection.
  • I’m able to run some tctl commands (for example, tctl wf l) using the same cert, key, and root CA.
  • The logs on the worker pod (which is also running in k8s) show similar errors:
{"level":"error","ts":"2021-10-08T18:13:07.173Z","msg":"error starting temporal-sys-history-scanner-workflow workflow","service":"worker","error":"context deadline exceeded","logging-call-at":"scanner.go:191","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflow\n\t/temporal/service/worker/scanner/scanner.go:191\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflowWithRetry.func1\n\t/temporal/service/worker/scanner/scanner.go:168\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflowWithRetry\n\t/temporal/service/worker/scanner/scanner.go:167"}
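
For reference, here is a minimal sketch of the client-side TLS setup described above, using only the Go standard library. The function name `buildTLSConfig` and the command-line arguments are illustrative assumptions, not code from this thread. The resulting `*tls.Config` is the kind of value the Go SDK accepts in `client.Options.ConnectionOptions.TLS`. Setting `ServerName` explicitly matters over a port-forward, because the dialed address (localhost) will not match any name in the frontend certificate.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// buildTLSConfig assembles a client-side mTLS config from PEM-encoded bytes.
// serverName is the name the server certificate is validated against.
func buildTLSConfig(certPEM, keyPEM, caPEM []byte, serverName string) (*tls.Config, error) {
	clientCert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, fmt.Errorf("loading client cert/key: %w", err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in root CA PEM")
	}
	return &tls.Config{
		Certificates: []tls.Certificate{clientCert},
		RootCAs:      roots,
		ServerName:   serverName,
		MinVersion:   tls.VersionTLS12,
	}, nil
}

func main() {
	// Expected arguments (hypothetical): cert file, key file, CA file, server name.
	if len(os.Args) != 5 {
		fmt.Println("usage: tlsconfig <cert.pem> <key.pem> <ca.pem> <serverName>")
		return
	}
	var pems [3][]byte
	for i := range pems {
		data, err := os.ReadFile(os.Args[i+1])
		if err != nil {
			fmt.Println("read error:", err)
			return
		}
		pems[i] = data
	}
	cfg, err := buildTLSConfig(pems[0], pems[1], pems[2], os.Args[4])
	if err != nil {
		fmt.Println("TLS config error:", err)
		return
	}
	fmt.Println("TLS config OK, server will be verified as", cfg.ServerName)
}
```

If this step fails, the problem is in the local files rather than in the cluster, which at least narrows things down before looking at the server side.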

The fact that the error is coming from the scanner makes me think that the system workers are unable to connect to the frontends. Can you share how exactly you configured TLS?

Thanks for helping me out @SergeyBykov. I deployed and modified the Helm chart, and the relevant section of the temporal-config is:

tls:
  internode:
      server:
          requireClientAuth: true
          certFile: /certs/tls.crt
          keyFile: /certs/tls.key
          clientCaFiles:
            - /certs/ca.crt
      client:
          serverName: temporal-frontend.temporal.svc.cluster.local
          rootCaFiles:
            - /certs/ca.crt
  frontend:
      server:
          requireClientAuth: true
          certFile: /certs/tls.crt
          keyFile: /certs/tls.key
          clientCaFiles:
            - /certs/ca.crt
      client:
          serverName: temporal-frontend.temporal.svc.cluster.local
          rootCaFiles:
            - /certs/ca.crt

I realize that the internode and front-end certs ultimately should be different, but I’m just trying to get it basically working first…

I have created a private CA issuer in cert-manager and, for each server, generated a certificate signed by that issuer. Each server then has the relevant secret mounted into its pods at /certs, where tls.crt is the cert, tls.key is the private key, and ca.crt is the CA cert. I verified that these are mounted in the worker, and the following command runs properly from the worker:

tctl --tls_cert_path /certs/tls.crt --tls_key_path /certs/tls.key --tls_ca_path /certs/ca.crt --address temporal-frontend.temporal.svc.cluster.local:7233 wf list

Try adding an explicit systemWorker section similar to this example. That should configure the system worker (which includes the scanner) to connect to the frontend.

That didn’t seem to resolve the errors. Is there anything I can do to get more detailed error messages? Is there an environment variable to enable verbose logging, or anything like that?

TLS issues are notoriously difficult to debug. I’m unaware of any shortcut here.
If I change TLS configuration here to

tls:
        internode:
            server:
                requireClientAuth: true
                certFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.pem
                keyFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.key
                clientCaFiles:
                    - /etc/temporal/config/certs/cluster/ca/server-intermediate-ca.pem
            client:
                serverName: internode.cluster-x.contoso.com
                rootCaFiles:
                    - /etc/temporal/config/certs/cluster/ca/server-intermediate-ca.pem
        frontend:
            server:
                requireClientAuth: true
                certFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.pem
                keyFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.key
                clientCaFiles:
                    - /etc/temporal/config/certs/cluster/ca/server-intermediate-ca.pem            
        systemWorker:
            certFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.pem
            keyFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.key
            client:
                serverName: internode.cluster-x.contoso.com
                rootCaFiles:
                    - /etc/temporal/config/certs/cluster/ca/server-intermediate-ca.pem

I’m able to start Temporal with bash start-temporal.sh (after generating certs with bash generate-cert.sh), and I see in the server log

"temporal-sys-history-scanner-workflow workflow successfully started","service":"worker","logging-call-at":"scanner.go:186"

If I remove the systemWorker: section, Temporal fails to start with the following error in the log, as expected.

"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:242"

That’s why I suggested above to add a systemWorker: section to your config.
I wonder what the delta is between my config and yours. I’m not seeing any. Even if I add a client: section within frontend: (unnecessary with systemWorker:), Temporal still starts fine for me.

Did you get this to work? I also enabled mTLS and see the workers crashing with “error starting scanner”/“context deadline exceeded”. I only configured the server frontend (no internode), similar to

tls:
  frontend:
      server:
          requireClientAuth: true
          certFile: /certs/tls.crt
          keyFile: /certs/tls.key
          clientCaFiles:
            - /certs/ca.crt

I did get it to work - found some important info here: Configure the Temporal Server | Temporal Documentation


Note: In the case that client authentication is enabled, the internode.server certificate is used as the client certificate among services. This adds the following requirements:

The internode.server certificate must be specified on all roles, even for a frontend-only configuration.

Internode server certificates must be minted with either no Extended Key Usages or both ServerAuth and ClientAuth EKUs.

If your Certificate Authorities are untrusted, such as in the previous example, the internode server CA will need to be specified in the following places:

internode.server.clientCaFiles

internode.client.rootCaFiles

frontend.server.clientCaFiles

and also this (can’t remember where I saw it):


`client.serverName` - The server name that is validated against the server's certificate. Because Temporal connects via IP addresses and the IP addresses are ephemeral in Kubernetes, we MUST set this value and it MUST match a name in the DNS section of the certificates for the relevant services.

In my case, I had to add a DNS name in my certs for the history and matching servers that matched the internode.client.serverName setting. Hope this helps…
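
The two certificate requirements above (a DNS name matching serverName, and either no EKUs or both ServerAuth and ClientAuth) can be checked offline before redeploying. Below is a rough sketch using only the Go standard library. The names `checkCertPEM` and `ekuSatisfied` and the command-line arguments are made up for illustration, and it only inspects the first certificate in the file.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
)

// ekuSatisfied encodes the quoted rule: a cert is acceptable if it has no
// Extended Key Usages at all, or if it has both ServerAuth and ClientAuth.
func ekuSatisfied(hasAnyEKU, serverAuth, clientAuth bool) bool {
	return !hasAnyEKU || (serverAuth && clientAuth)
}

// checkCertPEM parses the first certificate in pemBytes and returns a list of
// problems found relative to serverName and the EKU rule.
func checkCertPEM(pemBytes []byte, serverName string) ([]string, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil || block.Type != "CERTIFICATE" {
		return nil, errors.New("no PEM CERTIFICATE block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return nil, fmt.Errorf("parsing certificate: %w", err)
	}

	var problems []string
	// VerifyHostname matches serverName against the certificate's SAN entries.
	if err := cert.VerifyHostname(serverName); err != nil {
		problems = append(problems,
			fmt.Sprintf("serverName %q is not covered; DNS names in cert: %v", serverName, cert.DNSNames))
	}

	serverAuth, clientAuth := false, false
	for _, usage := range cert.ExtKeyUsage {
		switch usage {
		case x509.ExtKeyUsageServerAuth:
			serverAuth = true
		case x509.ExtKeyUsageClientAuth:
			clientAuth = true
		}
	}
	hasAnyEKU := len(cert.ExtKeyUsage) > 0 || len(cert.UnknownExtKeyUsage) > 0
	if !ekuSatisfied(hasAnyEKU, serverAuth, clientAuth) {
		problems = append(problems, "EKUs present but missing ServerAuth and/or ClientAuth")
	}
	return problems, nil
}

func main() {
	// Expected arguments (hypothetical): certificate file, server name.
	if len(os.Args) != 3 {
		fmt.Println("usage: certcheck <cert.pem> <serverName>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	problems, err := checkCertPEM(data, os.Args[2])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if len(problems) == 0 {
		fmt.Println("no problems found")
		return
	}
	for _, p := range problems {
		fmt.Println("problem:", p)
	}
}
```

Running it against each service's tls.crt with the value of internode.client.serverName would have flagged the missing DNS name on my history and matching certs.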

Thank you so much! That helps a great deal and I suspect you saved me quite a bit of time. I think I recall reading that section now.