`Unable to execute workflow context deadline exceeded` after setting up mTLS

I have enabled mTLS for both internode and front-end traffic, but now when I try to start a workflow, I get an `Unable to execute workflow context deadline exceeded` error. This error is not very informative, and I’m not sure how to troubleshoot from here. Is there a way to get more verbose errors, or to diagnose the problem further?

  • I’m using the Go SDK, running on my local workstation, connecting to the front-end over a kubectl port-forwarded connection.
  • I’m able to run some tctl commands (for example tctl wf l) using the same cert, key, and root CA.
  • The logs on the worker pod (which is also running in k8s) show similar errors:
{"level":"error","ts":"2021-10-08T18:13:07.173Z","msg":"error starting temporal-sys-history-scanner-workflow workflow","service":"worker","error":"context deadline exceeded","logging-call-at":"scanner.go:191","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflow\n\t/temporal/service/worker/scanner/scanner.go:191\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflowWithRetry.func1\n\t/temporal/service/worker/scanner/scanner.go:168\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflowWithRetry\n\t/temporal/service/worker/scanner/scanner.go:167"}
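
For reference, here’s roughly how my client builds its TLS config (a sketch: the helper name is mine, and the ServerName override is there because I dial localhost through the port-forward while the cert is issued for the cluster DNS name):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// newMTLSConfig builds a *tls.Config from PEM-encoded client cert, key,
// and CA bundle. serverName must match a SAN on the frontend certificate;
// over a kubectl port-forward the dialed host is localhost, so without
// the override, hostname verification fails.
func newMTLSConfig(certPEM, keyPEM, caPEM []byte, serverName string) (*tls.Config, error) {
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates parsed from PEM")
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
		ServerName:   serverName,
	}, nil
}

func main() {
	// Paths as mounted in this setup; skip gracefully if absent.
	certPEM, errC := os.ReadFile("/certs/tls.crt")
	keyPEM, errK := os.ReadFile("/certs/tls.key")
	caPEM, errA := os.ReadFile("/certs/ca.crt")
	if errC != nil || errK != nil || errA != nil {
		fmt.Println("cert files not present; nothing to do outside the pod")
		return
	}
	cfg, err := newMTLSConfig(certPEM, keyPEM, caPEM,
		"temporal-frontend.temporal.svc.cluster.local")
	if err != nil {
		fmt.Println("TLS setup failed:", err)
		return
	}
	// The config is then passed to the SDK client, roughly:
	//   c, err := client.Dial(client.Options{
	//       HostPort:          "localhost:7233", // the port-forward
	//       ConnectionOptions: client.ConnectionOptions{TLS: cfg},
	//   })
	fmt.Println("TLS config ready for", cfg.ServerName)
}
```

(client.Dial is the newer Go SDK entry point; on older SDK versions it was client.NewClient, with the same Options.)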

The fact that the error is coming from scanner makes me think that system workers are unable to connect to the frontends. Can you share how exactly you configured TLS?

Thanks for helping me out, @SergeyBykov. I deployed and modified the Helm chart, and the relevant section of the temporal-config is:

tls:
  internode:
      server:
          requireClientAuth: true
          certFile: /certs/tls.crt
          keyFile: /certs/tls.key
          clientCaFiles:
            - /certs/ca.crt
      client:
          serverName: temporal-frontend.temporal.svc.cluster.local
          rootCaFiles:
            - /certs/ca.crt
  frontend:
      server:
          requireClientAuth: true
          certFile: /certs/tls.crt
          keyFile: /certs/tls.key
          clientCaFiles:
            - /certs/ca.crt
      client:
          serverName: temporal-frontend.temporal.svc.cluster.local
          rootCaFiles:
            - /certs/ca.crt

I realize that the internode and front-end certs ultimately should be different, but I’m just trying to get it basically working first…

I have created a private CA issuer in cert-manager and, for each server, generated a certificate signed by that issuer. Each server then has the relevant secret mounted into the pods at /certs, where tls.crt is the cert, tls.key is the private key, and ca.crt is the CA cert. I verified that these are mounted on the worker, and the following command runs properly from the worker:

tctl --tls_cert_path /certs/tls.crt --tls_key_path /certs/tls.key --tls_ca_path /certs/ca.crt --address temporal-frontend.temporal.svc.cluster.local:7233 wf list
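
The Certificate resources look roughly like this (a sketch; the resource and issuer names are placeholders for what’s in my cluster):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: temporal-frontend-tls        # placeholder name
  namespace: temporal
spec:
  secretName: temporal-frontend-tls  # mounted at /certs in the pod
  issuerRef:
    name: private-ca-issuer          # the private CA issuer mentioned above
    kind: Issuer
  dnsNames:
    - temporal-frontend.temporal.svc.cluster.local
  usages:
    - digital signature
    - key encipherment
    - server auth
    - client auth                    # the same cert doubles as a client cert here
```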

Try adding an explicit systemWorker section similar to this example. That should configure the system worker (which includes the scanner) to connect to the frontend.
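
With the paths from your config, that section would look something like this (a sketch; keep serverName aligned with the SANs your frontend cert actually carries):

```yaml
tls:
  # ... internode and frontend sections as you have them ...
  systemWorker:
    certFile: /certs/tls.crt
    keyFile: /certs/tls.key
    client:
      serverName: temporal-frontend.temporal.svc.cluster.local
      rootCaFiles:
        - /certs/ca.crt
```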

That didn’t seem to resolve the errors. Is there anything I can do to get more detailed error messages? Is there an environment variable to enable verbose logging, or anything like that?

TLS issues are notoriously difficult to debug. I’m unaware of any shortcut here.
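What I’d check with plain openssl (a generic suggestion, not specific to Temporal). The snippet below is self-contained — it mints a throwaway CA and leaf just to demonstrate the commands; in your pods you would point them at /certs/ca.crt and /certs/tls.crt instead:

```shell
# Mint a throwaway CA and a leaf cert carrying the frontend's DNS SAN,
# then run the same checks you would run against /certs in the pods.
dir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$dir/ca.key" -out "$dir/ca.crt" -subj "/CN=demo-ca"
openssl req -newkey rsa:2048 -nodes \
  -keyout "$dir/tls.key" -out "$dir/tls.csr" -subj "/CN=demo-leaf"
printf 'subjectAltName=DNS:temporal-frontend.temporal.svc.cluster.local\n' > "$dir/san.cnf"
openssl x509 -req -days 1 -in "$dir/tls.csr" \
  -CA "$dir/ca.crt" -CAkey "$dir/ca.key" -CAcreateserial \
  -extfile "$dir/san.cnf" -out "$dir/tls.crt"

# 1. Does the CA actually sign the cert?
openssl verify -CAfile "$dir/ca.crt" "$dir/tls.crt"
# 2. Do the SANs cover the serverName values in the config?
openssl x509 -in "$dir/tls.crt" -noout -text | grep -A1 'Subject Alternative Name'
```

From your workstation, openssl s_client -connect localhost:7233 -cert /path/to/tls.crt -key /path/to/tls.key -CAfile /path/to/ca.crt against the port-forward will also show the handshake failure directly, rather than it being swallowed into a context deadline.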
If I change the TLS configuration here to

tls:
        internode:
            server:
                requireClientAuth: true
                certFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.pem
                keyFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.key
                clientCaFiles:
                    - /etc/temporal/config/certs/cluster/ca/server-intermediate-ca.pem
            client:
                serverName: internode.cluster-x.contoso.com
                rootCaFiles:
                    - /etc/temporal/config/certs/cluster/ca/server-intermediate-ca.pem
        frontend:
            server:
                requireClientAuth: true
                certFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.pem
                keyFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.key
                clientCaFiles:
                    - /etc/temporal/config/certs/cluster/ca/server-intermediate-ca.pem
        systemWorker:
            certFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.pem
            keyFile: /etc/temporal/config/certs/cluster/internode/cluster-internode.key
            client:
                serverName: internode.cluster-x.contoso.com
                rootCaFiles:
                    - /etc/temporal/config/certs/cluster/ca/server-intermediate-ca.pem

I’m able to start Temporal with bash start-temporal.sh (after generating certs with bash generate-cert.sh), and I see in the server log

"temporal-sys-history-scanner-workflow workflow successfully started","service":"worker","logging-call-at":"scanner.go:186"

If I remove the systemWorker: section, Temporal fails to start, with the following error in the log, as expected.

"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:242"

That’s why I suggested above to add a systemWorker: section to your config.
I wonder what the delta is between my config and yours; I’m not seeing one. Even if I add an (unnecessary with systemWorker:) client: section within frontend:, Temporal still starts fine for me.