Temporal workloads unable to talk to each other when STRICT mTLS is enabled in Istio

Hello Team,
We recently deployed Temporal workloads (version 1.18.4) using Helm.
The Temporal server components are unable to talk to each other when Istio mTLS is set to STRICT mode; when set back to PERMISSIVE, everything works fine.
Here are the error logs from the frontend:

{"level":"info","ts":"2022-12-05T06:36:41.311Z","msg":"matching client encountered error","service":"frontend","error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:220"}
{"level":"info","ts":"2022-12-05T06:36:41.558Z","msg":"matching client encountered error","service":"frontend","error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:220"}
{"level":"error","ts":"2022-12-05T06:36:41.558Z","msg":"Unable to call matching.PollWorkflowTaskQueue.","service":"frontend","wf-task-queue-name":"temp-temporal-worker-7996cf6f-fx4br:e1cbef58-93c0-44ae-b36f-514117cc12d0","timeout":"56.599040821s","error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","logging-call-at":"workflow_handler.go:894","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue\n\t/home/builder/temporal/service/frontend/workflow_handler.go:894\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue.func2\n\t/home/builder/temporal/service/frontend/dcRedirectionHandler.go:598\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/home/builder/temporal/service/frontend/dcRedirectionPolicy.go:125\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue\n\t/home/builder/temporal/service/frontend/dcRedirectionHandler.go:594\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.12.0/workflowservice/v1/service.pb.go:1516\ngo.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept.func1\n\t/home/builder/temporal/common/rpc/interceptor/retry.go:63\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/retry.go:67\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1135\ngo.temporal.io/server/common/rpc/interceptor.(*CallerInfoInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/caller_info.go:79\ngoogle.gola
ng.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*SDKVersionInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/sdk_version.go:69\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/rate_limit.go:86\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_rate_limit.go:91\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_count_limit.go:99\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).StateValidationIntercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_validator.go:132\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/home/builder/temporal/common/authorization/interceptor.go:152\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:136\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:
1138\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.32.0/interceptor.go:325\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_logger.go:84\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).LengthValidationIntercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_validator.go:103\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/home/builder/temporal/common/rpc/grpc.go:137\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1140\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.12.0/workflowservice/v1/service.pb.go:1518\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1301\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1642\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go
:938"}

Here are the error logs from history:

{"level":"info","ts":"2022-12-05T06:00:09.073Z","msg":"matching client encountered error","service":"history","error":"upstream request timeout","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:220"}
{"level":"error","ts":"2022-12-05T06:00:09.073Z","msg":"Fail to process task","shard-id":165,"address":"10.105.64.25:7234","component":"transfer-queue-processor","cluster-name":"active","wf-namespace-id":"d4ca24ab-8544-4220-953b-36dae1bb1032","wf-id":"AutoRemoveInactiveCustomerWorkflow","wf-run-id":"f58cfae5-1d87-45ed-b2b7-6a56685fed77","queue-task-id":1512046611,"queue-task-visibility-timestamp":"2022-12-05T06:00:00.575Z","queue-task-type":"TransferWorkflowTask","queue-task":{"NamespaceID":"d4ca24ab-8544-4220-953b-36dae1bb1032","WorkflowID":"AutoRemoveInactiveCustomerWorkflow","RunID":"f58cfae5-1d87-45ed-b2b7-6a56685fed77","VisibilityTimestamp":"2022-12-05T06:00:00.57577858Z","TaskID":1512046611,"TaskQueue":"KYC","ScheduledEventID":2,"Version":0},"wf-history-event-id":2,"error":"context deadline exceeded","lifecycle":"ProcessingFailed","logging-call-at":"lazy_logger.go:68","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/log.(*lazyLogger).Error\n\t/home/builder/temporal/common/log/lazy_logger.go:68\ngo.temporal.io/server/service/history/queues.(*executableImpl).HandleErr\n\t/home/builder/temporal/service/history/queues/executable.go:294\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:226\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:170\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:171\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:235\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:211"}

And from matching:

{"level":"warn","ts":"2022-12-05T06:39:07.952Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temp-temporal-worker-7996cf6f-fx4br@","WorkerType":"ActivityWorker","Error":"context deadline exceeded","logging-call-at":"internal_worker_base.go:298"}
{"level":"warn","ts":"2022-12-05T06:39:14.852Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temp-temporal-worker-7996cf6f-fx4br@","WorkerType":"WorkflowWorker","Error":"context deadline exceeded","logging-call-at":"internal_worker_base.go:298"}
{"level":"warn","ts":"2022-12-05T06:39:22.429Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temp-temporal-worker-7996cf6f-fx4br@","WorkerType":"ActivityWorker","Error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","logging-call-at":"internal_worker_base.go:298"}
{"level":"warn","ts":"2022-12-05T06:39:23.573Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temp-temporal-worker-7996cf6f-fx4br@","WorkerType":"WorkflowWorker","Error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","logging-call-at":"internal_worker_base.go:298"}

I tried to test the connectivity between the pods using grpcurl, and there seems to be no such issue:

root@test:/# grpcurl -plaintext temp-temporal-matching-headless:7235 list
Failed to list services: server does not support the reflection API

root@test:/# grpcurl -plaintext temp-temporal-frontend-headless:7233 list
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
temporal.api.operatorservice.v1.OperatorService
temporal.api.workflowservice.v1.WorkflowService
temporal.server.api.adminservice.v1.AdminService

FYI: we define resource requests on the workloads and use no quotas on the namespace.

Please suggest what can be checked, and let us know if we are missing something.

Updated - 6/12/22:

I found that there is connectivity between the server components when Kubernetes service names are used; however, when the pods' IP addresses are used, the connections are terminated. Do the components use pod IPs to talk to each other? If so, can we change this behaviour by setting a domain name?

Thanks in advance.

Do you set global->membership->broadcastAddress in your config, or set the POD_IP env var for the pods?
broadcastAddress is the address used by ringpop to communicate membership info; it defaults to 0.0.0.0.

If you are deploying using the Temporal Helm charts, my guess is that this is caused by the config setting status.podIP here.
Maybe something like this helps?
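For reference, this is roughly the mechanism involved. The downward-API env var is standard Kubernetes; the broadcastAddress templating is a sketch and may differ from what your chart version actually renders:

```yaml
# Downward-API env var on the pod, exposing the pod's IP as POD_IP
# (standard Kubernetes; this is how status.podIP reaches the container):
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP

# In Temporal's static config, broadcastAddress can then pick it up.
# The exact templating below is an assumption based on how the Helm
# chart's config template consumes environment variables:
global:
  membership:
    broadcastAddress: "{{ .Env.POD_IP }}"
```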

Yes, the POD_IP variable is already set by default.
I'm just wondering how the frontend knows the other servers' (matching/worker) IP addresses; currently, broadcastAddress is set to the frontend's own IP.

We are also facing the same issue in our environment; any help/pointers here would be highly appreciated.

I have managed to set this up successfully. I came across a few nuances/issues with Istio when I was getting it working. Two problems I remember encountering were:

  1. According to Istio, for headless services to work automatically, the ports must be declared in a Service resource.
  2. The Temporal Helm chart uses port names that cause Istio to identify the traffic as gRPC (prefixed 'grpc-'). I found the Envoy gRPC proxy interfered with it, causing connectivity failures.

This is what one of my services looks like now:

apiVersion: v1
kind: Service
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "5"
  labels:
    app: temporal-frontend
    app.kubernetes.io/headless: 'true'
  name: frontend-headless
spec:
  clusterIP: None
  ports:
  - name: grpc-rpc
    port: 7233
    appProtocol: tcp
    protocol: TCP
    targetPort: rpc
  - name: grpc-membership
    port: 6933
    appProtocol: tcp
    protocol: TCP
    targetPort: membership
  - name: metrics
    port: 9090
    appProtocol: http
    protocol: TCP
    targetPort: metrics
  publishNotReadyAddresses: true
  selector:
    app: temporal-frontend
  type: ClusterIP

Notice the use of appProtocol to force the traffic to be detected as plain old TCP by Istio. I'd be interested to know whether it is workable when autodetected as gRPC traffic and, if so, how that is achieved.
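For anyone reproducing this: the STRICT mode discussed in this thread is set via an Istio PeerAuthentication resource along these lines (the namespace name here is just an example; adjust it to wherever Temporal is deployed):

```yaml
# Mesh policy that enforces mutual TLS for all workloads in the namespace.
# With this in place, plaintext pod-to-pod traffic is rejected, which is
# why the headless-service/appProtocol details above matter.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: temporal   # example namespace, an assumption
spec:
  mtls:
    mode: STRICT
```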


Hey @craigd, thanks for the info. While it is indeed working, we face a weird issue: we are unable to make it work in the temporal namespace, though it works for other namespaces :sweat_smile:.

Hey @craigd, thanks a lot for sharing the details. With your suggested fix I am able to deploy Temporal successfully on Istio with STRICT mTLS enabled.

Hi @craigd, @mmlk, I am facing a similar issue: the Temporal server fails when Istio mTLS is set to STRICT.

It's failing with an "Unable to bootstrap Ringpop" error. Everything works well when Istio mTLS is set to PERMISSIVE.

I have created a topic with all the information:

https://community.temporal.io/t/unable-to-bootstrap-ringpop-when-strict-mtls-is-enabled-in-istio/9274

Could anyone take a look and point out the changes needed for it to work?

Thanks in advance.

@craigd Thank you so much for the solution you've described! I was stuck on the same problem for two days; I had already set up Istio ServiceEntries and exposed the ports on the headless services, but the last piece I was missing was the appProtocol: tcp specification. Now all the Temporal services seem to work.