Hello Team,
We recently deployed Temporal workloads (version 1.18.4) using Helm.
The Temporal server components are unable to talk to each other when Istio mTLS is set to STRICT mode; when set back to PERMISSIVE, everything works fine.
Here are the error logs from the frontend service:
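For context, this is roughly how we toggle the mode; the mesh-wide PeerAuthentication below is illustrative (the actual policy name and namespace in your mesh may differ):

```shell
# Illustrative mesh-wide policy; we switch spec.mtls.mode between
# STRICT (broken) and PERMISSIVE (working).
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
EOF
```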
{"level":"info","ts":"2022-12-05T06:36:41.311Z","msg":"matching client encountered error","service":"frontend","error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:220"}
{"level":"info","ts":"2022-12-05T06:36:41.558Z","msg":"matching client encountered error","service":"frontend","error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:220"}
{"level":"error","ts":"2022-12-05T06:36:41.558Z","msg":"Unable to call matching.PollWorkflowTaskQueue.","service":"frontend","wf-task-queue-name":"temp-temporal-worker-7996cf6f-fx4br:e1cbef58-93c0-44ae-b36f-514117cc12d0","timeout":"56.599040821s","error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","logging-call-at":"workflow_handler.go:894","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/service/frontend.(*WorkflowHandler).PollWorkflowTaskQueue\n\t/home/builder/temporal/service/frontend/workflow_handler.go:894\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue.func2\n\t/home/builder/temporal/service/frontend/dcRedirectionHandler.go:598\ngo.temporal.io/server/service/frontend.(*NoopRedirectionPolicy).WithNamespaceRedirect\n\t/home/builder/temporal/service/frontend/dcRedirectionPolicy.go:125\ngo.temporal.io/server/service/frontend.(*DCRedirectionHandlerImpl).PollWorkflowTaskQueue\n\t/home/builder/temporal/service/frontend/dcRedirectionHandler.go:594\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler.func1\n\t/go/pkg/mod/go.temporal.io/api@v1.12.0/workflowservice/v1/service.pb.go:1516\ngo.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept.func1\n\t/home/builder/temporal/common/rpc/interceptor/retry.go:63\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/retry.go:67\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1135\ngo.temporal.io/server/common/rpc/interceptor.(*CallerInfoInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/caller_info.go:79\ngoogle.gola
ng.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*SDKVersionInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/sdk_version.go:69\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/rate_limit.go:86\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceRateLimitInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_rate_limit.go:91\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceCountLimitInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_count_limit.go:99\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).StateValidationIntercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_validator.go:132\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/authorization.(*interceptor).Interceptor\n\t/home/builder/temporal/common/authorization/interceptor.go:152\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:136\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:
1138\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.32.0/interceptor.go:325\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_logger.go:84\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).LengthValidationIntercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_validator.go:103\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/home/builder/temporal/common/rpc/grpc.go:137\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1138\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1140\ngo.temporal.io/api/workflowservice/v1._WorkflowService_PollWorkflowTaskQueue_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.12.0/workflowservice/v1/service.pb.go:1518\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1301\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go:1642\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.49.0/server.go
:938"}
Here are the error logs from the history service:
{"level":"info","ts":"2022-12-05T06:00:09.073Z","msg":"matching client encountered error","service":"history","error":"upstream request timeout","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:220"}
{"level":"error","ts":"2022-12-05T06:00:09.073Z","msg":"Fail to process task","shard-id":165,"address":"10.105.64.25:7234","component":"transfer-queue-processor","cluster-name":"active","wf-namespace-id":"d4ca24ab-8544-4220-953b-36dae1bb1032","wf-id":"AutoRemoveInactiveCustomerWorkflow","wf-run-id":"f58cfae5-1d87-45ed-b2b7-6a56685fed77","queue-task-id":1512046611,"queue-task-visibility-timestamp":"2022-12-05T06:00:00.575Z","queue-task-type":"TransferWorkflowTask","queue-task":{"NamespaceID":"d4ca24ab-8544-4220-953b-36dae1bb1032","WorkflowID":"AutoRemoveInactiveCustomerWorkflow","RunID":"f58cfae5-1d87-45ed-b2b7-6a56685fed77","VisibilityTimestamp":"2022-12-05T06:00:00.57577858Z","TaskID":1512046611,"TaskQueue":"KYC","ScheduledEventID":2,"Version":0},"wf-history-event-id":2,"error":"context deadline exceeded","lifecycle":"ProcessingFailed","logging-call-at":"lazy_logger.go:68","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/log.(*lazyLogger).Error\n\t/home/builder/temporal/common/log/lazy_logger.go:68\ngo.temporal.io/server/service/history/queues.(*executableImpl).HandleErr\n\t/home/builder/temporal/service/history/queues/executable.go:294\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:226\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:170\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:171\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:235\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:211"}
And these are from the worker service (failing to poll matching):
{"level":"warn","ts":"2022-12-05T06:39:07.952Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temp-temporal-worker-7996cf6f-fx4br@","WorkerType":"ActivityWorker","Error":"context deadline exceeded","logging-call-at":"internal_worker_base.go:298"}
{"level":"warn","ts":"2022-12-05T06:39:14.852Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temp-temporal-worker-7996cf6f-fx4br@","WorkerType":"WorkflowWorker","Error":"context deadline exceeded","logging-call-at":"internal_worker_base.go:298"}
{"level":"warn","ts":"2022-12-05T06:39:22.429Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temp-temporal-worker-7996cf6f-fx4br@","WorkerType":"ActivityWorker","Error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","logging-call-at":"internal_worker_base.go:298"}
{"level":"warn","ts":"2022-12-05T06:39:23.573Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-tq-scanner-taskqueue-0","WorkerID":"1@temp-temporal-worker-7996cf6f-fx4br@","WorkerType":"WorkflowWorker","Error":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","logging-call-at":"internal_worker_base.go:298"}
I tried to test the connectivity between the pods using grpcurl, and there seems to be no such issue:
root@test:/# grpcurl -plaintext temp-temporal-matching-headless:7235 list
Failed to list services: server does not support the reflection API
root@test:/# grpcurl -plaintext temp-temporal-frontend-headless:7233 list
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
temporal.api.operatorservice.v1.OperatorService
temporal.api.workflowservice.v1.WorkflowService
temporal.server.api.adminservice.v1.AdminService
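Note that the probes above go through the Kubernetes service names. To also exercise the pod-IP path, the same test can be repeated against a pod IP; the IP below is the one from the history logs above, and the grep pattern is just a guess at your pod naming:

```shell
# List the Temporal pods with their IPs (adjust the filter to your release name).
kubectl get pods -o wide | grep temporal

# Repeat the probe against a pod IP directly
# (example IP:port taken from the history service logs above).
grpcurl -plaintext 10.105.64.25:7234 list
```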
FYI: we define resource requests on the workloads and use no quotas on the namespace.
Please suggest what can be checked, and let us know if we are missing something.
Update (6 Dec 2022):
I found that there is connectivity between the server components when Kubernetes service names are used; however, when the pod IP addresses are used, the connections are terminated. Do the components use pod IPs to talk to each other? If so, can we change this behaviour by setting a domain name instead?
Thanks in advance.
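One thing I plan to check next, in case it helps anyone reproduce this: whether the headless services declare all of the internal ports the server components dial on each other (7233-7235, 7239), since my understanding (happy to be corrected) is that the sidecar builds its pod-IP routing from declared Service ports. Service and pod names below follow our release name "temp" and are otherwise illustrative:

```shell
# Show the ports declared on the headless services
# (service names as in our install; adjust to yours).
kubectl get svc temp-temporal-frontend-headless temp-temporal-history-headless \
  temp-temporal-matching-headless temp-temporal-worker-headless \
  -o custom-columns='NAME:.metadata.name,PORTS:.spec.ports[*].port'

# Ask the Envoy sidecar in one pod which clusters it knows for those ports
# (pod name suffix is illustrative; requires istioctl).
istioctl proxy-config cluster temp-temporal-frontend-<pod-suffix> | grep 723
```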