We have a “map-reduce-like” workflow that runs long-running activities (~1–10 min). The activities send a heartbeat every ~15 sec. From time to time an activity gets stuck on Temporal (we get an activity timeout), and afterwards it hangs for > 30 minutes.
What could be causing this, and how can I debug it further?
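For context, the shape of the workflow and the heartbeating is roughly the following (a simplified Go SDK sketch; `ProcessChunk`, `doWork`, the chunk IDs and the exact timeout values are illustrative, not our real code):

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/workflow"
)

// ProcessChunk stands in for one of our long-running (~1-10 min) activities.
// It heartbeats every ~15s while the real work runs in the background.
func ProcessChunk(ctx context.Context, chunkID string) error {
	done := make(chan error, 1)
	go func() { done <- doWork(ctx, chunkID) }()

	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			activity.RecordHeartbeat(ctx, chunkID) // ~15s heartbeat
		case err := <-done:
			return err
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// doWork is a placeholder for the real per-chunk processing.
func doWork(ctx context.Context, chunkID string) error {
	select {
	case <-time.After(2 * time.Minute):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// The "map" step of the workflow fans out activities roughly like this.
func mapStep(ctx workflow.Context, chunkIDs []string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 15 * time.Minute, // illustrative values
		HeartbeatTimeout:    time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	futures := make([]workflow.Future, 0, len(chunkIDs))
	for _, id := range chunkIDs {
		futures = append(futures, workflow.ExecuteActivity(ctx, ProcessChunk, id))
	}
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```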
Errors on temporal-history:
{"level":"error","ts":"2021-05-27T11:37:32.985Z","msg":"Fail to process task","service":"history","shard-id":122,"address":"10.132.4.163:7234","shard-item":"0xc004450380","component":"transfer-queue-processor","cluster-name":"active","shard-id":122,"queue-task-id":2446328406,"queue-task-visibility-timestamp":1622115452972079260,"xdc-failover-version":0,"queue-task-type":"TransferActivityTask","wf-namespace-id":"8dcbae60-9def-4627-8339-5e0f42be3a18","wf-id":"sev_test_7vGZ_8579","wf-run-id":"5a8e4e58-349b-4ed1-aaf2-70cb193a746a","error":"task queue shutting down","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:326","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:326\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:212\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
{"level":"error","ts":"2021-05-27T11:38:18.690Z","msg":"Fail to process task","service":"history","shard-id":122,"address":"10.132.4.163:7234","shard-item":"0xc004450380","component":"transfer-queue-processor","cluster-name":"active","shard-id":122,"queue-task-id":2446328541,"queue-task-visibility-timestamp":1622115498676359011,"xdc-failover-version":0,"queue-task-type":"TransferActivityTask","wf-namespace-id":"8dcbae60-9def-4627-8339-5e0f42be3a18","wf-id":"sev_test_7vGZ_8579","wf-run-id":"5a8e4e58-349b-4ed1-aaf2-70cb193a746a","error":"task queue shutting down","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:326","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:326\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:212\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
{"level":"error","ts":"2021-05-27T11:40:00.991Z","msg":"Fail to process task","service":"history","shard-id":38,"address":"10.132.4.163:7234","shard-item":"0xc000faea00","component":"transfer-queue-processor","cluster-name":"active","shard-id":38,"queue-task-id":2468352173,"queue-task-visibility-timestamp":1622115600975714926,"xdc-failover-version":0,"queue-task-type":"TransferActivityTask","wf-namespace-id":"8dcbae60-9def-4627-8339-5e0f42be3a18","wf-id":"sev_test_7vGZ_8577","wf-run-id":"e28d235b-bf71-461c-971c-6352329ada70","error":"task queue shutting down","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:326","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:326\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:212\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
At the same time, on temporal-matching I see errors like:
{"level":"error","ts":"2021-05-27T11:38:18.689Z","msg":"uncategorized error","operation":"AddActivityTask","error":"task queue shutting down","logging-call-at":"telemetry.go:163","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/temporal/common/rpc/interceptor/telemetry.go:163\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:115\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1058\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:105\ngoogle.golang.org/grpc.chainUnaryServerInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1044\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_AddActivityTask_Handler\n\t/temporal/api/matchingservice/v1/service.pb.go:359\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1217\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1540\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:878"}
{"level":"error","ts":"2021-05-27T11:40:46.482Z","msg":"uncategorized error","operation":"AddActivityTask","error":"task queue shutting down","logging-call-at":"telemetry.go:163","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/temporal/common/rpc/interceptor/telemetry.go:163\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:115\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1058\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:105\ngoogle.golang.org/grpc.chainUnaryServerInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1044\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_AddActivityTask_Handler\n\t/temporal/api/matchingservice/v1/service.pb.go:359\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1217\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1540\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:878"}