Activity stuck after activity timeout

We have a “map-reduce like” workflow and run long-running activities (~1–10 min). Activities send a heartbeat every ~15 sec. From time to time an activity gets stuck on Temporal (we get an activity timeout), and afterwards it hangs for > 30 minutes.

What may be the reasons and how can I debug it further?

Errors on temporal-history:

{"level":"error","ts":"2021-05-27T11:37:32.985Z","msg":"Fail to process task","service":"history","shard-id":122,"address":"10.132.4.163:7234","shard-item":"0xc004450380","component":"transfer-queue-processor","cluster-name":"active","shard-id":122,"queue-task-id":2446328406,"queue-task-visibility-timestamp":1622115452972079260,"xdc-failover-version":0,"queue-task-type":"TransferActivityTask","wf-namespace-id":"8dcbae60-9def-4627-8339-5e0f42be3a18","wf-id":"sev_test_7vGZ_8579","wf-run-id":"5a8e4e58-349b-4ed1-aaf2-70cb193a746a","error":"task queue shutting down","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:326","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:326\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:212\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
{"level":"error","ts":"2021-05-27T11:38:18.690Z","msg":"Fail to process task","service":"history","shard-id":122,"address":"10.132.4.163:7234","shard-item":"0xc004450380","component":"transfer-queue-processor","cluster-name":"active","shard-id":122,"queue-task-id":2446328541,"queue-task-visibility-timestamp":1622115498676359011,"xdc-failover-version":0,"queue-task-type":"TransferActivityTask","wf-namespace-id":"8dcbae60-9def-4627-8339-5e0f42be3a18","wf-id":"sev_test_7vGZ_8579","wf-run-id":"5a8e4e58-349b-4ed1-aaf2-70cb193a746a","error":"task queue shutting down","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:326","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:326\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:212\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
{"level":"error","ts":"2021-05-27T11:40:00.991Z","msg":"Fail to process task","service":"history","shard-id":38,"address":"10.132.4.163:7234","shard-item":"0xc000faea00","component":"transfer-queue-processor","cluster-name":"active","shard-id":38,"queue-task-id":2468352173,"queue-task-visibility-timestamp":1622115600975714926,"xdc-failover-version":0,"queue-task-type":"TransferActivityTask","wf-namespace-id":"8dcbae60-9def-4627-8339-5e0f42be3a18","wf-id":"sev_test_7vGZ_8577","wf-run-id":"e28d235b-bf71-461c-971c-6352329ada70","error":"task queue shutting down","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:326","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:326\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:212\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}

At the same time, on temporal-matching I see errors like:

{"level":"error","ts":"2021-05-27T11:38:18.689Z","msg":"uncategorized error","operation":"AddActivityTask","error":"task queue shutting down","logging-call-at":"telemetry.go:163","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/temporal/common/rpc/interceptor/telemetry.go:163\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:115\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1058\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:105\ngoogle.golang.org/grpc.chainUnaryServerInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1044\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_AddActivityTask_Handler\n\t/temporal/api/matchingservice/v1/service.pb.go:359\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1217\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1540\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:878"}
{"level":"error","ts":"2021-05-27T11:40:46.482Z","msg":"uncategorized error","operation":"AddActivityTask","error":"task queue shutting down","logging-call-at":"telemetry.go:163","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/temporal/common/rpc/interceptor/telemetry.go:163\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept\n\t/temporal/common/rpc/interceptor/telemetry.go:115\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1058\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:105\ngoogle.golang.org/grpc.chainUnaryServerInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1044\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_AddActivityTask_Handler\n\t/temporal/api/matchingservice/v1/service.pb.go:359\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1217\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:1540\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.37.0/server.go:878"}

What is the heartbeat timeout? If it is not set, the heartbeats are ignored.

The heartbeat timeout is set to `2 * time.Minute`.

Would you post the history of such a stuck workflow?

See the JSON export here.

Between 1:42 and 2:05, and again between 2:25 and 2:48, nothing happened.

Could you post the eventIds of those events? The timestamps in the JSON version of the history are not very friendly.

Most visible at 2848/2849; others at 2812/2813 and 2824/2825.

The best way is to grep for “activity timeout”: all those activities took an extremely long time between start and completion.

For example, the expected execution time looks like 2655/2656.

It has nothing to do with CPU throttling: both the Temporal CPU and the activity executor's CPU were almost idle at the time of execution.

I also see persistence errors on GetTasks even when no workflows are running. Could that be related, or is it expected?

I don’t see anything wrong on the Temporal side. The activity was scheduled at 1622115270 (11:34:30 AM), timed out due to TIMEOUT_TYPE_HEARTBEAT, was retried at 1622115573 (11:39:33 AM), and completed successfully at 1622119682 (12:48:02 PM).

I found 21 activities with attempt == 2, so there were quite a few retries that contributed to the latency.

  {
    "eventId": "12",
    "eventTime": {
      "seconds": "1622115270",
      "nanos": 334989911
    },
    "eventType": "EVENT_TYPE_ACTIVITY_TASK_SCHEDULED",
    "version": "0",
    "taskId": "2468347920",
    "activityTaskScheduledEventAttributes": {
      "activityId": "12",
      "activityType": {
        "name": "fcm-sending"
      },
      "namespace": "",
      "taskQueue": {
        "name": "pns-push",
        "kind": "TASK_QUEUE_KIND_NORMAL"
      },
      "header": {
        "fields": {}
      },
      "input": {
          ...
      },
      "scheduleToCloseTimeout": {
        "seconds": "0",
        "nanos": 0
      },
      "scheduleToStartTimeout": {
        "seconds": "36000",
        "nanos": 0
      },
      "startToCloseTimeout": {
        "seconds": "10800",
        "nanos": 0
      },
      "heartbeatTimeout": {
        "seconds": "120",
        "nanos": 0
      },
      "workflowTaskCompletedEventId": "4",
      "retryPolicy": {
        "nonRetryableErrorTypes": [],
        "initialInterval": {
          "seconds": "1",
          "nanos": 0
        },
        "backoffCoefficient": 2,
        "maximumInterval": {
          "seconds": "100",
          "nanos": 0
        },
        "maximumAttempts": 5
      }
    },
    "attributes": "activityTaskScheduledEventAttributes"
  },
....
  {
    "eventId": "2848",
    "eventTime": {
      "seconds": "1622115573",
      "nanos": 586149784
    },
    "eventType": "EVENT_TYPE_ACTIVITY_TASK_STARTED",
    "version": "0",
    "taskId": "2489319425",
    "activityTaskStartedEventAttributes": {
      "scheduledEventId": "12",
      "identity": "1@pusher-6cdb6586ff-s22zr@",
      "requestId": "c112606b-7516-4473-98e7-8c722461ba04",
      "attempt": 2,
      "lastFailure": {
        "message": "activity timeout",
        "source": "Server",
        "stackTrace": "",
        "timeoutFailureInfo": {
          "timeoutType": "TIMEOUT_TYPE_HEARTBEAT"
        },
        "failureInfo": "timeoutFailureInfo"
      }
    },
    "attributes": "activityTaskStartedEventAttributes"
  },
  {
    "eventId": "2849",
    "eventTime": {
      "seconds": "1622119682",
      "nanos": 235784251
    },
    "eventType": "EVENT_TYPE_ACTIVITY_TASK_COMPLETED",
    "version": "0",
    "taskId": "2489319426",
    "activityTaskCompletedEventAttributes": {
      "result": {
        "payloads": [
          {
            "metadata": {
              "encoding": "anNvbi9wbGFpbg=="
            },
            "data": "eyJieXRlc19za2lwcGVkIjoyMCwiYnl0ZXNfcmVhZCI6MjAxNDEzNzAyLCJzdWNjZXNzIjo5OTI2MDcsImZhaWxlZCI6MzEwNDgsInRvX3JldHJ5IjoyNiwiQWN0aXZpdHlJRCI6IjEyIiwiUmV0cnlObyI6MH0="
          }
        ]
      },
      "scheduledEventId": "12",
      "startedEventId": "2848",
      "identity": "1@pusher-6cdb6586ff-s22zr@"
    },
    "attributes": "activityTaskCompletedEventAttributes"
  },

Thanks for your input. Maybe the problem lies somewhere in the activity code (the activity is not expected to run that long, 11:39:33 AM to 12:48:02 PM). I’ll keep investigating and let you know if I find something.