Why am I getting ActivityTaskTimedOut?

Hello, community. We recently started using Temporal in production and unfortunately found that most activities end with ActivityTaskTimedOut. There is no error or exception, just that the activity did not complete within 2 hours. We have Temporal deployed in Kubernetes with the default Helm chart from temporalio/helm-charts. There are 3 instances of each of the worker, frontend, and other Temporal services. We use Postgres as our database. The database is located on one server and Temporal itself on another. We are not ruling out network problems, but we would like to hear your advice on what could be wrong.

Can you post the ActivityTaskScheduled event? I want to see what timeouts are configured for the activity. I'd also like to see the ActivityTaskStarted.attempt field for the failed activity.

{
    "eventId": "7",
    "eventTime": "2022-09-02T16:41:01.186539657Z",
    "eventType": "ActivityTaskTimedOut",
    "version": "0",
    "taskId": "17825830",
    "activityTaskTimedOutEventAttributes": {
      "failure": {
        "message": "activity ScheduleToClose timeout",
        "source": "Server",
        "stackTrace": "",
        "cause": null,
        "timeoutFailureInfo": {
          "timeoutType": "ScheduleToClose",
          "lastHeartbeatDetails": null
        }
      },
      "scheduledEventId": "5",
      "startedEventId": "6",
      "retryState": "NonRetryableFailure"
    }
  },
{
    "eventId": "6",
    "eventTime": "2022-09-02T14:41:01.308559798Z",
    "eventType": "ActivityTaskStarted",
    "version": "0",
    "taskId": "17825829",
    "activityTaskStartedEventAttributes": {
      "scheduledEventId": "5",
      "identity": "payments:4f8881cd-a9ad-4e84-bdb0-941045565bdf",
      "requestId": "6291d31c-ee21-43fd-9468-e939abd13cf8",
      "attempt": 1,
      "lastFailure": null
    }
  },

What about ActivityTaskScheduled?

Filed a feature request to include the activity failure in the timeout message as a cause.

{
    "eventId": "5",
    "eventTime": "2022-09-02T14:41:01.174819696Z",
    "eventType": "ActivityTaskScheduled",
    "version": "0",
    "taskId": "14680491",
    "activityTaskScheduledEventAttributes": {
      "activityId": "5",
      "activityType": {
        "name": "AcknowledgedPaymentActivity: acknowledgePayment"
      },
      "taskQueue": {
        "name": "payments",
        "kind": "Normal"
      },
      "header": null,
      "input": {
        "payloads": [
          {
            "metadata": {
              "encoding": "anNvbi9wcm90b2J1Zg=="
            },
            "data": ""
          }
        ]
      },
      "scheduleToCloseTimeout": "7200s",
      "scheduleToStartTimeout": "7200s",
      "startToCloseTimeout": "7200s",
      "heartbeatTimeout": "0s",
      "workflowTaskCompletedEventId": "4",
      "retryPolicy": {
        "initialInterval": "1s",
        "backoffCoefficient": 2,
        "maximumInterval": "100s",
        "maximumAttempts": 20,
        "nonRetryableErrorTypes": []
      }
    }
  },

The activity specifies a 2-hour StartToClose timeout. So if the worker processing it fails during execution for whatever reason, the activity is not retried until 2 hours have passed since it started. Because the ScheduleToClose timeout is also 2 hours, by then there is no time left for a retry. So any intermittent failure leads to an activity timeout after 2 hours.
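To make the arithmetic concrete, here is a rough sketch in plain PHP. The function `retryBudgetAfterCrash` is a hypothetical helper written for illustration, not part of any SDK, and it ignores the small delay between scheduling and start. It assumes no heartbeating, so a crashed attempt is only noticed when StartToClose expires.

```php
<?php
// Hypothetical helper: seconds left for retries once a crashed attempt is detected.
// Without heartbeats, the server only notices the crash when StartToClose expires.
function retryBudgetAfterCrash(int $scheduleToClose, int $startToClose, int $startedAt = 0): int
{
    $detectedAt = $startedAt + $startToClose; // attempt times out here
    return max(0, $scheduleToClose - $detectedAt);
}

// Timeouts from the ActivityTaskScheduled event above: both 7200s.
echo retryBudgetAfterCrash(7200, 7200), "\n"; // 0, so no retry is possible

// Assumed 5-minute StartToClose under the same 2-hour ScheduleToClose.
echo retryBudgetAfterCrash(7200, 300), "\n"; // 6900 seconds left for retries
```

With both timeouts at 7200s the retry policy never gets a chance to act, which matches the single attempt seen in the history.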

The solution is to specify a meaningful StartToClose timeout that is as long as the longest expected activity attempt. If the activity is indeed long-running, also specify a HeartbeatTimeout to detect worker crashes faster.
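As a sketch of what that could look like with the PHP SDK's ActivityOptions: the 5-minute and 30-second values below are assumptions chosen for illustration, so pick values that match how long one attempt of your activity really takes.

```php
<?php
use Carbon\CarbonInterval;
use Temporal\Activity\ActivityOptions;
use Temporal\Common\RetryOptions;

// Sketch only: the 5-minute and 30-second values are assumptions, not recommendations.
$options = ActivityOptions::new()
    // Overall budget for the activity, including all retries.
    ->withScheduleToCloseTimeout(CarbonInterval::hours(2))
    // Upper bound for a single attempt; a stuck or crashed attempt is retried after this.
    ->withStartToCloseTimeout(CarbonInterval::minutes(5))
    // Only has an effect if the activity code actually heartbeats.
    ->withHeartbeatTimeout(CarbonInterval::seconds(30))
    ->withRetryOptions(
        RetryOptions::new()
            ->withMaximumAttempts(20)
            ->withBackoffCoefficient(2.0)
    );
```

For the heartbeat timeout to be useful, the activity implementation would need to call `Activity::heartbeat()` periodically while it works. If it never heartbeats, the attempt times out after 30 seconds regardless of progress.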

BTW, don’t specify ScheduleToStartTimeout.

See the Activities documentation, which explains the different timeout types.

I would also recommend “The 4 types of activity timeouts” blog post.

The thing is, I’m not specifying the ScheduleToStart timeout, just the ScheduleToClose timeout. Perhaps it is a bug in php-sdk, but I’m not sure. I will check, thank you very much.

I see. It is benign as an activity cannot run longer than ScheduleToClose. But you do want to specify StartToClose.

I’m not sure I understood you properly, but here is the code I have:

$options = ActivityOptions::new()
    ->withScheduleToCloseTimeout(CarbonInterval::hours(2))
    ->withTaskQueue(WorkflowContext::TASK_QUEUE)
    ->withRetryOptions(
        RetryOptions::new()
            ->withMaximumAttempts(20)
            ->withBackoffCoefficient(2.0)
    );

As you can see, I only specified ScheduleToCloseTimeout. Why are the other timeouts set too? Is this a Temporal feature or a bug in php-sdk? If neither, should I set StartToCloseTimeout instead of ScheduleToCloseTimeout, or specify both, with StartToCloseTimeout larger than ScheduleToCloseTimeout?

But I do see a pattern. Activities configured with the timeouts below never fail with ActivityTaskTimedOut:

->withScheduleToStartTimeout(CarbonInterval::hour())
->withStartToCloseTimeout(CarbonInterval::hour())

When you specify only ScheduleToClose, StartToClose defaults to the same value. I recommend never doing this unless you don’t want your activity to ever retry.