Available ActivityWorker count dropping for timers

I have a dedicated worker service for a queue. There are two workflows:

  1. StartAllPromotionWorkflows is responsible for getting active promotions (e.g., weekly discounts) and starting workflows via signal. This includes signaling already-running workflows. This runs every 15 minutes.
  2. PromotionWorkflowV1 actually runs the promotion, which is outlined below.

PromotionWorkflowV1

  1. Load the data from the database.
  2. Sleep until the promotion starts.
  3. Update the database to start the promotion (e.g., change the status field).
  4. Sleep a few days to a couple years until the promotion ends.
  5. Update the row in the database.
  6. End.
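
For reference, the steps above can be sketched roughly as follows. This is a dependency-injected sketch so that it runs without the Temporal SDK: `Promotion`, `PromotionDeps`, and `setStatus` are hypothetical names, not our real code. In the actual workflow, `sleep` would come from `@temporalio/workflow` (or UpdatableTimer) and the activities from `proxyActivities`.

```typescript
// Hypothetical shapes: the real activity signatures are not shown in this post.
interface Promotion {
  id: number;
  startsAt: Date;
  endsAt: Date;
}

interface PromotionDeps {
  loadPromotion(id: number): Promise<Promotion>; // step 1 (activity)
  setStatus(id: number, status: "active" | "ended"): Promise<void>; // steps 3 and 5 (activity)
  sleep(ms: number): Promise<void>; // a timer in the real workflow
  now(): number; // current time in milliseconds
}

async function promotionWorkflowV1(id: number, deps: PromotionDeps): Promise<void> {
  // 1. Load the data from the database.
  const promo = await deps.loadPromotion(id);
  // 2. Sleep until the promotion starts (no wait if it already started).
  await deps.sleep(Math.max(0, promo.startsAt.getTime() - deps.now()));
  // 3. Mark the promotion as started.
  await deps.setStatus(id, "active");
  // 4. Sleep until the promotion ends (days to years).
  await deps.sleep(Math.max(0, promo.endsAt.getTime() - deps.now()));
  // 5. Update the row, then 6. end.
  await deps.setStatus(id, "ended");
}
```

Only steps 1, 3, and 5 run as activities and need an activity slot. As far as I understand, the sleeps are timers and should not hold a slot while the workflow waits.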

We have about 475 of these workflows running, and are seeing issues with activities not starting quickly after being scheduled (e.g., many hours later, if at all). We are currently using UpdatableTimer, but also saw this issue when using standard sleep.

CPU utilization is rarely above 1%. We’ve increased max concurrent activities from 100 to 200, and now 1000. We get to zero available activity workers within a couple hours.

How can we resolve this? Continuing to increase maxConcurrentActivityTaskExecutions seems wrong, and a great way to end up with another incident when we forget to raise it again.


Could you post an example history of PromotionWorkflowV1 with such delayed activity? You can remove all the payloads.

Here you go: event history (GitHub gist) for debugging https://community.temporal.io/t/available-activityworker-count-dropping-for-timers/13367.

Which specific activity had the issue?

Nearly all activities are blocked when we have no available workers. In this specific instance, loadPromotion was blocked.

This usually means that workers consume all the activity slots without releasing them back.

This activity is failing with a StartToClose timeout. See "attempt": 5 in the ActivityTaskStarted event. So the time is spent in retries.

{
      "eventId": "6",
      "eventTime": "2024-08-30T00:30:02.786733562Z",
      "eventType": "EVENT_TYPE_ACTIVITY_TASK_SCHEDULED",
      "version": "1166",
      "taskId": "292669809",
      "activityTaskScheduledEventAttributes": {
        "activityId": "1",
        "activityType": {
          "name": "loadPromotion"
        },
        "taskQueue": {
          "name": "promotions",
          "kind": "TASK_QUEUE_KIND_NORMAL"
        },
        "header": {
          "fields": {
            "_tracer-data": {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg=="
              },
              "data": {
                "traceparent": "00-eed51dfb02d2ecc0247440375186b678-a89cee581f273550-01"
              }
            }
          }
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "encoding": "anNvbi9wbGFpbg==",
                "format": "ZXh0ZW5kZWQ="
              },
              "data": "2306"
            }
          ]
        },
        "scheduleToCloseTimeout": "0s",
        "scheduleToStartTimeout": "0s",
        "startToCloseTimeout": "60s",
        "heartbeatTimeout": "0s",
        "workflowTaskCompletedEventId": "5",
        "retryPolicy": {
          "initialInterval": "1s",
          "backoffCoefficient": 2,
          "maximumInterval": "100s",
          "nonRetryableErrorTypes": [
            "InvalidPromotionStateTransitionError"
          ]
        },
        "useWorkflowBuildId": true
      }
    },
    {
      "eventId": "7",
      "eventTime": "2024-08-30T00:35:37.918313613Z",
      "eventType": "EVENT_TYPE_ACTIVITY_TASK_STARTED",
      "version": "1166",
      "taskId": "292677819",
      "activityTaskStartedEventAttributes": {
        "scheduledEventId": "6",
        "identity": "1@temporal-worker-promotions-575c66dff6-5s6vd",
        "requestId": "820439d5-9767-447b-bf29-fd8d0b9f5c29",
        "attempt": 5,
        "lastFailure": {
          "message": "activity StartToClose timeout",
          "source": "Server",
          "timeoutFailureInfo": {
            "timeoutType": "TIMEOUT_TYPE_START_TO_CLOSE"
          }
        },
        "workerVersion": {
          "buildId": "@temporalio/worker@1.9.1+5d87e8ba6eeb7e208d8a791731a1547f48e138ad60bfd6575bb1fce9359ed8b0"
        }
      }
    },
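
As a rough check on how much of the delay the retries explain, here is a small calculation using the timeout and retry policy from the history above. It assumes each of the first four attempts ran for the full 60s StartToClose timeout and that the backoff before each retry is initialInterval × backoffCoefficient^(n-1), capped at maximumInterval. The helper names are made up for this sketch.

```typescript
interface RetryPolicy {
  initialIntervalSec: number;
  backoffCoefficient: number;
  maximumIntervalSec: number;
}

// Minimum seconds that pass before attempt number `attempt` can start,
// assuming every earlier attempt used its whole start-to-close timeout.
function minSecondsBeforeAttempt(
  attempt: number,
  startToCloseSec: number,
  policy: RetryPolicy,
): number {
  let total = 0;
  let interval = policy.initialIntervalSec;
  for (let failed = 1; failed < attempt; failed++) {
    total += startToCloseSec + Math.min(interval, policy.maximumIntervalSec);
    interval *= policy.backoffCoefficient;
  }
  return total;
}

// Event times carry nanoseconds; trim to milliseconds before parsing.
function toMillis(iso: string): number {
  return Date.parse(iso.replace(/(\.\d{3})\d*Z$/, "$1Z"));
}

const scheduledMs = toMillis("2024-08-30T00:30:02.786733562Z"); // eventId 6
const startedMs = toMillis("2024-08-30T00:35:37.918313613Z"); // eventId 7, attempt 5
const elapsedSec = (startedMs - scheduledMs) / 1000;

const retrySec = minSecondsBeforeAttempt(5, 60, {
  initialIntervalSec: 1,
  backoffCoefficient: 2,
  maximumIntervalSec: 100,
});

console.log(`elapsed: ${elapsedSec}s, timeouts plus backoff: ${retrySec}s`);
console.log(`unaccounted: ${(elapsedSec - retrySec).toFixed(1)}s`);
```

Four timed-out attempts plus backoff account for 255 of the roughly 335 seconds between the two events. The remainder is presumably time the attempts spent waiting to be picked up by a worker, though the history alone does not prove that.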

Timeouts are in place so we have something to alert on. Without a timeout, we never know an activity is stuck until it finally starts and emits a schedule-to-start latency metric value.

How can we ensure workers are properly released?

I don’t know. It looks like your activity function never returns.

How can we determine which activity never returns?

I’m a bit confused because the workflows behave fine when the pods are restarted. Nearly all workflows are simply sleeping.

The only “active” workflows/activities are the signalers that run every 15 minutes, and complete in under 5s.

When pods are restarted, all the workers start with all activity slots available. Over time, activities that don’t return consume more and more slots. Workers don’t poll for tasks when there are no slots available, so they “disappear” from the service.
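
To illustrate why raising the limit only delays the problem, here is a toy model of a slot pool. It is not the SDK's implementation, and the numbers (10 activities started per cycle, 2 of which never return) are invented for the example.

```typescript
// Toy slot pool: a slot is taken when an activity starts and is only
// given back when the activity function returns.
class SlotPool {
  private inUse = 0;
  constructor(private readonly capacity: number) {}

  get available(): number {
    return this.capacity - this.inUse;
  }

  tryAcquire(): boolean {
    if (this.inUse >= this.capacity) return false;
    this.inUse++;
    return true;
  }

  release(): void {
    if (this.inUse > 0) this.inUse--;
  }
}

// Number of cycles until no slot is free, when the first `leakedPerCycle`
// activities of every cycle hang and the rest return normally.
function cyclesUntilExhausted(
  capacity: number,
  startedPerCycle: number,
  leakedPerCycle: number,
): number {
  if (leakedPerCycle <= 0 || startedPerCycle <= 0) return Infinity;
  const pool = new SlotPool(capacity);
  let cycles = 0;
  while (pool.available > 0) {
    cycles++;
    for (let i = 0; i < startedPerCycle; i++) {
      if (!pool.tryAcquire()) break; // no free slot: the worker stops taking tasks
      const returns = i >= leakedPerCycle;
      if (returns) pool.release();
    }
  }
  return cycles;
}

console.log(cyclesUntilExhausted(100, 10, 2)); // 50
console.log(cyclesUntilExhausted(1000, 10, 2)); // 500
```

Ten times the slots buys ten times the cycles, but the pool still reaches zero. Only fixing the activities that never return stops the leak.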

In Java, you would take a “thread dump” to see where all the threads are blocked. I’m not sure if TypeScript has a similar feature.
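
One possible substitute for a thread dump is to track in-flight activity calls yourself. The sketch below is plain TypeScript and not a Temporal API: `trackInFlight` and `stuckCalls` are hypothetical helpers that wrap each activity function and report calls that have been running longer than a threshold.

```typescript
interface InFlightCall {
  activity: string;
  startedAtMs: number;
}

const inFlight = new Map<number, InFlightCall>();
let nextCallId = 0;

// Wraps an async activity function so every call is registered while it
// runs and removed when it returns or throws.
function trackInFlight<A extends unknown[], R>(
  activity: string,
  fn: (...args: A) => Promise<R>,
  now: () => number = Date.now,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const callId = nextCallId++;
    inFlight.set(callId, { activity, startedAtMs: now() });
    try {
      return await fn(...args);
    } finally {
      inFlight.delete(callId);
    }
  };
}

// Calls that started more than `olderThanMs` ago and have not finished yet.
function stuckCalls(olderThanMs: number, nowMs: number = Date.now()): InFlightCall[] {
  return Array.from(inFlight.values()).filter(
    (call) => nowMs - call.startedAtMs > olderThanMs,
  );
}
```

You could wrap each function before registering it with the worker, for example `loadPromotion: trackInFlight("loadPromotion", loadPromotion)`, and log `stuckCalls(60000)` on an interval. Any activity that keeps showing up there is a candidate for the one that never returns, for instance because it awaits a promise that never settles.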