I have a dedicated worker service for a queue. There are two workflows:
StartAllPromotionWorkflows is responsible for getting the active promotions (e.g., weekly discounts) and starting their workflows via signal-with-start, which also signals workflows that are already running. It runs every 15 minutes.
PromotionWorkflowV1 actually runs the promotion, which is outlined below.
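For context, the scheduling side looks roughly like this. This is a minimal sketch assuming the TypeScript SDK and that the signal-with-start call happens in an activity (or other code) that has a Client; the workflowId scheme, task queue name, signal name, and loadActivePromotions helper are all assumptions, not our real code:

```typescript
import { Client } from '@temporalio/client';
import { promotionWorkflowV1 } from './workflows';
import { loadActivePromotions } from './db';

export async function startAllPromotionWorkflows(client: Client): Promise<void> {
  // Hypothetical helper that returns the active promotions (weekly discounts, etc.).
  const promotions = await loadActivePromotions();

  for (const promo of promotions) {
    // signalWithStart starts the workflow if it isn't running yet and only
    // delivers the signal if it already is, so re-running this every
    // 15 minutes is idempotent per workflowId.
    await client.workflow.signalWithStart(promotionWorkflowV1, {
      workflowId: `promotion-${promo.id}`,
      taskQueue: 'promotions',
      args: [promo.id],
      signal: 'promotionUpdated', // hypothetical signal name
      signalArgs: [promo.startsAt, promo.endsAt],
    });
  }
}
```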
PromotionWorkflowV1
1. Load the data from the database.
2. Sleep until the promotion starts.
3. Update the database to start the promotion (e.g., change the status field).
4. Sleep anywhere from a few days to a couple of years until the promotion ends.
5. Update the row in the database.
6. End.
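In code, that outline is roughly the following. This is a minimal sketch in the TypeScript SDK using plain sleep(); the signal handling / UpdatableTimer piece is left out, and the activity names, field names, and timeout value are assumptions:

```typescript
import { proxyActivities, sleep } from '@temporalio/workflow';
import type * as activities from './activities';

const { loadPromotion, markPromotionStarted, markPromotionEnded } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '1 minute',
  });

export async function promotionWorkflowV1(promotionId: string): Promise<void> {
  // 1. Load the data from the database.
  const promo = await loadPromotion(promotionId);

  // 2. Sleep until the promotion starts. Date.now() is deterministic inside
  //    workflow code, so it is safe to compute the delay here.
  const msUntilStart = new Date(promo.startsAt).getTime() - Date.now();
  if (msUntilStart > 0) {
    await sleep(msUntilStart);
  }

  // 3. Update the database to start the promotion (e.g., change the status field).
  await markPromotionStarted(promotionId);

  // 4. Sleep (days to a couple of years) until the promotion ends.
  const msUntilEnd = new Date(promo.endsAt).getTime() - Date.now();
  if (msUntilEnd > 0) {
    await sleep(msUntilEnd);
  }

  // 5. Update the row in the database.
  await markPromotionEnded(promotionId);
}
```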
We have about 475 of these workflows running and are seeing activities that don’t start promptly after being scheduled (sometimes many hours later, if at all). We are currently using UpdatableTimer, but we saw the same issue with a standard sleep.
CPU utilization is rarely above 1%. We’ve increased max concurrent activities from 100 to 200, and now to 1,000, yet we still reach zero available activity slots within a couple of hours.
How can we resolve this? Continuing to increase maxConcurrentActivityTaskExecutions seems wrong, and it’s a great way to end up with another incident the next time we forget to raise it.
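For reference, this is the knob we keep raising. A sketch of the worker setup, assuming the TypeScript SDK; the task queue name and module paths are assumptions, and 1000 is just the value we are currently at, not a recommendation:

```typescript
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

async function run(): Promise<void> {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'promotions',
    // Raised from the default 100 to 200, then 1000; the slots still fill up
    // within a couple of hours because stuck activities never release them.
    maxConcurrentActivityTaskExecutions: 1000,
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```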
Timeouts are in place so we have something to alert on. Without a timeout, we would never know an activity is stuck until it finally starts and emits a schedule-to-start latency metric.
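The timeout setup looks roughly like this (a sketch, assuming the TypeScript SDK; the durations and activity names are illustrative, not our real values):

```typescript
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { markPromotionStarted, markPromotionEnded } = proxyActivities<typeof activities>({
  // Fires if no worker picks the task up in time, so a stuck-in-queue activity
  // surfaces as a timeout we can alert on instead of waiting for it to
  // eventually start and report its schedule-to-start latency.
  scheduleToStartTimeout: '10 minutes',
  // Bounds the attempt itself once a worker does pick it up.
  startToCloseTimeout: '1 minute',
});
```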
When pods are restarted, every worker starts with all of its activity slots available. Over time, activities that never return consume more and more slots. A worker stops polling for activity tasks once it has no free slots, so it effectively “disappears” from the service.