🐛 Issue: Workflow Task Timeout When Spawning Many Async Child Workflows

Hello everyone,

I’m encountering workflow task timeouts when using Temporal Promises to spawn a large number of asynchronous operations (activities or child workflows). Even after implementing a rate limiter, several child workflows still get stuck in the WORKFLOW_TASK_TIMED_OUT state.

Use Case

When the workflow starts, it triggers Activity A1, which fetches records from a downstream system.
For every record fetched, I start either:

  • a new activity, or

  • a new child workflow,

both asynchronously using Temporal Promises.

Example:
If Activity A1 retrieves 1000 records, I spin up 1000 async executions to process them in parallel.

Issue:
When these async tasks are being produced, the workflow task production rate becomes too high, and I often encounter workflow task timeouts.

Many child workflows get stuck in the “Workflow Task Timed Out” state. My assumption is that the worker is busy creating multiple tasks quickly, and it isn’t able to acknowledge these tasks in time — leading to timeouts.

To mitigate this, I implemented a custom rate limiter to allow only a few async executions (e.g., 50–100 at a time). After reaching that limit, the workflow sleeps (Workflow.sleep()) for a few seconds to give existing tasks some buffer time to finish.

However, even with this rate limiter in place, around 87 child workflows still get stuck out of 1500 parallel executions. Even with 500 parallel executions, I saw 100+ child workflows stuck in the timeout state.

I’m aware of the following options:

setMaxConcurrentActivityExecutionSize  
setMaxConcurrentWorkflowTaskExecutionSize  
setMaxWorkerActivitiesPerSecond  
setMaxTaskQueueActivitiesPerSecond

However, based on logs and workflow history (please refer to the screenshot), the timeout occurs while creating the child workflow itself, before any activity is executed.
So this seems to happen during workflow task scheduling, not during activity execution.

Is there a recommended approach or best practice to avoid workflow task timeouts when spawning a large number of async child workflows or activities?

Are there alternative patterns for high-throughput workflow orchestration in newer Temporal versions?

Environment

Temporal Version: 1.21
SDK Version: 1.17
Language: Java

@maxim @tihomir

Thanks
Vishal

Hi Vishal,

Scheduling 1000 concurrent activities/child workflows in the same workflow task can cause the gRPC payload to exceed the default 4 MB limit (from the relevant documentation on gRPC: "gRPC has a limit of 4 MB for each message received").
If this were the case, the task would be rejected before it reached the Temporal server, and the server would time out the workflow task.
From v1.31, this error is not retried by the SDK, and the task is marked as failed by the SDK.
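
As a rough back-of-the-envelope check of the 4 MB point, the sketch below multiplies the number of commands in one workflow task by an assumed per-command payload size. The class name, the method names, and the 5 KiB figure are made up for illustration, and the estimate ignores protobuf and header overhead, so treat it as an optimistic bound rather than an exact measurement.

```java
public class PayloadBudget {
    // Default gRPC per-message receive limit: 4 MB.
    public static final long GRPC_LIMIT_BYTES = 4L * 1024 * 1024;

    /** Rough size of one workflow task response carrying `commands` commands. */
    public static long estimateBytes(int commands, long bytesPerCommand) {
        return (long) commands * bytesPerCommand;
    }

    /** Largest command count that stays within the limit for a given per-command size. */
    public static int maxCommandsPerTask(long bytesPerCommand) {
        if (bytesPerCommand <= 0) {
            throw new IllegalArgumentException("bytesPerCommand must be positive");
        }
        return (int) (GRPC_LIMIT_BYTES / bytesPerCommand);
    }

    public static void main(String[] args) {
        long perCommand = 5 * 1024; // hypothetical: 5 KiB of input per child workflow
        long total = estimateBytes(1000, perCommand);
        System.out.println("1000 commands: " + total + " bytes, over limit: "
                + (total > GRPC_LIMIT_BYTES));
        System.out.println("max commands per task: " + maxCommandsPerTask(perCommand));
    }
}
```

With inputs of that assumed size, 1000 child workflow starts in one task would not fit, while a few hundred would.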

Another possibility is that the task takes more than 10 seconds to run; this includes workflow replay, payload serialization (and encryption), and network latency.

Can you check the workflow_task_execution_latency metric?

One approach is to use a tree structure: spawn fewer than 1k (ideally closer to 100-200, depending on the payload size) child workflows, which then themselves spawn child workflows.
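
To make the tree idea concrete, here is a minimal plain-Java sketch of the planning step only. It contains no Temporal SDK calls; `FanOutPlanner`, `partition`, and `levelsNeeded` are hypothetical names. The assumption is that each chunk would be passed as input to one intermediate child workflow, which then starts its own children for the records in that chunk.

```java
import java.util.ArrayList;
import java.util.List;

public class FanOutPlanner {
    /** Splits records into consecutive chunks of at most maxFanOut items. */
    public static <T> List<List<T>> partition(List<T> records, int maxFanOut) {
        if (maxFanOut <= 0) {
            throw new IllegalArgumentException("maxFanOut must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += maxFanOut) {
            int end = Math.min(i + maxFanOut, records.size());
            chunks.add(new ArrayList<>(records.subList(i, end)));
        }
        return chunks;
    }

    /** Tree depth needed so that no workflow starts more than maxFanOut children. */
    public static int levelsNeeded(int totalRecords, int maxFanOut) {
        if (maxFanOut < 2) {
            throw new IllegalArgumentException("maxFanOut must be at least 2");
        }
        int levels = 1;
        long capacity = maxFanOut;
        while (capacity < totalRecords) {
            capacity *= maxFanOut;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 1500; i++) {
            records.add(i);
        }
        List<List<Integer>> chunks = partition(records, 100);
        System.out.println("intermediate children: " + chunks.size());
        System.out.println("levels: " + levelsNeeded(records.size(), 100));
    }
}
```

For 1500 records and a fan-out of 100, the parent would start 15 intermediate children instead of 1500 leaf children.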

Another approach is to schedule the child workflows/activities in batches; see these examples (the iterator pattern or sliding window should work for you).
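
For the sliding window variant, the bookkeeping can be modelled in plain Java as below. This is only a sketch of the counting logic, not workflow code: `SlidingWindow` and its methods are hypothetical names, and the assumption is that a real workflow would update these counters from its own deterministic code and block on a condition such as `canStart()` rather than sleeping for a fixed interval.

```java
public class SlidingWindow {
    private final int maxInFlight;
    private int started;
    private int completed;

    public SlidingWindow(int maxInFlight) {
        if (maxInFlight <= 0) {
            throw new IllegalArgumentException("maxInFlight must be positive");
        }
        this.maxInFlight = maxInFlight;
    }

    /** True while another child may be started without exceeding the window. */
    public boolean canStart() {
        return inFlight() < maxInFlight;
    }

    public void onStarted() {
        if (!canStart()) {
            throw new IllegalStateException("window is full");
        }
        started++;
    }

    public void onCompleted() {
        if (completed >= started) {
            throw new IllegalStateException("no child in flight");
        }
        completed++;
    }

    public int inFlight() {
        return started - completed;
    }

    /** True once every started child has completed; only then should the parent return. */
    public boolean isDrained() {
        return started == completed;
    }

    public static void main(String[] args) {
        SlidingWindow window = new SlidingWindow(2);
        for (int child = 0; child < 5; child++) {
            while (!window.canStart()) {
                window.onCompleted(); // stands in for "wait until one child finishes"
            }
            window.onStarted();
        }
        while (!window.isDrained()) {
            window.onCompleted();
        }
        System.out.println("in flight at the end: " + window.inFlight());
    }
}
```

The difference from a fixed "start N, then sleep" batch is that a new child is started as soon as one completes, and the parent only finishes once the window is drained.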

However, based on logs and workflow history (please refer to the screenshot), the timeout occurs while creating the child workflow itself, before any activity is executed.
So this seems to happen during workflow task scheduling, not during activity execution.

I assume those workflows are the ones created by the parent, which spawns 1000 child workflows. I am not sure this is related; can you share the workflow history of one of those workflows?

Hi @antonio.perez ,

We had this issue of the gRPC payload size exceeding 4 MB earlier and now have compression logic to mitigate it.

Current issue:

My parent workflow spawns the child workflows, and each child may or may not spawn activities or further child workflows (this depends purely on the path the child workflow takes; what the child executes doesn’t matter in this case). What is certain is that the parent workflow has to spawn more than 700-800 child workflows, and that is when I encounter this issue.

The rate limiter I mentioned earlier is a custom sliding window mechanism: a counter determines whether the parent workflow can continue spawning child workflows or has to wait.

But even with this sliding window mechanism, the workflow task occasionally gets stuck or times out, and not all child workflows complete. The maximum window size is 90, and the workflow waits (wait() or sleep()) for 10 seconds between windows.

Since the child workflows are stuck, the parent doesn’t wait for them to complete. The parent completes while child workflows are still running but stuck with the timeout.

Screenshot of my Parent workflow.

Attaching one of the child workflow histories.

{
  "events": [
    {
      "eventId": "1",
      "eventTime": "2025-11-19T09:52:40.668862182Z",
      "eventType": "WorkflowExecutionStarted",
      "version": "0",
      "taskId": "1081593",
      "workerMayIgnore": false,
      "workflowExecutionStartedEventAttributes": {
        "workflowType": {
          "name": "IContainerTemporalWorkflow"
        },
        "parentWorkflowNamespace": "default",
        "parentWorkflowNamespaceId": "e2b20f62-64b7-42ef-94dd-2a642320dfd2",
        "parentWorkflowExecution": {
          "workflowId": "13bf910a-3e05-4907-87f5-1da34970c848",
          "runId": "d0763f51-453f-4093-a35a-e8302e38c063"
        },
        "parentInitiatedEventId": "3452",
        "taskQueue": {
          "name": "workflow_sandbox_taskqueue_U-3G4327JUB2XES",
          "kind": "Normal"
        },
        "input": {
          "payloads": [
            {
              "metadata": {
                "converterType": "Y29tLmU1LnBsYXRmb3JtLmNvcmUuY29udmVydGVyLkNvbnZlcnRlcldyYXBwZXI=",
                "encoding": "anNvbi9wbGFpbg==",
                "valueType": "amF2YS5sYW5nLlN0cmluZw=="
              },
              "data": "ewogICJjb252ZXJ0ZWRTdHJpbmciIDogIkg0c0lBQUFBQUFBQS8xTXFTUzB1VVFJQXYyK0lNd1lBQUFBPSIsCiAgInJlZmVyZW5jZUlkIiA6ICJhNDhjN2Q4NS1lOWJhLTQ3N2YtODk1My01MzU2YmVhM2I3ZDMiLAogICJjb252ZXJ0ZWRTdHJpbmdBYm92ZTJNQiIgOiBmYWxzZQp9"
            },
            {
              "metadata": {
                "converterType": "Y29tLmU1LnBsYXRmb3JtLmNvcmUuY29udmVydGVyLkNvbnZlcnRlcldyYXBwZXI=",
                "encoding": "anNvbi9wbGFpbg==",
                "valueType": "Y29tLmU1LnBsYXRmb3JtLmNvcmUud3JlLnRlc3R3Zi5yYXRlTGltaXRlcldvcmtmbG93LmNvbnRhaW5lci5UcmFuc2FjdGlvbkNvbnRhaW5lciRNeUN1c3RvbUNvbnRhaW5lck9uZQ=="
              },
              "data": "ewogICJjb252ZXJ0ZWRTdHJpbmciIDogIkg0c0lBQUFBQUFBQS82dFdLczh2eWs3THlTOTN6czhyU2Ewb1ViTEtLODNKMFZGS0xLN01TM2F0U0UwdUxja3ZnZ21XcHdXWEpKYWtCcWNXbFdVbXAwSkVhd0VFZG1EZVF3QUFBQT09IiwKICAicmVmZXJlbmNlSWQiIDogIjc3ZDFmZDk0LWFmYzItNDFiYi1iZDIxLTM4NGI3Y2Y3N2QyMSIsCiAgImNvbnZlcnRlZFN0cmluZ0Fib3ZlMk1CIiA6IGZhbHNlCn0="
            },
            {
              "metadata": {
                "converterType": "Y29tLmU1LnBsYXRmb3JtLmNvcmUuY29udmVydGVyLkNvbnZlcnRlcldyYXBwZXI=",
                "encoding": "anNvbi9wbGFpbg==",
                "valueType": "Y29tLmU1LnBsYXRmb3JtLmNvcmUud29ya2Zsb3cuV29ya2Zsb3dDb250ZXh0"
              },
              "data": "ewogICJjb252ZXJ0ZWRTdHJpbmciIDogIkg0c0lBQUFBQUFBQS8zVlRYVytqTUJEOEx6ekhsY0Y4T1c5cG02dnlrcDRTcmoxZFZVWEdYa2ZXZ2VGc295YXErdC9QMEpDZ1MwOUNDTzk0ZG1kM2gvZmdyVEcvWmRXOHJiUnNndm43K2J4bU5RVHpZTXUwS0pzRGVqNkZnOW1GSVR4ZVNoSFJzb3k0Z0FReUtrbVpoVmtzUWtGREtZVVEwL3ZhT3FZNUREd0FEd05KRWZBd1FvVHdET1ZobWZwWGhPTkVwcHpUYk1KOVlrWXg3ZjRqQ3AzZ2E4SlFLMHRpU2xPYWlqeU9jaHJMVEphY3BZU21HYU5aQ2NrMTdRbU1WWTMyM1BBR1QrRXhycnVxdW1MZGRxb1M2NjR1d1l3M2VLT2wyditiem9HK2JxWVlnaDV0bVFIdDdocnQ0T0RHUkE3cXRqR3NlcDRPUHlTbHBDRm1pQUJPVUV5eEgySW1FeFFLUm1LYVlaN0hlWEROM1hSNm9BdWNwVVFtSVlvVElsR01LVUdNSkF4QlRuQUVKT2M0SlJONnI5ZTJqUGVpQlVqV1ZkT0JGMGJ0OTJDS1l6czB0VmpmM3o3KzNCWExiYkVyTnF1SGgrWG1pOHRxbUVDRUk2ODVSQ0V0TUozSHVYOXUwaWhLY1pURzVOZkxqK0x1MVhPVmJqdlhPOVN5dXExZzFSKy9LYWo2VGd4enNLdFVyUnlZWFk3N0dldGhENDl5WVkrYUx3L0FPK2RYWUlPNWh6OW12b0cyYW80MW5IYzlUR1RjeGYwWlJlUHVMdUxIOFYyNTBBUFRIc0hVU250aEcvalRnWFhnT1pKVkZtYUJWWHZOcWtYYmJwM0hnL25McTQ5OVp2dk9PZ3RUeEd2MUVYTzJ3M3Z2cWRPbngvd3Y1VG83dWtSNXhKaXU5Y1Uyd096RnFUN2FuRXo1OFhVdE94VDdDK1VkTDRRUUJBQUEiLAogICJyZWZlcmVuY2VJZCIgOiAiNzc4ZDhhN2EtNTE1OS00N2NkLWI3NmUtMDYwMGYwN2E3ZGZiIiwKICAiY29udmVydGVkU3RyaW5nQWJvdmUyTUIiIDogZmFsc2UKfQ=="
            },
            {
              "metadata": {
                "converterType": "Y29tLmU1LnBsYXRmb3JtLmNvcmUuY29udmVydGVyLkNvbnZlcnRlcldyYXBwZXI=",
                "encoding": "anNvbi9wbGFpbg==",
                "valueType": "Y29tLmU1LnBsYXRmb3JtLmNvcmUuZXZlbnQuQ29udGFpbmVyTWV0YWRhdGE="
              },
              "data": "ewogICJjb252ZXJ0ZWRTdHJpbmciIDogIkg0c0lBQUFBQUFBQS8wMk9PdzdDTUJCRTcrS2FTSTQvV1pzV21qUlVYR0M5bVpWUzRFaEpxQkIzeDBKQ29aNlo5K1psWktrN3p4WHJPSm16WVJyRWkzb2tjalpDZlN5bFp3bUVQZ3Q2OUluRVJjMVpNZ3Byd2hBbGxBSlJTaFNETTZjRGVPTUhHdksrY3QxWTlubXBsMS8wWHh2cnRuTVZmUDJheTBBSVU4Y1dyZ3VDMURWSDZJS2RIQld2MXViWXhzOE42eFhhNXRQbCtJKzZ6em8zK3RuUTRNMzdBeHZ4SzRmZEFBQUEiLAogICJyZWZlcmVuY2VJZCIgOiAiODNkNTJmNDUtNTgyMy00NjQxLWI4MDYtZjU3ZDM3MzE0OTZiIiwKICAiY29udmVydGVkU3RyaW5nQWJvdmUyTUIiIDogZmFsc2UKfQ=="
            }
          ]
        },
        "workflowExecutionTimeout": "0s",
        "workflowRunTimeout": "0s",
        "workflowTaskTimeout": "10s",
        "continuedExecutionRunId": "",
        "initiator": "Unspecified",
        "continuedFailure": null,
        "lastCompletionResult": null,
        "originalExecutionRunId": "0d15f4c6-2421-446c-9950-0c77e4ec8b13",
        "identity": "",
        "firstExecutionRunId": "0d15f4c6-2421-446c-9950-0c77e4ec8b13",
        "retryPolicy": null,
        "attempt": 1,
        "workflowExecutionExpirationTime": null,
        "cronSchedule": "",
        "firstWorkflowTaskBackoff": "0s",
        "memo": null,
        "searchAttributes": null,
        "prevAutoResetPoints": null,
        "header": {
          "fields": {}
        },
        "parentInitiatedEventVersion": "0"
      }
    },
    {
      "eventId": "2",
      "eventTime": "2025-11-19T09:52:40.688858005Z",
      "eventType": "WorkflowTaskScheduled",
      "version": "0",
      "taskId": "1081611",
      "workerMayIgnore": false,
      "workflowTaskScheduledEventAttributes": {
        "taskQueue": {
          "name": "workflow_sandbox_taskqueue_U-3G4327JUB2XES",
          "kind": "Normal"
        },
        "startToCloseTimeout": "10s",
        "attempt": 1
      }
    },
    {
      "eventId": "3",
      "eventTime": "2025-11-19T09:52:40.869443130Z",
      "eventType": "WorkflowTaskStarted",
      "version": "0",
      "taskId": "1081805",
      "workerMayIgnore": false,
      "workflowTaskStartedEventAttributes": {
        "scheduledEventId": "2",
        "identity": "7623@U-3G4327JUB2XES",
        "requestId": "b27d5c40-b4c4-4c34-87c5-db2aa0e60e05",
        "suggestContinueAsNew": false,
        "historySizeBytes": "2652"
      }
    },
    {
      "eventId": "4",
      "eventTime": "2025-11-19T09:52:40.944490597Z",
      "eventType": "WorkflowTaskCompleted",
      "version": "0",
      "taskId": "1081873",
      "workerMayIgnore": false,
      "workflowTaskCompletedEventAttributes": {
        "scheduledEventId": "2",
        "startedEventId": "3",
        "identity": "7623@U-3G4327JUB2XES",
        "binaryChecksum": "",
        "workerVersioningId": null,
        "sdkMetadata": null,
        "meteringMetadata": null
      }
    },
    {
      "eventId": "5",
      "eventTime": "2025-11-19T09:52:40.944512101Z",
      "eventType": "MarkerRecorded",
      "version": "0",
      "taskId": "1081874",
      "workerMayIgnore": false,
      "markerRecordedEventAttributes": {
        "markerName": "SideEffect",
        "details": {
          "data": {
            "payloads": [
              {
                "metadata": {
                  "converterType": "Y29tLmU1LnBsYXRmb3JtLmNvcmUuY29udmVydGVyLkNvbnZlcnRlcldyYXBwZXI=",
                  "encoding": "anNvbi9wbGFpbg==",
                  "valueType": "amF2YS5sYW5nLlN0cmluZw=="
                },
                "data": "ewogICJjb252ZXJ0ZWRTdHJpbmciIDogIkg0c0lBQUFBQUFBQS8xTnk5dmNMY2ZUMGN3MktEdzV4REFweGRWRUNBQ1FucURnVEFBQUEiLAogICJyZWZlcmVuY2VJZCIgOiAiMTI5OTZkMGMtYWM1ZS00ZGYzLWExYjYtOWQyNmI2NTNmMGU4IiwKICAiY29udmVydGVkU3RyaW5nQWJvdmUyTUIiIDogZmFsc2UKfQ=="
              }
            ]
          }
        },
        "workflowTaskCompletedEventId": "4",
        "header": null,
        "failure": null
      }
    },
    {
      "eventId": "6",
      "eventTime": "2025-11-19T09:52:40.944533351Z",
      "eventType": "MarkerRecorded",
      "version": "0",
      "taskId": "1081875",
      "workerMayIgnore": false,
      "markerRecordedEventAttributes": {
        "markerName": "SideEffect",
        "details": {
          "data": {
            "payloads": [
              {
                "metadata": {
                  "converterType": "Y29tLmU1LnBsYXRmb3JtLmNvcmUuY29udmVydGVyLkNvbnZlcnRlcldyYXBwZXI=",
                  "encoding": "anNvbi9wbGFpbg==",
                  "valueType": "amF2YS5sYW5nLlN0cmluZw=="
                },
                "data": "ewogICJjb252ZXJ0ZWRTdHJpbmciIDogIkg0c0lBQUFBQUFBQS8xTUtkL1FNaVhmeThYZjJqZzhPY1F3S2NYVlJBZ0RNTzJiY0ZBQUFBQT09IiwKICAicmVmZXJlbmNlSWQiIDogIjMzNGNhNzUxLWRhNTQtNDgwMi04ODlmLTI4YzYyZjhjZDRjMSIsCiAgImNvbnZlcnRlZFN0cmluZ0Fib3ZlMk1CIiA6IGZhbHNlCn0="
              }
            ]
          }
        },
        "workflowTaskCompletedEventId": "4",
        "header": null,
        "failure": null
      }
    },
    {
      "eventId": "7",
      "eventTime": "2025-11-19T09:52:40.944536728Z",
      "eventType": "TimerStarted",
      "version": "0",
      "taskId": "1081876",
      "workerMayIgnore": false,
      "timerStartedEventAttributes": {
        "timerId": "10d050d7-58c7-361b-bd51-ddbf441f1ab2",
        "startToFireTimeout": "10s",
        "workflowTaskCompletedEventId": "4"
      }
    },
    {
      "eventId": "8",
      "eventTime": "2025-11-19T09:52:50.945781011Z",
      "eventType": "TimerFired",
      "version": "0",
      "taskId": "1082245",
      "workerMayIgnore": false,
      "timerFiredEventAttributes": {
        "timerId": "10d050d7-58c7-361b-bd51-ddbf441f1ab2",
        "startedEventId": "7"
      }
    },
    {
      "eventId": "9",
      "eventTime": "2025-11-19T09:52:50.945787274Z",
      "eventType": "WorkflowTaskScheduled",
      "version": "0",
      "taskId": "1082246",
      "workerMayIgnore": false,
      "workflowTaskScheduledEventAttributes": {
        "taskQueue": {
          "name": "7623@U-3G4327JUB2XES:400361ee-6740-4d7b-9cb0-67949db37182",
          "kind": "Sticky"
        },
        "startToCloseTimeout": "10s",
        "attempt": 1
      }
    },
    {
      "eventId": "10",
      "eventTime": "2025-11-19T09:52:55.947561064Z",
      "eventType": "WorkflowTaskTimedOut",
      "version": "0",
      "taskId": "1082537",
      "workerMayIgnore": false,
      "workflowTaskTimedOutEventAttributes": {
        "scheduledEventId": "9",
        "startedEventId": "0",
        "timeoutType": "ScheduleToStart"
      }
    },
    {
      "eventId": "11",
      "eventTime": "2025-11-19T09:52:55.947566121Z",
      "eventType": "WorkflowTaskScheduled",
      "version": "0",
      "taskId": "1082538",
      "workerMayIgnore": false,
      "workflowTaskScheduledEventAttributes": {
        "taskQueue": {
          "name": "workflow_sandbox_taskqueue_U-3G4327JUB2XES",
          "kind": "Normal"
        },
        "startToCloseTimeout": "10s",
        "attempt": 1
      }
    }
  ]
}

Requesting your help on this.

Thanks

        {
            "eventId": "9",
            "eventTime": "2025-11-19T09:52:50.945787274Z",
            "eventType": "WorkflowTaskScheduled",
            "version": "0",
            "taskId": "1082246",
            "workerMayIgnore": false,
            "workflowTaskScheduledEventAttributes": {
                "taskQueue": {
                    "name": "7623@U-3G4327JUB2XES:400361ee-6740-4d7b-9cb0-67949db37182",
                    "kind": "Sticky"
                },
                "startToCloseTimeout": "10s",
                "attempt": 1
            }
        },
        {
            "eventId": "10",
            "eventTime": "2025-11-19T09:52:55.947561064Z",
            "eventType": "WorkflowTaskTimedOut",
            "version": "0",
            "taskId": "1082537",
            "workerMayIgnore": false,
            "workflowTaskTimedOutEventAttributes": {
                "scheduledEventId": "9",
                "startedEventId": "0",
                "timeoutType": "ScheduleToStart"
            }
        },

This is a sticky task timeout. A Workflow Task sticky timeout happens when a Workflow Task that was scheduled on a sticky task queue is not picked up by that specific Worker within the configured StickyScheduleToStartTimeout (5 seconds by default). For example:

  • The sticky Worker crashed or was shut down right after the task was scheduled, so it never polled its sticky queue.
  • The Worker is overloaded and doesn’t poll the sticky queue within the sticky timeout window.
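
The 5-second figure can be checked against the two events quoted above by subtracting their timestamps. The snippet below is a small standalone sketch using only `java.time`; the class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

public class StickyTimeoutCheck {
    /** Time between a task being scheduled and the server timing it out. */
    public static Duration gap(String scheduledAt, String timedOutAt) {
        return Duration.between(Instant.parse(scheduledAt), Instant.parse(timedOutAt));
    }

    public static void main(String[] args) {
        // eventTime of event 9 (WorkflowTaskScheduled, Sticky) and event 10 (WorkflowTaskTimedOut)
        Duration d = gap("2025-11-19T09:52:50.945787274Z", "2025-11-19T09:52:55.947561064Z");
        System.out.println(d.toMillis() + " ms"); // prints "5001 ms"
    }
}
```

A gap of about 5 seconds with timeoutType ScheduleToStart is consistent with the default sticky timeout rather than the 10-second workflow task timeout.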

Then the next workflow task is scheduled and apparently never picked up by the worker. You can use this guide as a reference: Worker performance | Temporal Platform Documentation, but I would start checking