I am working on a project with a microservices architecture that uses multiple Temporal workers (one per service), each running in its own Docker container. The worker and client configuration is the same across all services, with some interceptors for logging and for propagating trace IDs through the context.
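To illustrate the kind of setup I mean, here is a simplified sketch using the Go SDK (not the actual project code; the interceptor, workflow, and activity below are placeholders):

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/interceptor"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// loggingInterceptor is a placeholder: a real one would override the
// WorkerInterceptorBase methods to log workflow and activity invocations.
type loggingInterceptor struct {
	interceptor.WorkerInterceptorBase
}

// Placeholder workflow and activity, just so the worker has something to register.
func MyWorkflow(ctx workflow.Context) error { return nil }
func MyActivity(ctx context.Context) error  { return nil }

func main() {
	c, err := client.Dial(client.Options{
		HostPort: client.DefaultHostPort,
		// ContextPropagators would be set here to carry trace IDs into workflows and activities.
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	w := worker.New(c, "TC", worker.Options{ // "TC" is the task queue name
		Interceptors: []interceptor.WorkerInterceptor{&loggingInterceptor{}},
	})
	w.RegisterWorkflow(MyWorkflow)
	w.RegisterActivity(MyActivity)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```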
Recently I’ve started defining workflow execution timeouts for many of the workflows in the project, and I’ve noticed that some workflows occasionally stall for a long time (over 30 seconds), which causes them to fail with a timeout. These workflows normally complete in under 5 seconds.
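The timeouts are set when each workflow is started, roughly along these lines (a Go SDK sketch; the function itself is illustrative, but the 30s/10s values and the single attempt are the ones used in the project):

```go
package example

import (
	"context"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/temporal"
)

// startWithTimeouts shows roughly how the execution timeout is applied per start call.
func startWithTimeouts(ctx context.Context, c client.Client, wf, input interface{}) (client.WorkflowRun, error) {
	opts := client.StartWorkflowOptions{
		TaskQueue:                "TC",
		WorkflowExecutionTimeout: 30 * time.Second, // the whole execution, across retries
		WorkflowRunTimeout:       30 * time.Second, // a single run
		WorkflowTaskTimeout:      10 * time.Second, // a single workflow task
		RetryPolicy:              &temporal.RetryPolicy{MaximumAttempts: 1}, // no retries
	}
	return c.ExecuteWorkflow(ctx, opts, wf, input)
}
```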
I have also defined activity start-to-close timeouts, but the activities don’t exhibit this stalling behavior, or at least no activity has timed out so far. Everything runs with the same retry policy of a single attempt (so nothing is retried if it fails).
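The activity side is configured inside each workflow, roughly like this (again a Go sketch with placeholder names and values, not the project code):

```go
package example

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// exampleWorkflow shows how the activity timeout and single-attempt retry policy
// are applied; the activity name, input, and 5s value are placeholders.
func exampleWorkflow(ctx workflow.Context) (string, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Second,
		RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 1}, // no retries
	})

	var result string
	err := workflow.ExecuteActivity(ctx, "SomeActivity", "some input").Get(ctx, &result)
	return result, err
}
```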
The stalling occurs during integration testing and is inconsistent: it only happens on some test runs. When it does, I see the log of the workflow being called (which comes from my logging interceptor) and then nothing for 30 seconds, at which point the workflow timeout is reached. During a normal run I would see logs of all the activities being called (also from the interceptor), as well as the internal logs of the work those activities do. This happens both locally on my machine and in my Jenkins pipeline, so it shouldn’t be related to system resources.
Any ideas on what could be causing this?
Could you post the history of such a timed-out workflow?
Yes. I’ve just redacted the name fields, as they may include some sensitive data, but they should not be relevant to this issue.
And here is the JSON transcript (also edited to remove the payload and the workflow name):
{ "events": [{"eventId":"1","eventTime":"2023-11-20T10:46:54.000Z","eventType":"WorkflowExecutionStarted","version":"0","taskId":"1049195","workflowExecutionStartedEventAttributes":{"workflowType":{"name":""},"parentWorkflowNamespace":"","parentInitiatedEventId":"0","taskQueue":{"name":"TC","kind":"Normal"},"input":{"payloads":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":""}]},"workflowExecutionTimeout":"30s","workflowRunTimeout":"30s","workflowTaskTimeout":"10s","continuedExecutionRunId":"","initiator":"Unspecified","originalExecutionRunId":"7742fab5-8a92-4422-977b-aa185a5f5963","identity":"10557@514304e7dfd0@","firstExecutionRunId":"7742fab5-8a92-4422-977b-aa185a5f5963","retryPolicy":{"nonRetryableErrorTypes":[],"initialInterval":"1s","backoffCoefficient":2,"maximumInterval":"30s","maximumAttempts":1},"attempt":1,"workflowExecutionExpirationTime":"2023-11-20T10:47:24.000Z","cronSchedule":"","firstWorkflowTaskBackoff":"0s","header":{"fields":{"custom-header":{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"eyJUcmFjZUlkIjoiN2M0NmEyY2I1YjI4OWE1MWNhNjQ3NmNiYmZkN2ExZGQiLCJTcGFuSWQiOiI2NjJkMjlmNWM4NTEwYjA4In0="},"_tracer-data":{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"eyJ0cmFjZXBhcmVudCI6IjAwLTdjNDZhMmNiNWIyODlhNTFjYTY0NzZjYmJmZDdhMWRkLWNkYWQwY2VhOTdhZmM2MTktMDEifQ=="}}}}},{"eventId":"2","eventTime":"2023-11-20T10:46:54.000Z","eventType":"WorkflowTaskScheduled","version":"0","taskId":"1049196","workflowTaskScheduledEventAttributes":{"taskQueue":{"name":"TC","kind":"Normal"},"startToCloseTimeout":"10s","attempt":1}},{"eventId":"3","eventTime":"2023-11-20T10:47:24.000Z","eventType":"WorkflowExecutionTimedOut","version":"0","taskId":"1049202","workflowExecutionTimedOutEventAttributes":{"retryState":"Timeout","newExecutionRunId":""}}] }
An update: running multiple workers per service seems to have mitigated or solved the stalling issue. For some unknown reason, a single worker would sometimes stall between scheduling and executing activities, even though the project is nowhere near the default limits for concurrent tasks and doesn’t appear to be resource-starved.
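For anyone running into the same thing, the change boils down to pointing more than one worker at the same task queue. A rough Go sketch (not the exact project code); the commented-out alternative of raising a single worker’s poller counts is another knob the SDK exposes, though I haven’t verified it against this particular issue:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// placeholderWorkflow stands in for the real workflows registered on each worker.
func placeholderWorkflow(ctx workflow.Context) error { return nil }

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Two workers polling the same task queue ("TC", as in the history above).
	w1 := worker.New(c, "TC", worker.Options{})
	w2 := worker.New(c, "TC", worker.Options{})
	w1.RegisterWorkflow(placeholderWorkflow)
	w2.RegisterWorkflow(placeholderWorkflow)

	// Alternative knob: a single worker with more pollers per task type.
	// w := worker.New(c, "TC", worker.Options{
	// 	MaxConcurrentWorkflowTaskPollers: 4,
	// 	MaxConcurrentActivityTaskPollers: 4,
	// })

	go func() {
		if err := w1.Run(worker.InterruptCh()); err != nil {
			log.Fatalln("worker 1 stopped:", err)
		}
	}()
	if err := w2.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker 2 stopped:", err)
	}
}
```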