I am working on a project with a microservices architecture that uses multiple Temporal workers (one per service), each running in its own Docker container. The worker and client configuration is the same across all services, with some interceptors for logging and for propagating trace IDs through the context.
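To illustrate the kind of setup I mean, here is a simplified sketch using the Go SDK (not the actual project code; the interceptor, workflow, and activity below are placeholders):

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/interceptor"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// loggingInterceptor is a placeholder: a real one would override the
// WorkerInterceptorBase methods to log workflow and activity invocations.
type loggingInterceptor struct {
	interceptor.WorkerInterceptorBase
}

// Placeholder workflow and activity, just so the worker has something to register.
func MyWorkflow(ctx workflow.Context) error { return nil }
func MyActivity(ctx context.Context) error  { return nil }

func main() {
	c, err := client.Dial(client.Options{
		HostPort: client.DefaultHostPort,
		// ContextPropagators would be set here to carry trace IDs into workflows and activities.
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	w := worker.New(c, "TC", worker.Options{ // "TC" is the task queue name
		Interceptors: []interceptor.WorkerInterceptor{&loggingInterceptor{}},
	})
	w.RegisterWorkflow(MyWorkflow)
	w.RegisterActivity(MyActivity)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```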
Recently I’ve started defining workflow execution timeouts for many of the workflows in the project, and I’ve noticed that some workflows occasionally stall for a long time (over 30 seconds), which causes them to fail with a timeout. These workflows normally complete in under 5 seconds.
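The timeouts are set when each workflow is started, roughly along these lines (a Go SDK sketch; the function itself is illustrative, but the 30s/10s values and the single attempt are the ones used in the project):

```go
package example

import (
	"context"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/temporal"
)

// startWithTimeouts shows roughly how the execution timeout is applied per start call.
func startWithTimeouts(ctx context.Context, c client.Client, wf, input interface{}) (client.WorkflowRun, error) {
	opts := client.StartWorkflowOptions{
		TaskQueue:                "TC",
		WorkflowExecutionTimeout: 30 * time.Second, // the whole execution, across retries
		WorkflowRunTimeout:       30 * time.Second, // a single run
		WorkflowTaskTimeout:      10 * time.Second, // a single workflow task
		RetryPolicy:              &temporal.RetryPolicy{MaximumAttempts: 1}, // no retries
	}
	return c.ExecuteWorkflow(ctx, opts, wf, input)
}
```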
I have also defined activity start-to-close timeouts, but the activities don’t exhibit this stalling behavior, or at least no activity has timed out so far. Everything runs with the same retry policy of a single attempt (so nothing is retried if it fails).
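The activity side is configured inside each workflow, roughly like this (again a Go sketch with placeholder names and values, not the project code):

```go
package example

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// exampleWorkflow shows how the activity timeout and single-attempt retry policy
// are applied; the activity name, input, and 5s value are placeholders.
func exampleWorkflow(ctx workflow.Context) (string, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Second,
		RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 1}, // no retries
	})

	var result string
	err := workflow.ExecuteActivity(ctx, "SomeActivity", "some input").Get(ctx, &result)
	return result, err
}
```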
The stalling occurs during integration testing and is inconsistent: it only happens on some test runs. When it does, I see the log of the workflow being called (which comes from my logging interceptor) and then nothing for 30 seconds, at which point the workflow timeout is reached. During a normal run I would see logs of all the activities being called (also from the interceptor), as well as the internal logs of the work those activities do. This happens both locally on my machine and in my Jenkins pipeline, so it shouldn’t be related to system resources.
Any ideas on what could be causing this?
Could you post the history of such a timed-out workflow?
Yes. I’ve just redacted the name fields, as they may include some sensitive data, but they should not be relevant to this issue.
And here is the JSON transcript (also edited to remove the payload and the workflow name):
{ "events": [{"eventId":"1","eventTime":"2023-11-20T10:46:54.000Z","eventType":"WorkflowExecutionStarted","version":"0","taskId":"1049195","workflowExecutionStartedEventAttributes":{"workflowType":{"name":""},"parentWorkflowNamespace":"","parentInitiatedEventId":"0","taskQueue":{"name":"TC","kind":"Normal"},"input":{"payloads":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":""}]},"workflowExecutionTimeout":"30s","workflowRunTimeout":"30s","workflowTaskTimeout":"10s","continuedExecutionRunId":"","initiator":"Unspecified","originalExecutionRunId":"7742fab5-8a92-4422-977b-aa185a5f5963","identity":"10557@514304e7dfd0@","firstExecutionRunId":"7742fab5-8a92-4422-977b-aa185a5f5963","retryPolicy":{"nonRetryableErrorTypes":[],"initialInterval":"1s","backoffCoefficient":2,"maximumInterval":"30s","maximumAttempts":1},"attempt":1,"workflowExecutionExpirationTime":"2023-11-20T10:47:24.000Z","cronSchedule":"","firstWorkflowTaskBackoff":"0s","header":{"fields":{"custom-header":{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"eyJUcmFjZUlkIjoiN2M0NmEyY2I1YjI4OWE1MWNhNjQ3NmNiYmZkN2ExZGQiLCJTcGFuSWQiOiI2NjJkMjlmNWM4NTEwYjA4In0="},"_tracer-data":{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"eyJ0cmFjZXBhcmVudCI6IjAwLTdjNDZhMmNiNWIyODlhNTFjYTY0NzZjYmJmZDdhMWRkLWNkYWQwY2VhOTdhZmM2MTktMDEifQ=="}}}}},{"eventId":"2","eventTime":"2023-11-20T10:46:54.000Z","eventType":"WorkflowTaskScheduled","version":"0","taskId":"1049196","workflowTaskScheduledEventAttributes":{"taskQueue":{"name":"TC","kind":"Normal"},"startToCloseTimeout":"10s","attempt":1}},{"eventId":"3","eventTime":"2023-11-20T10:47:24.000Z","eventType":"WorkflowExecutionTimedOut","version":"0","taskId":"1049202","workflowExecutionTimedOutEventAttributes":{"retryState":"Timeout","newExecutionRunId":""}}] }
An update: running multiple workers per service seems to have mitigated or solved the stalling issue. For some unknown reason, a single worker would sometimes stall between scheduling and executing activities, even though the project is nowhere near the default limits for concurrent tasks and doesn’t appear to be resource-starved.
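For anyone running into the same thing, the change boils down to pointing more than one worker at the same task queue. A rough Go sketch (not the exact project code); the commented-out alternative of raising a single worker’s poller counts is another knob the SDK exposes, though I haven’t verified it against this particular issue:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// placeholderWorkflow stands in for the real workflows registered on each worker.
func placeholderWorkflow(ctx workflow.Context) error { return nil }

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Two workers polling the same task queue ("TC", as in the history above).
	w1 := worker.New(c, "TC", worker.Options{})
	w2 := worker.New(c, "TC", worker.Options{})
	w1.RegisterWorkflow(placeholderWorkflow)
	w2.RegisterWorkflow(placeholderWorkflow)

	// Alternative knob: a single worker with more pollers per task type.
	// w := worker.New(c, "TC", worker.Options{
	// 	MaxConcurrentWorkflowTaskPollers: 4,
	// 	MaxConcurrentActivityTaskPollers: 4,
	// })

	go func() {
		if err := w1.Run(worker.InterruptCh()); err != nil {
			log.Fatalln("worker 1 stopped:", err)
		}
	}()
	if err := w2.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker 2 stopped:", err)
	}
}
```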