We are observing WorkflowTaskTimedOut errors when multiple workflows are triggered at once.
We are using a parent and a child workflow, and the timeout errors occur in both.
The workflow code is simple: it just calls a list of activities.
The only executable work we do in the parent workflow is to get the workflow start time:
start_date_time = workflow.now().strftime('%Y-%m-%d %H:%M')
end_date_time = (datetime.strptime(start_date_time, '%Y-%m-%d %H:%M') + timedelta(hours=2)).strftime('%Y-%m-%d %H:%M')
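For reference, the same two-hour window can be computed without formatting the start time and parsing it back. This is only a sketch: the helper name two_hour_window is made up for illustration, and it takes a plain datetime so it can be tried outside a workflow.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"


def two_hour_window(start: datetime) -> tuple[str, str]:
    """Return (start, end) strings at minute precision, with end = start + 2h.

    Hypothetical helper, not part of the Temporal SDK.
    """
    # Truncate to the minute, which is what the strftime/strptime
    # round trip in the original snippet does implicitly.
    start_minute = start.replace(second=0, microsecond=0)
    end_minute = start_minute + timedelta(hours=2)
    return start_minute.strftime(FMT), end_minute.strftime(FMT)
```

Inside the workflow this would be called as two_hour_window(workflow.now()). It does no I/O, so it should not be the cause of the timeouts either way.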
We have not set any timeouts, so the defaults are used.
This is expected when the timed out workflow task response is sent by your worker.
The timeout type on WorkflowTaskTimedOut is StartToClose, meaning your worker was not able to report completion of the workflow task within the default timeout of 10s.
I still think it's important to check the SDK (worker) metrics mentioned earlier, as well as your worker pods'/containers' CPU and memory utilization during this time. That would let you pinpoint the cause rather than guess.
When I ran in debug mode, I saw this error in the logs just before the workflow tasks timed out:
DEBUG:temporalio.worker._workflow:Evicting workflow with run ID e5af6845-296d-4b15-a936-dc5bdd2274e1, message: Error reporting WFT to server
DEBUG:temporalio.worker._workflow:Evicting workflow with run ID cb857635-1cac-4562-ab1e-8ee1a85d7a9f, message: Error reporting WFT to server
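If there are many of these lines, a small script can tally them by eviction message and collect the run IDs, so they can be matched against the workflows that timed out. This is a rough sketch: the regex only assumes the line shape shown in the two log lines above, and summarize_evictions is a made-up helper name.

```python
import re
from collections import Counter

# Matches the debug line format shown above; other SDK versions may word it differently.
EVICTION_RE = re.compile(
    r"Evicting workflow with run ID (?P<run_id>[0-9a-f-]+), message: (?P<message>.*)$"
)


def summarize_evictions(log_lines):
    """Return (Counter of eviction messages, list of evicted run IDs)."""
    reasons = Counter()
    run_ids = []
    for line in log_lines:
        match = EVICTION_RE.search(line.strip())
        if match:
            reasons[match["message"]] += 1
            run_ids.append(match["run_id"])
    return reasons, run_ids
```

Feeding it the worker log, for example via open("worker.log") where the file name is just a placeholder, gives a count per message and the affected run IDs.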
Could you look at your worker metric request_failure?
You can filter it by operation and status_code. Check whether any errors are reported for the operation RespondWorkflowTaskCompleted, and note the status code.
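If the worker exposes its metrics in Prometheus text format, that filter can also be applied to a scraped dump with a few lines of Python. This is a sketch under assumptions: the accepted metric names (request_failure, or temporal_request_failure with a prefix) depend on how your worker's metrics are configured, and failures_by_status is a hypothetical helper, not an SDK function.

```python
import re

# One Prometheus text-format sample line: name{labels} value
SAMPLE_RE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Assumed metric names; adjust to whatever prefix your worker is configured with.
METRIC_NAMES = {"request_failure", "temporal_request_failure"}


def failures_by_status(metrics_text, operation="RespondWorkflowTaskCompleted"):
    """Sum request_failure samples for one operation, grouped by status_code."""
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match["name"] not in METRIC_NAMES:
            continue
        labels = dict(LABEL_RE.findall(match["labels"]))
        if labels.get("operation") != operation:
            continue
        status = labels.get("status_code", "unknown")
        totals[status] = totals.get(status, 0.0) + float(match["value"])
    return totals
```

The result is a dict mapping each status code to its failure count for RespondWorkflowTaskCompleted. An empty dict means the dump contains no failures for that operation.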