We are thinking about migrating from airflow to temporal. We probably run 100-150 dag runs max. There is one thing that detracts us from temporal - infinite retries. By default airflow will fail on every exception. I’ve tried to make it same in temporal with specifying next when creating a Python worker
workflow_failure_exception_types=[Exception],
But still in some cases like if we have a bug in the code the system is getting stuck(tries 1 time and pauses). How to mitigate that? Unfortunately I don’t know all the possible exceptions to make them non-retryable. What are the ways to solve this? Get a notification on every unexpected exception that happens in the system?
This is by design, workflow task is retried which allows you to detect the bug (with logs and sdk metrics) fix the code and the workflow will continue as if nothing had happened.
But still in some cases like if we have a bug in the code the system is getting stuck(tries 1 time and pauses)
It is not that it paused, workflow tasks are retried with an exponential backoff (up to 10 minutes) but retries are not written to the workflow history to avoid excessive growth of the event history, which could negatively impact performance and storage