Hey,
I am asking about “workflow task retry” because this mechanism seems hard to understand and insidious for our applications. What I have found so far is: if I don’t catch every runtime exception and convert it into a non-retryable ApplicationFailure, any unexpected exception will be retried forever.
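To make the behavior I mean concrete, here is a minimal Java SDK sketch (the OrderWorkflow interface and its method are hypothetical, just for illustration): a plain runtime exception from workflow code leaves the workflow task retrying forever, while a non-retryable ApplicationFailure fails the execution and stops the retries.

```java
import io.temporal.failure.ApplicationFailure;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

@WorkflowInterface
interface OrderWorkflow {
    @WorkflowMethod
    void process(String orderId);
}

public class OrderWorkflowImpl implements OrderWorkflow {
    @Override
    public void process(String orderId) {
        if (orderId == null) {
            // A plain RuntimeException fails the *workflow task*, which the
            // server retries again and again; the execution stays open and
            // the worker keeps picking it up.
            throw new IllegalArgumentException("orderId must not be null");
        }
        if (orderId.isEmpty()) {
            // A non-retryable ApplicationFailure fails the *workflow execution*
            // itself, so the retries stop and a human can step in.
            throw ApplicationFailure.newNonRetryableFailure(
                "empty orderId", "InvalidOrder");
        }
        // ... the rest of the orchestration
    }
}
```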
The reasoning behind this design seems to be “so that we can fix the bug and let the workflow retry to success”. As an application developer, I don’t find this very convincing. And once I take activity retries, the workflow execution retry policy, and the default behaviors into account, I get totally lost. Let me try to lay out a clear view of the retries:
My workflow definition is made up of:
1) the pure orchestration
2) the business logic that comes together with the orchestration
3) the atomic business logic wrapped in Activities
I don’t think we can leave 2) out, because the orchestration inevitably contains our application’s business logic; we cannot write the control flow without it.
So now we have 3 kinds of targets to retry:
1) pure orchestration
2) business logic that is more control-flow focused
3) normal business logic
For 2) and 3), wherever there is business code there is a high probability of business errors: dirty data, or bugs introduced during our fast sprints that we fail to catch beforehand. In application development on complex systems, this can hardly be avoided.
When it comes to unexpected application errors, I think application developers need full control. With the activity retry policy I do have that control (see the sketch right after this list):
- I try to identify all the business errors and declare them non-retryable (A)
- and if I fail to identify some of them, the maximum-attempts cap still stops the futile retries (B)
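Concretely, this is the kind of activity retry policy I mean. It is a minimal Java SDK sketch; the com.example.* exception types are just placeholders for our own business errors:

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import java.time.Duration;

class ActivityRetryConfig {
    static ActivityOptions businessActivityOptions() {
        return ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofSeconds(30))
            .setRetryOptions(RetryOptions.newBuilder()
                // (A) known business errors are never retried
                .setDoNotRetry(
                    "com.example.InvalidOrderException",
                    "com.example.DirtyDataException")
                // (B) anything I failed to anticipate is still capped
                .setMaximumAttempts(5)
                .build())
            .build();
    }
}
```

These options would then be passed to Workflow.newActivityStub(...) when creating the activity stub inside the workflow.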
But if the logic lives in the workflow definition, I am not sure anymore; I lose (B) entirely. Anything unexpected makes the workers retry forever, and it can even hinder other, healthy executions: the retry is very fast, so a small batch of dirty data can drain our workers, and we have to race to fix the code or clean up the dirty data. The fact that the cloud doesn’t support batch termination makes things even worse for us.
I do understand this is tricky and hard to handle, because workflow task retry covers both the “pure orchestration” responsibility and the “business control flow”. It is hard for the Temporal SDK to split them apart, and (B) is taken away from us so that “pure orchestration” can be guaranteed.
However, after 7 years of backend application engineering (I worked at TikTok and Alibaba), I keep coming back to the same principle: the first priority is to stop the broken thing, and resume it only after a human has repaired it.
So what can I do to achieve this? I have a few attempts below; please feel free to give feedback if I understand this topic completely wrong or if my tricks are not valid at all (really appreciated):
- wrap everything in try/catch (Throwable) and manually convert the exception into a non-retryable ApplicationFailure (which makes the code look bad)
- push more of the orchestration-related business logic into local activities, which effectively splits the business control flow from the pure orchestration (sketched below)
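Here is a rough Java SDK sketch of both workarounds combined (OrderWorkflow is the same hypothetical interface as in the first sketch, and ControlFlowLogic is a made-up activity interface). I catch Exception rather than raw Throwable so that SDK-internal Errors can still propagate:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.LocalActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.failure.ApplicationFailure;
import io.temporal.failure.TemporalFailure;
import io.temporal.workflow.Workflow;
import java.time.Duration;

// Hypothetical activity interface holding the control-flow-heavy business logic (2).
@ActivityInterface
interface ControlFlowLogic {
    String decideNextStep(String orderId);
}

public class OrderWorkflowImpl implements OrderWorkflow {

    // Second workaround: move the business control flow into a local activity,
    // which gives me a bounded retry policy (B) back.
    private final ControlFlowLogic logic = Workflow.newLocalActivityStub(
        ControlFlowLogic.class,
        LocalActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofSeconds(10))
            .setRetryOptions(RetryOptions.newBuilder()
                .setMaximumAttempts(3)
                .build())
            .build());

    @Override
    public void process(String orderId) {
        try {
            String nextStep = logic.decideNextStep(orderId);
            // ... drive the rest of the orchestration with nextStep
        } catch (TemporalFailure e) {
            // Let Temporal's own failures (activity failures, cancellation, ...)
            // propagate untouched.
            throw e;
        } catch (Exception e) {
            // First workaround: anything unexpected fails the workflow instead
            // of blocking the workflow task in an endless retry loop.
            throw ApplicationFailure.newNonRetryableFailure(
                "unexpected error: " + e.getMessage(), e.getClass().getName());
        }
    }
}
```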
This seems to be a complex topic and I may not have described the problem very clearly. Please let me know if some points need more explanation.
=====================
Plus:
- “retry-as-new” from the broken point seems like something we could use to restore the process, so we don’t lose the workflow even if we fail it
- if the application is time-sensitive, the problem gets worse. For example, after a user produces a video in some app, several steps run to actually generate the video so that the user can see it soon in the app. This kind of async job should finish in under 1 minute, and if some corrupt data blocks all our workers with retries, things get dangerous.
- so could a third option be to open the workflow task retry policy up to user configuration?