I always think the first priority is to let the broken thing stop, and to resume it only after a human has repaired it.
This is exactly the intent. The workflow doesn’t fail, but it is blocked until the bug is resolved. I agree that the original implementation of retrying the workflow task without backoff is not perfect.
- Release v1.17.0 added backoff logic to workflow task retries. See pull request #2765.
- We will add the ability to automatically suspend a workflow execution after a certain number of workflow task retries. A batch command to unsuspend will also be supported.
Note that you can configure your workflows to fail on any specific exception type. If you specify Throwable as that exception type, the workflow will fail on any unhandled exception instead of blocking and retrying the workflow task.
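As a rough sketch of that configuration, assuming the Temporal Java SDK is in use: the option is set through WorkflowImplementationOptions when the workflow implementation is registered with a worker. The workflow interface, its implementation, and the task queue name below are hypothetical and exist only for illustration. The example also assumes a Temporal service reachable on the default local address.

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkflowImplementationOptions;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

public class FailOnAnyExceptionWorker {

  // Hypothetical workflow contract, used only for this illustration.
  @WorkflowInterface
  public interface OrderWorkflow {
    @WorkflowMethod
    void process(String orderId);
  }

  // Hypothetical implementation that throws an unhandled exception.
  public static class OrderWorkflowImpl implements OrderWorkflow {
    @Override
    public void process(String orderId) {
      throw new IllegalStateException("simulated bug for " + orderId);
    }
  }

  public static void main(String[] args) {
    // Assumes a Temporal service running at the default local address.
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);
    WorkerFactory factory = WorkerFactory.newInstance(client);

    // "orders-task-queue" is an assumed name.
    Worker worker = factory.newWorker("orders-task-queue");

    // Listing Throwable makes every unhandled exception fail the workflow
    // execution rather than fail (and retry) the workflow task.
    WorkflowImplementationOptions options =
        WorkflowImplementationOptions.newBuilder()
            .setFailWorkflowExceptionTypes(Throwable.class)
            .build();

    worker.registerWorkflowImplementationTypes(options, OrderWorkflowImpl.class);
    factory.start();
  }
}
```

Passing a narrower type in place of Throwable.class keeps the blocking behavior for every other exception, which may be the better trade-off if only some failures should terminate the workflow.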
And the fact that the cloud doesn’t support batch termination makes things worse for our system.
Batch operations will be supported in the cloud later this year.