In our use case we leverage a custom data converter to handle our custom encryption/decryption requirement for workflow payloads. As a result our custom data converter is invoked by a Workflow Task, however we are unable to control the retry policy for these tasks. As a result Temporal infinitely and aggressively retries the logic in our custom data converter, which led to a large incident that affected our company by taking down an internal service that was invoked from the custom data converter.
We are requesting that, similar to Activities, we are able to set a Retry Policy to instruct Temporal on how to retry the Workflow Tasks.
We landed 2 fixes that would mitigate this issue:
-
Backoff failed workflow task by yiminc · Pull Request #2548 · temporalio/temporal · GitHub this will prevent the aggressive retrying. The retry will still happen every 10s, but it won’t be a busy loop. This fix is available in server release v1.16.
-
Slow down workflow task retry by yycptt · Pull Request #2765 · temporalio/temporal · GitHub this will low down the retry further with exponential backoff up to 10m every retry. This fix will be available in server release v1.17.
We plan to add more control for this to be able to automatically pause workflow that can be resumed later when incident is mitigated.
1 Like