Workflow task retry seems dangerous and hard to understand; what are the best practices for applications?

Hey,

I am asking about “workflow task retry” because this mechanism seems hard to understand and insidious for our applications. What I have found so far: if I don’t catch all runtime exceptions and convert them to non-retryable application failures, any unexpected exception will be retried forever.

The reason behind this design seems to be “so that we can fix the bug and let the workflow retry until it succeeds”. But as an application developer, I don’t find this very convincing. And when I take the activity retry policy, the execution retry policy, and the default behavior into account, I get totally lost. Let me try to lay out a clear view of the retries:

My workflow definition is made up of

  1. the pure orchestration
  2. the business logic that comes together with the orchestration
  3. the atomic business logic wrapped in Activities.

I don’t think we can leave 2) out, because the orchestration definitely contains our application’s business logic; we cannot write the control flow without it.

So now we have 3 kinds of targets to retry:

  1. pure orchestration
  2. business logic that is more control-flow focused
  3. normal business logic

For 2) and 3), as long as there is business code, there is a high possibility of business errors coming from dirty data or from bugs introduced during our fast sprints that we failed to discover beforehand. In application development with complex systems, this can hardly be avoided.

When it comes to unexpected application errors, I think application developers need full control. I find I have that control with the activity retry policy:

  • I try to find all the business errors and define them as non-retryable (A)
  • and if I fail to find them, the maximum number of attempts will stop the futile retries (B)
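
As a rough illustration of (A) and (B), an activity retry policy built with the Temporal Java SDK (called from Kotlin here) might look like the sketch below. The `DirtyDataException` class, the attempt limit, and the timeout are placeholder assumptions, not values from this thread.

```kotlin
import io.temporal.activity.ActivityOptions
import io.temporal.common.RetryOptions
import java.time.Duration

// Hypothetical business error, used only for this illustration.
class DirtyDataException(message: String) : RuntimeException(message)

// (A) known business errors are listed as non-retryable;
// (B) maximumAttempts bounds the retries for anything we did not anticipate.
val retryOptions: RetryOptions = RetryOptions.newBuilder()
    .setMaximumAttempts(5) // assumed limit
    .setDoNotRetry(DirtyDataException::class.java.name)
    .build()

val activityOptions: ActivityOptions = ActivityOptions.newBuilder()
    .setStartToCloseTimeout(Duration.ofSeconds(30)) // assumed timeout
    .setRetryOptions(retryOptions)
    .build()
```

These options would then be passed when creating the activity stub inside the workflow.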

But if the logic is in the workflow definition, I lose control of (B) entirely. Anything unexpected makes the workers retry forever, and may even hinder other normal executions, because the retry is very fast and a small amount of dirty data can drain our workers; we must then work fast to fix the code or clean the dirty data. And the fact that the cloud doesn’t support batch termination makes our system worse.

I kind of understand that this is tricky and hard to handle, because workflow task retry covers both the “pure orchestration” responsibility and the “business control flow”. It is hard for the Temporal SDK to split them, so (B) is taken away from us in order to guarantee “pure orchestration”.

However, after 7 years of application backend engineering experience (I worked for TikTok and Alibaba), I always think the first priority is to let the broken thing stop first and resume after human’s repair.
So what can I do to achieve this? I have tried a few things; please feel free to give feedback if I have misunderstood this topic or if my tricks are not valid at all (really appreciated):

  • try {all} catch {throwable} and manually turn the exception into a non-retryable ApplicationFailure (makes the code look bad)
  • Put more orchestration-related logic into local activities, which effectively splits the business control flow from the pure orchestration

This seems to be a complex topic and I may not be able to describe this problem very clearly. Please let me know if some points need more explanation.

=====================

Plus:

  1. “retry-as-new” from the broken point seems to be something we can use to restore our process, so we don’t lose the workflow even if we fail it
  2. if the application system is time-sensitive, the problem becomes worse. For example, after a user produces a video in some app, several steps are taken to actually generate the video so that the user can see it in the app soon. This kind of async job should finish in less than 1 minute, and if some corrupt data blocks all our workers with retries, things get dangerous.

So would a third option be to open up the workflow task retry policy to user configuration?


I always think the first priority is to let the broken thing stop first and resume after human’s repair.

This is exactly the intent. The workflow doesn’t fail, but it is blocked until the bug is resolved. I agree that the original implementation of retrying the workflow task without backoff is not perfect.

  • Release v1.17.0 added backoff logic to workflow task retries. See pull request #2765.
  • We will add the ability to automatically suspend workflow execution after a certain number of workflow task retries. The batch command to unsuspend will be supported.
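
For intuition only, the sketch below shows a generic capped exponential backoff and how few retries fit into a time window compared with immediate retries. It is not the server's actual algorithm; the 1 s initial delay, the factor of 2, and the 60 s cap are made-up values, and both function names are hypothetical.

```kotlin
import kotlin.math.min
import kotlin.math.pow

// Delay before retry number `attempt` (1-based), capped at maxMs.
// All default values are illustrative assumptions.
fun backoffDelayMs(
    attempt: Int,
    initialMs: Long = 1_000,
    factor: Double = 2.0,
    maxMs: Long = 60_000
): Long {
    require(attempt >= 1) { "attempt is 1-based" }
    val raw = initialMs * factor.pow(attempt - 1)
    return min(raw, maxMs.toDouble()).toLong()
}

// Number of retries that fit into a window when each retry first waits its backoff delay.
fun retriesWithin(windowMs: Long): Int {
    var elapsed = 0L
    var attempt = 0
    while (true) {
        val next = backoffDelayMs(attempt + 1)
        if (elapsed + next > windowMs) return attempt
        elapsed += next
        attempt++
    }
}

fun main() {
    println((1..8).map { backoffDelayMs(it) })
    println(retriesWithin(10 * 60_000L)) // retries that fit into ten minutes
}
```

With these assumed values, only a handful of retries happen per workflow in ten minutes, which is the kind of load reduction backoff is meant to give.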

Note that you can configure your workflows to fail on any specific exception type. If you specify Throwable as such exception type, the workflow will fail on any unhandled exception.
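
A minimal sketch of that configuration with the Java SDK, written in Kotlin, could look like this. `VideoWorkflow`, `VideoWorkflowImpl`, and `registerFailFast` are hypothetical names for illustration; the worker is assumed to be created elsewhere.

```kotlin
import io.temporal.worker.Worker
import io.temporal.worker.WorkflowImplementationOptions
import io.temporal.workflow.WorkflowInterface
import io.temporal.workflow.WorkflowMethod

// Hypothetical workflow, used only for this illustration.
@WorkflowInterface
interface VideoWorkflow {
    @WorkflowMethod
    fun process(videoId: String)
}

class VideoWorkflowImpl : VideoWorkflow {
    override fun process(videoId: String) {
        // With the options below, an unhandled exception thrown here fails the
        // workflow execution instead of causing the workflow task to be retried.
        require(videoId.isNotBlank()) { "videoId must not be blank" }
    }
}

fun registerFailFast(worker: Worker) {
    val options = WorkflowImplementationOptions.newBuilder()
        .setFailWorkflowExceptionTypes(Throwable::class.java)
        .build()
    worker.registerWorkflowImplementationTypes(options, VideoWorkflowImpl::class.java)
}
```

The trade-off is that a workflow failed this way cannot simply resume after a bug fix, so this fits cases where stopping fast matters more than resuming in place.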

And the fact that the cloud doesn’t support batch termination makes our system worse.

Batch will be supported in the cloud later this year.

However, this does not seem to be the case. I tested with the Java SDK, and the task is sent to our worker through long polling perpetually instead of staying blocked as something static?

But I think you have given me enough follow-up information. Really appreciate it :slight_smile:

Task retry is an internal implementation detail. From the application developer’s point of view, the workflow is blocked as it doesn’t make any forward progress. And as I pointed out in the previous post, the excessive resource utilization issue is something we are fixing.

Another follow-up question: how about the local activity idea? I feel that splitting orchestration and business flow as much as possible is a clean approach. If there are no explicit problems with it, I will probably advocate it to my colleagues.

I don’t understand the difference between orchestration and business flow. For me, the business flow is an orchestration. Are you talking about a DSL interpreter workflow?

Oh, I can explain a little. If the definition is

if (A > 10) {
    activity-1
} else {
    activity-2
}

I think the business logic on A is too thin to split from the orchestration.
But there are many cases where A is a huge piece of business logic, for example calculating the price of all goods in your Amazon cart; a DSL is another example.
In those cases, A becomes very fat and more likely to hide bugs, and I think of this as business control inside the orchestration. I would probably recommend that others wrap the calculation in a local activity instead of putting it directly in the definition, so that

  • the retry policy becomes the local activity retry policy
  • the result is stored, so no versioning is needed for later changes
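
A rough sketch of this idea with the Java SDK, written in Kotlin, is shown below. `PricingActivities`, `CartWorkflow`, the timeout, and the attempt limit are assumptions for illustration, and the returned strings stand in for the real activity-1 and activity-2 calls.

```kotlin
import io.temporal.activity.ActivityInterface
import io.temporal.activity.LocalActivityOptions
import io.temporal.common.RetryOptions
import io.temporal.workflow.Workflow
import io.temporal.workflow.WorkflowInterface
import io.temporal.workflow.WorkflowMethod
import java.time.Duration

// Hypothetical activity holding the "fat" calculation of A.
@ActivityInterface
interface PricingActivities {
    fun calculateA(cartId: String): Int
}

@WorkflowInterface
interface CartWorkflow {
    @WorkflowMethod
    fun checkout(cartId: String): String
}

class CartWorkflowImpl : CartWorkflow {
    private val pricing: PricingActivities = Workflow.newLocalActivityStub(
        PricingActivities::class.java,
        LocalActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofSeconds(5)) // assumed timeout
            .setRetryOptions(
                RetryOptions.newBuilder().setMaximumAttempts(3).build() // assumed limit
            )
            .build()
    )

    override fun checkout(cartId: String): String {
        // The calculation runs as a local activity, so a bug in it is governed by
        // the local activity retry policy rather than by workflow task retry.
        val a = pricing.calculateA(cartId)
        // Only the thin branching stays in the workflow definition.
        return if (a > 10) "activity-1" else "activity-2"
    }
}
```

Once the local activity exhausts its attempts, the failure surfaces in the workflow as an activity failure that the workflow code can handle or let propagate.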

I see. Yes, moving A into a local activity does make sense in this case. I’ve seen use cases when A was a complex rule engine that would return the next steps given the current workflow state.


Hey,

the excessive resource utilization issue is something we are fixing.

I have a question: has this problem been fixed? Or do we need to try/catch everything to avoid this?

    import io.temporal.failure.ApplicationFailure

    override fun startExecution(dto: Data) {
        try {
            // logic
        } catch (e: Throwable) {
            // abnormal cases: an activity failed or the workflow code itself threw
            logger.error("[UnexpectedWorkflowFailure] ${e.message}\n${e.stackTraceToString()}")
            if (e is ApplicationFailure) {
                // already an application failure: rethrow unchanged
                throw e
            } else {
                // convert anything else into a non-retryable failure so the workflow fails
                throw ApplicationFailure.newNonRetryableFailure(e.message, e.javaClass.name)
            }
        }
    }

That is, do we need to catch every Throwable in our workflow definition like this to avoid this case in our application? (We are on Temporal Cloud.)

Or do we need to try/catch everything to avoid this?

We are working on a reply to your similar question in the post here.

I have a question: has this problem been fixed?

In OSS, the mentioned issue was resolved and merged, and the fix is included in OSS server versions 1.17.x.
For Cloud, please ask this question in your company Slack and we will check with the server team to make sure we give you the right info.
