RetryOptions setDoNotRetry documentation mismatch

My understanding is the new RetryOptions.Builder().setDoNotRetry() is to explicitly set exceptions where we do not want the workflow or activity to retry.

This seems to correspond to NonRetryableErrorReasons in the main website documentation when describing retry policies:

  • NonRetryableErrorReasons allows to specify errors that shouldn’t be retried. For example, retrying invalid arguments error doesn’t make sense in some scenarios.

However, the documentation for setDoNotRetry says the exceptions are the “List of exceptions to retry”.

Furthermore, the docs also say that “When matching an exact match is used,” but this RpcRetryer code suggests that it checks the superclass of the exceptions when deciding whether to retry.

Link to JavaDocs

Am I misunderstanding something or looking at the wrong source code?



This is related to an error I’m debugging why a Child Workflow didn’t retry even though the retry policy should have allowed it to retry:

NonRetriableErrorReasons is only set to NonRetryableException

However, Child Workflow fails due to ActivityFailureException with cause JobKilledException instead of retrying.


In the activity code, it throws the exception throw new JobKilledException() without wrapping it because JobKilledException is a RuntimeException.

1 Like

You are 100% correct. The JavaDoc of the setDoNotRetry method is wrong. I submitted a PR to get it fixed.

The reason the child work didn’t retry because an activity failure is always wrapped in the ActivityFailure exception. As ActivityFailure by default is always retryable the child is not retried.

The solution is to throw an ApplicationFailure (possibly created with ApplicationFailure.newNonRetryableFailure) that wraps the ActivityFailure.

Thanks Maxim. Also, I should have clarified earlier but this is for cadence! That PR you linked is for temporal.

Followup question:

The reason the child work didn’t retry because an activity failure is always wrapped in the ActivityFailure exception. As ActivityFailure by default is always retryable the child is not retried.

If ActivityFailure is always retryable by default, why is the child not retried? Did you mean ActivityFailure is by default nonretryable?

The child workflow failed due to ActivityFailureException, but you say that ActivityFailureException is by default retryable (which was my understanding as well). Shouldn’t the child workflow have retried? I want the child workflow to retry but for some reason it didn’t retry with an ActivityFailureException.

I’m not sure at this point about Cadence. I believe it had some issues about workflow retry policies.

I personally think it is very rarely a good idea to retry workflows. Could you handle the activity exception in the child workflow itself?

The use case for the workflow retry is like this:

ChildWorkflowImpl implements ChildWorkflow{

  Response run(Request req) {

    int jobId = activity.submitHadoopJob()

    // if job is killed (by throwing a JobKilledException)
    // then I have to resubmit the job and get a new jobId
    activity.pollForCompletion(jobId)
  }
}

I could handle the exception in the child workflow itself fairly easily by catching the exception in a while loop to determine if I should resubmit the job.

Is this not a good idea to use workflow retry in this case? If the workflow retry policies have issues, I can use a while loop to be safe.

This github issue seems to fit my problem exactly… sigh

AFAIK Cadence doesn’t support MySQL and PostreSQL for production workloads. In Temporal we fixed quite a few bugs with them and perform extensive testing with every release to ensure quality.

Thanks for reporting.

The issue is fixed, and will be included in the coming release. See more details in the ticket.

Clarification: Cadence does support MySQL/Postgres as production use cases :smiley: Myself and many people are using MySQL/Postgres in production already.

NOTE: I/Cadence team are not active in this forum, but feel free to reach out in the Github issues/slack.

@qlong I think we have a different definition of support. Does Cadence team perform release testing (beyond unit test) of MySQL/PostgreSQL as part of their release process?

I see. But I guess you meant stress tests, because we are running integration test in CI which is more than unit tests.

Current release process is not performing stress test on MySQL/Postgres. But Cadence team is moving to sharded MySQL so it will come soon.

I do see the value you are pointing out, it’s great to hear that Temporal are performing all stress tests on MySQL/Postgres persistences. Even though it’s a thin layer, there could be changes that something unknown until we run into the scalability/perf issues. :smiley:

btw the bug is not related to persistence layer. It’s an edge case that we don’t have unit/integ test covered. It was also broken for Cassandra but most people are not aware of. Temporal doesn’t have the issue because the refactoring you did has removed the fields that could caused it.