My understanding is that the new RetryOptions.Builder().setDoNotRetry() is meant to explicitly set the exceptions for which we do not want the workflow or activity to retry.
This seems to correspond to NonRetryableErrorReasons in the main website documentation when describing retry policies:
NonRetryableErrorReasons allows to specify errors that shouldn’t be retried. For example, retrying invalid arguments error doesn’t make sense in some scenarios.
However, the documentation for setDoNotRetry says the exceptions are the “List of exceptions to retry”.
Furthermore, the docs also say that “When matching an exact match is used,” but this RpcRetryer code suggests that it checks the superclass of the exceptions when deciding whether to retry.
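To illustrate the superclass matching described above, here is a self-contained sketch of what such a check could look like. This is a hypothetical helper written for illustration, not the actual RpcRetryer code:

```java
import java.util.List;

public class DoNotRetryCheck {
    // Walks the exception's class hierarchy, so a subclass of a listed
    // exception also matches. This mimics the superclass-checking behavior
    // observed in the RpcRetryer code; it is not the real implementation.
    static boolean matchesDoNotRetry(Throwable e,
                                     List<Class<? extends Throwable>> doNotRetry) {
        for (Class<?> c = e.getClass(); c != null; c = c.getSuperclass()) {
            if (doNotRetry.contains(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Class<? extends Throwable>> doNotRetry = List.of(RuntimeException.class);
        // IllegalArgumentException extends RuntimeException, so superclass
        // matching treats it as non-retryable even though it is not an exact match.
        System.out.println(matchesDoNotRetry(new IllegalArgumentException("bad"), doNotRetry));
    }
}
```

Under exact matching, the IllegalArgumentException above would not match and would still be retried; under superclass matching it is treated as non-retryable.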
You are 100% correct. The JavaDoc of the setDoNotRetry method is wrong. I submitted a PR to get it fixed.
The reason the child workflow didn’t retry is that an activity failure is always wrapped in the ActivityFailure exception. As ActivityFailure by default is always retryable, the child is not retried.
The solution is to throw an ApplicationFailure (possibly created with ApplicationFailure.newNonRetryableFailure) that wraps the ActivityFailure.
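The catch-and-wrap pattern described above can be sketched as follows. To keep the example self-contained, ActivityFailureStub and NonRetryableFailure are local stand-ins for the SDK's ActivityFailure and the failure created by ApplicationFailure.newNonRetryableFailure; they are not the real classes:

```java
public class WrapFailureSketch {
    // Stand-in for the SDK's ActivityFailure (hypothetical local class).
    static class ActivityFailureStub extends RuntimeException {
        ActivityFailureStub(String msg) { super(msg); }
    }

    // Stand-in for a non-retryable ApplicationFailure (hypothetical local class).
    static class NonRetryableFailure extends RuntimeException {
        NonRetryableFailure(String msg, Throwable cause) { super(msg, cause); }
    }

    // Simulates an activity invocation that fails.
    static void runActivity() {
        throw new ActivityFailureStub("hadoop job failed");
    }

    static void workflowBody() {
        try {
            runActivity();
        } catch (ActivityFailureStub e) {
            // Rethrow as a non-retryable failure that preserves the original
            // cause; with the real SDK this is where you would create an
            // ApplicationFailure wrapping the ActivityFailure.
            throw new NonRetryableFailure("activity failed permanently", e);
        }
    }

    public static void main(String[] args) {
        try {
            workflowBody();
        } catch (NonRetryableFailure f) {
            // The original failure survives as the cause for debugging.
            System.out.println(f.getCause().getMessage());
        }
    }
}
```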
Thanks Maxim. Also, I should have clarified earlier, but this is for Cadence! That PR you linked is for Temporal.
Followup question:
The reason the child workflow didn’t retry is that an activity failure is always wrapped in the ActivityFailure exception. As ActivityFailure by default is always retryable, the child is not retried.
If ActivityFailure is always retryable by default, why is the child not retried? Did you mean ActivityFailure is by default nonretryable?
The child workflow failed due to ActivityFailureException, but you say that ActivityFailureException is by default retryable (which was my understanding as well). Shouldn’t the child workflow have retried? I want the child workflow to retry but for some reason it didn’t retry with an ActivityFailureException.
class ChildWorkflowImpl implements ChildWorkflow {
    public Response run(Request req) {
        int jobId = activity.submitHadoopJob();
        // if the job is killed (by throwing a JobKilledException)
        // then I have to resubmit the job and get a new jobId
        activity.pollForCompletion(jobId);
        return new Response(jobId);
    }
}
I could handle the exception in the child workflow itself fairly easily by catching the exception in a while loop and deciding whether to resubmit the job.
Is it a bad idea to use workflow retry in this case? If the workflow retry policies have issues, I can use a while loop to be safe.
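The while-loop approach described above can be sketched like this. FakeActivity is a hypothetical stand-in for the real activity stub (submitHadoopJob/pollForCompletion), used only so the example runs on its own:

```java
public class ResubmitLoopSketch {
    static class JobKilledException extends RuntimeException {}

    // Hypothetical stand-in for the activity stub; the first two submitted
    // jobs are "killed" to exercise the resubmit path.
    static class FakeActivity {
        int attempts = 0;

        int submitHadoopJob() { return ++attempts; }

        void pollForCompletion(int jobId) {
            if (jobId < 3) throw new JobKilledException();
        }
    }

    // Handles JobKilledException inside the workflow body instead of relying
    // on the workflow retry policy: resubmit until the job completes.
    static int runUntilComplete(FakeActivity activity) {
        while (true) {
            int jobId = activity.submitHadoopJob();
            try {
                activity.pollForCompletion(jobId);
                return jobId; // job completed
            } catch (JobKilledException e) {
                // job was killed: loop around and resubmit for a new jobId
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(runUntilComplete(new FakeActivity())); // prints 3
    }
}
```

One advantage of this pattern over a full workflow retry is that only the failed job is resubmitted, rather than restarting the whole workflow from the beginning.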
AFAIK Cadence doesn’t support MySQL and PostgreSQL for production workloads. In Temporal we fixed quite a few bugs with them and perform extensive testing with every release to ensure quality.
@qlong I think we have a different definition of support. Does Cadence team perform release testing (beyond unit test) of MySQL/PostgreSQL as part of their release process?
I see. But I guess you meant stress tests, because we run integration tests in CI, which is more than unit tests.
The current release process does not perform stress tests on MySQL/Postgres. But the Cadence team is moving to sharded MySQL, so that will come soon.
I do see the value you are pointing out; it’s great to hear that Temporal performs all stress tests on the MySQL/Postgres persistence layers. Even though it’s a thin layer, there could be changes whose effects remain unknown until we run into scalability/perf issues.
BTW, the bug is not related to the persistence layer. It’s an edge case that our unit/integration tests don’t cover. It was also broken for Cassandra, but most people are not aware of it. Temporal doesn’t have the issue because the refactoring you did removed the fields that could have caused it.