Retry all workflow logic in `Workflow.retry`

Just want to double-check a design. I have a workflow that may throw an exception because the code needs updating to handle a new case, or really for any other reason. When that happens I want to alert and log an error signifying that I need to go fix something, but I want to avoid failing the workflow pretty much at all costs, because I don’t want to go through the process of figuring out which workflows need to be rerun after fixing the issue. Is wrapping my entire logic in Workflow.retry a good option for this, or is that not what Workflow.retry is intended for?

The code essentially looks like this without the retry.

// this is the @WorkflowMethod
public void run(InputObject input) {
  InternalType type = parseOption(input.getSomeValue());
} 

private InternalType parseOption(ExternalType externalType) {
  switch (externalType) {
    case VALUE_ONE:
      return InternalType.VALUE_ONE;
    default:
      // if this throws, the whole workflow will fail and it will have to be restarted
      throw new IllegalArgumentException("Unsupported external type " + externalType);
  }
}

For the version with the Workflow.retry I’m thinking I will just catch everything and always retry after emitting some debug info and metrics. I’ll include an escape exception that indicates we’re really messed up, and we won’t retry that one.

// this is the @WorkflowMethod
public void run(InputObject input) {
  Workflow.retry(
      new RetryOptions.Builder()
          .setDoNotRetry(UnrecoverableConfigurationException.class)
          .build(),
      () -> {
        try {
          runInternal(input);
        } catch (UnrecoverableConfigurationException e) {
          // this one we can't recover from
          logger.error("Caught exception intentionally marked as unrecoverable. Exiting.", e);
          markFailure(input);
          // I'll also alert on this, but it will signal that something is very wrong and cannot
          // be fixed without restarting, such as the request being malformed.
          statsClient.increment("top_level_workflow.unrecoverable_exception");
          throw e;
        } catch (Exception e) {
          logger.error(
              "Caught exception from top level workflow. Logging, alerting, and retrying", e);
          // I'll later alert on this metric externally
          statsClient.increment("top_level_workflow.exception");
          throw e;
        }
      });
}

private void runInternal(InputObject input) {
  InternalType type = parseOption(input.getSomeValue());
}

private InternalType parseOption(ExternalType externalType) {
  switch (externalType) {
    case VALUE_ONE:
      return InternalType.VALUE_ONE;
    default:
      // if this throws, the whole workflow will fail and it will have to be restarted
      throw new IllegalArgumentException("Unsupported external type " + externalType);
  }
}

Should I just be using a loop for this or is the Workflow.retry a better option? I’m guessing the Workflow.retry won’t fill up the history, but maybe it will also make things harder to debug from the Cadence web UI? I know that I could just catch IllegalArgumentException as well, but this is just one example. I think there are realistically a few places that might throw, and it seems easiest to have a top-level catch to ensure that the workflow keeps retrying until things are fixed.

Workflow.retry is recommended for retrying a part of a workflow to avoid retrying the whole workflow. If you need to retry the whole workflow, the recommended way is to specify retry options on workflow start using WorkflowOptions. This way the workflow is restarted even on run timeout, not only on failure.

I understand that your sample is pretty artificial, but it doesn’t make sense to me. Input is not going to change on retry, so retrying the method is not going to help at all. It looks like you want to load the “input” through an activity which is executed on every retry. But then you can just perform the validation in the activity code and retry the activity on validation failure by specifying RetryOptions when scheduling it.

Ah, I didn’t think of that. That makes sense, but I’d still need to try-catch the whole workflow to be able to log and emit metrics before rethrowing to allow the retry to kick in, correct?

The input won’t change, but the code itself in the workflow could change. Imagine that ExternalType is a protobuf enum or something like that. The definition could be updated externally, then whatever is starting workflows could send one of the new values, and the workflow wouldn’t have the updated code to handle that new value.

Another example would be just some sort of bug in the code, like a NullPointerException. I could update the code to fix that bug, and then the workflow could succeed without changing the input.

The workflow failure is already logged and appropriate metrics are already emitted by both client and the service. But if you want your own logs and metrics then the try-catch around the whole code makes sense.

There is another option. If you throw an Error or one of its subclasses from the workflow code, it is not going to fail the workflow but will block its execution. Then you would be able to deploy the fix without actually needing to restart the whole workflow.

Thinking about this, we could change the SDK to always block the workflow on any unexpected exception instead of failing it.
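
A minimal sketch of what that could look like in workflow code today, assuming the behavior described above (an Error thrown from workflow code blocks the workflow instead of failing it). It is plain Java with no Cadence classes; `BlockOnUnexpected`, `runBlockingOnUnexpected`, and the `VALUE_TWO` enum value are hypothetical names introduced only for illustration, and `parseOption` mirrors the method from the question.

```java
// Plain-Java sketch; no Cadence classes are used here.
final class BlockOnUnexpected {
    // Hypothetical stand-ins for the types in the question.
    enum ExternalType { VALUE_ONE, VALUE_TWO }
    enum InternalType { VALUE_ONE }

    // Same shape as parseOption in the question: unknown values throw.
    static InternalType parseOption(ExternalType externalType) {
        switch (externalType) {
            case VALUE_ONE:
                return InternalType.VALUE_ONE;
            default:
                throw new IllegalArgumentException("Unsupported external type " + externalType);
        }
    }

    // Hypothetical top-level wrapper: any unexpected RuntimeException is
    // rethrown as an Error, keeping the original exception as the cause.
    static InternalType runBlockingOnUnexpected(ExternalType externalType) {
        try {
            return parseOption(externalType);
        } catch (RuntimeException e) {
            // Custom logging and metrics would go here before rethrowing.
            throw new Error("Unexpected exception in workflow code", e);
        }
    }
}
```

An exception that really should fail the workflow (like UnrecoverableConfigurationException in the question) would need its own catch clause ahead of the generic one so that it is rethrown unchanged.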

Gotcha, that makes sense.

Can you define what blocked means in this case? Is it just that the decision task fails, but does not cause the workflow to exit, and the workflow stalls without scheduling any further decision tasks? What would I have to do to “unblock” it in such a situation? Would I need to use the reset functionality? https://docs.temporal.io/docs/learn-cli/#restart-reset-workflow

Blocked means that throwing an Error fails the decision task, which is then retried after the workflow task timeout. The retries keep happening up to the workflow run timeout. As soon as the workflow code is fixed, the decision task completes successfully, and the workflow continues execution without any additional actions needed.
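
As a toy model only, the blocked behavior can be pictured as a loop that re-runs the same task against the same input until fixed code is deployed. Nothing below is a Cadence API: `BlockedTaskModel`, `deployedCode`, and `runUntilUnblocked` are hypothetical names, the attempt budget stands in for the workflow run timeout, and the `betweenAttempts` callback stands in for waiting out the task timeout.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Toy model of a blocked workflow; not how the service is implemented.
final class BlockedTaskModel {
    // Hypothetical stand-in for the deployed workflow code; swapped when a fix ships.
    static final AtomicReference<Function<String, String>> deployedCode =
            new AtomicReference<>();

    // Re-runs the task with unchanged input until it stops throwing Error
    // or the attempt budget (standing in for the run timeout) is used up.
    static String runUntilUnblocked(String input, int maxAttempts, Runnable betweenAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return deployedCode.get().apply(input);
            } catch (Error e) {
                // The task failed; wait, then retry against whatever code is deployed.
                betweenAttempts.run();
            }
        }
        throw new IllegalStateException("Run timeout reached while still blocked");
    }
}
```

The point of the model is that the input never changes between attempts; only the deployed code does, which is why no reset or manual restart is needed once the fix is out.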

Ah, I understand now, thanks!

Given that one of Cadence’s selling points is that it makes fault-tolerance easier, it does seem like it would be more “fault-tolerant by default” if the SDK were to always block the workflow on an unexpected exception and retry instead of failing it, as you mentioned.

Filed an issue to get “block on unexpected exception” implemented.