Temporal Queue Activities

The issue we’re trying to avoid is that if a worker dies, all the workflows running on that host time out.

Let’s be precise about the terminology. Workflows don’t run on a specific worker, so if a worker dies, workflows are not affected. The activities running on that worker will time out, and you want to retry the whole sequence on a different host, as the example demonstrates.

The main timeout you want to set on the host-specific task queue is SCHEDULE_TO_START, as it ensures that an activity task does not get stuck in the queue for long if the host is down. I highly recommend reading the blog post (or watching the associated video) that explains activity timeouts in detail.
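As a rough illustration of the timeout described above, here is a minimal Java SDK sketch of activity options for a host-specific task queue. The FileActivities interface, the hostSpecificStub helper, and the timeout values are assumptions made for this example and are not taken from the sample; the helper is meant to be called from workflow code.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import java.time.Duration;

public class HostQueueOptionsSketch {

  // Hypothetical activity interface, defined here only to keep the sketch self-contained.
  @ActivityInterface
  public interface FileActivities {
    void process(String localPath);
  }

  // Builds a stub whose tasks are routed to one host's task queue.
  // Must be called from within workflow code.
  static FileActivities hostSpecificStub(String hostTaskQueue) {
    ActivityOptions options =
        ActivityOptions.newBuilder()
            // Route tasks to the queue that only the chosen host polls.
            .setTaskQueue(hostTaskQueue)
            // Fail fast if no worker on that host picks the task up (illustrative value).
            .setScheduleToStartTimeout(Duration.ofSeconds(10))
            // Upper bound on a single execution attempt (illustrative value).
            .setStartToCloseTimeout(Duration.ofMinutes(1))
            .build();
    return Workflow.newActivityStub(FileActivities.class, options);
  }
}
```

With a short SCHEDULE_TO_START timeout, a task queued for a dead host fails quickly, and the workflow can then restart the sequence on another host.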

The Java sample retries on any failure, but in our use case we would only want to retry on TimeoutFailure with timeoutType TIMEOUT_TYPE_START_TO_CLOSE, right?

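I agree that Workflow.retry makes it hard to retry on an exception that is not the top-level one but is chained to ActivityFailure. The workaround is to rethrow the cause, so that the retry options are matched against the underlying failure rather than the ActivityFailure wrapper: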

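  import io.temporal.common.RetryOptions;
  import io.temporal.failure.ActivityFailure;
  import io.temporal.failure.TemporalFailure;
  import io.temporal.workflow.Workflow;
  import java.net.URL;
  import java.time.Duration;
  import java.util.Optional;

  @Override
  public void processFile(URL source, URL destination) {
    RetryOptions retryOptions =
        RetryOptions.newBuilder()
            .setInitialInterval(Duration.ofSeconds(1))
            // Failure types listed here are NOT retried. List the types that
            // should stop the retry loop for your use case.
            .setDoNotRetry("io.temporal.failure.TimeoutFailure")
            .build();
    Workflow.retry(
        retryOptions,
        // Stop retrying once 10 seconds have elapsed.
        Optional.of(Duration.ofSeconds(10)),
        () -> {
          try {
            processFileImpl(source, destination);
          } catch (ActivityFailure e) {
            // ActivityFailure is only a wrapper. Rethrow its cause so that the
            // retry options see the real failure type (e.g. TimeoutFailure).
            throw (TemporalFailure) e.getCause();
          }
        });
  }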

I filed an enhancement request for a Workflow.retry variant that can run application code to decide whether a retry is needed.
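Until something like that exists, the decision can be made in a hand-written loop. The sketch below is plain Java, not Temporal API: RetryOnCause, retryWhen, WrapperFailure, and TimeoutCause are hypothetical names, with WrapperFailure standing in for ActivityFailure and TimeoutCause for TimeoutFailure. It only shows the idea of letting a predicate inspect the chained cause, and it omits backoff; inside a workflow, any delay between attempts would need Workflow.sleep rather than Thread.sleep.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

class RetryOnCause {

  // Stand-in for a wrapper exception such as ActivityFailure (hypothetical).
  static class WrapperFailure extends RuntimeException {
    WrapperFailure(Throwable cause) {
      super(cause);
    }
  }

  // Stand-in for a timeout cause such as TimeoutFailure (hypothetical).
  static class TimeoutCause extends RuntimeException {}

  // Runs body up to maxAttempts times. After a WrapperFailure, the predicate
  // inspects the chained cause and decides whether another attempt is made.
  static <T> T retryWhen(int maxAttempts, Predicate<Throwable> shouldRetry, Supplier<T> body) {
    for (int attempt = 1; ; attempt++) {
      try {
        return body.get();
      } catch (WrapperFailure e) {
        boolean attemptsLeft = attempt < maxAttempts;
        if (!attemptsLeft || !shouldRetry.test(e.getCause())) {
          // Out of attempts, or the cause is not one we retry on: give up.
          throw e;
        }
      }
    }
  }
}
```

A call such as retryWhen(3, cause -> cause instanceof TimeoutCause, body) retries only when the wrapped cause is a timeout and rethrows every other failure immediately.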