Error handling in Task Queue ChildWorkflows

Hi,

One of our workflows works with files, so we use a host-specific task queue to make its activities and child workflows execute on the same worker.

The workflow is defined by:

  • Activity1. Download the file.
  • ChildWorkflow. Process the file.
  • Activity2. Upload the file.

Activity1 is picked up by any worker, and the next steps are tied to that host's specific task queue.
Note that we use a ChildWorkflow in the second step to process the file, and it has several steps of its own.

While testing how it would recover from worker failures, we found something that we don’t understand.

The history shows the ChildWorkflow (tied to a specific host through its task queue). We stopped the host-specific worker while the timer (TimerStarted) and the activity (ActivityTaskStarted) were in progress.

...
final var promise = Async.procedure(() -> transcriptionsActivity.waitForProcessing(operationName));

try {
    promise.get(10, TimeUnit.MINUTES);
} catch (TimeoutException te) {
    throw Activity.wrap(te);
}
...

If I understand correctly, what’s happening behind the scenes is:

  1. The promise doesn’t finish within 10 minutes, so TimerFired is triggered.
  2. The activity is configured with 10 attempts, so a workflow task is scheduled (sticky first) and, since the worker is down, it times out with ScheduleToStart.
  3. Since it was sticky, it is scheduled again on the shared task queue, which in this case is the host-specific task queue, whose worker is also down.

The questions I have are:

  • Why doesn’t the ActivityTaskStarted (event 23) time out when it is configured with a startToCloseTimeout of 6 hours? (More than 24 hours have passed.)
  • Why is the ActivityTaskStarted (event 23) shown as the last event when it happened right after the ActivityTaskScheduled (event 18)?
  • Why didn’t the WorkflowTaskScheduled (event 22) time out with a ScheduleToStart when it was tied to the task queue of a worker that was down?

Thanks!

  • Why doesn’t the ActivityTaskStarted (event 23) time out when it is configured with a startToCloseTimeout of 6 hours? (More than 24 hours have passed.)

It looks like the activity is in a retry loop. Since its timeout is set to 6 hours, it is going to retry every 6 hours in this case, unless the activity heartbeats and has a much shorter heartbeat timeout configured.
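To put rough numbers on that retry loop, here is a back-of-the-envelope sketch using the values from the question (6-hour startToCloseTimeout, 10 attempts). It assumes every attempt runs into its timeout and it ignores retry backoff. The class name, the method names, and the 30-second heartbeat timeout are made up for illustration.

```java
import java.time.Duration;

public class RetryTimeline {
    // Values taken from the question: 6-hour startToCloseTimeout, 10 attempts.
    static final Duration START_TO_CLOSE = Duration.ofHours(6);
    static final int MAX_ATTEMPTS = 10;

    /** 1-based attempt in progress after `elapsed`, if each attempt uses its full timeout. */
    static int attemptInProgress(Duration elapsed, Duration perAttemptTimeout) {
        return (int) (elapsed.toMinutes() / perAttemptTimeout.toMinutes()) + 1;
    }

    /** Longest time before the activity can fail for good (retry backoff ignored). */
    static Duration worstCase(Duration perAttemptTimeout, int maxAttempts) {
        return perAttemptTimeout.multipliedBy(maxAttempts);
    }

    public static void main(String[] args) {
        // After 24 hours, four attempts have timed out and the fifth is pending.
        System.out.println(attemptInProgress(Duration.ofHours(24), START_TO_CLOSE)); // 5

        // Relying on startToCloseTimeout alone: up to 60 hours until final failure.
        System.out.println(worstCase(START_TO_CLOSE, MAX_ATTEMPTS).toHours()); // 60

        // With a hypothetical 30-second heartbeat timeout: about 5 minutes.
        System.out.println(worstCase(Duration.ofSeconds(30), MAX_ATTEMPTS).toMinutes()); // 5
    }
}
```

So after 24 hours the activity would still have retries left, which would be consistent with it not having failed yet. A short heartbeat timeout is what lets the server notice a dead worker quickly.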

  • Why is the ActivityTaskStarted (event 23) shown as the last event when it happened right after the ActivityTaskScheduled (event 18)?

This is not a real event. It is there to show that the activity is retrying. The real ActivityTaskStarted event is written only when the activity completes or its retries are exhausted.

  • Why didn’t the WorkflowTaskScheduled (event 22) time out with a ScheduleToStart when it was tied to the task queue of a worker that was down?

Workflow tasks have no ScheduleToStart timeout on a non-sticky task queue.

Don’t use the host-specific task queue for the child workflow. Workflows are not expected to be tied to specific hosts, because they need to recover if a host goes down. For your use case, routing all the activities of the child workflow to that host is enough.
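A minimal sketch of that routing with the Temporal Java SDK could look like the following. It is an illustration, not the poster's actual code: the interface and class names, the hostTaskQueue parameter, and the 30-second heartbeat timeout are all assumptions, and only waitForProcessing, the 6-hour timeout and the 10 attempts come from the question. The parent is assumed to start the child workflow on the shared task queue and to pass in the host queue name returned by Activity1.

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

public class HostRoutingSketch {

    // Hypothetical activity interface; only waitForProcessing appears in the thread.
    @ActivityInterface
    public interface TranscriptionsActivity {
        void waitForProcessing(String operationName);
    }

    // Hypothetical child workflow interface. The parent passes the host queue name.
    @WorkflowInterface
    public interface ProcessFileWorkflow {
        @WorkflowMethod
        void process(String operationName, String hostTaskQueue);
    }

    // The child workflow itself runs on the shared task queue, so any worker
    // can pick up its workflow tasks if the original host goes down.
    public static class ProcessFileWorkflowImpl implements ProcessFileWorkflow {
        @Override
        public void process(String operationName, String hostTaskQueue) {
            ActivityOptions options = ActivityOptions.newBuilder()
                // Only the activities are pinned to the host that holds the file.
                .setTaskQueue(hostTaskQueue)
                .setStartToCloseTimeout(Duration.ofHours(6))
                // Assumed value; the activity implementation must heartbeat
                // more often than this for the timeout to be useful.
                .setHeartbeatTimeout(Duration.ofSeconds(30))
                .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(10).build())
                .build();

            TranscriptionsActivity activity =
                Workflow.newActivityStub(TranscriptionsActivity.class, options);
            activity.waitForProcessing(operationName);
        }
    }
}
```

With this layout the activity implementation would need to call Activity.getExecutionContext().heartbeat(...) periodically. If the host dies, the heartbeat timeout fails the attempt quickly, and the workflow code can then decide how to recover, for example by downloading the file again on another host.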