Hi,
One of our workflows works with files, so we use a host-specific task queue to make the activities and child workflows execute on the same worker.
The workflow is defined by:
- Activity1. Download the file.
- ChildWorkflow. Process the file.
- Activity2. Upload the file.
Activity1 can be picked up by any worker, and the following steps are tied to that host's specific task queue.
Note that we use a ChildWorkflow in the 2nd step to process the file, and it has several steps of its own.
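To make the routing described above concrete, here is a minimal plain-Java model of it. This is not Temporal SDK code: the class HostRoutingSketch, the Worker type, and the queue names are all hypothetical and only illustrate the idea that the download can run on any worker, while later steps stay queued when the worker that owns the host-specific queue is down.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of host-specific task queue routing (hypothetical, not Temporal SDK code). */
final class HostRoutingSketch {

    /** Name of the queue that every worker is assumed to poll. */
    static final String SHARED_QUEUE = "shared";

    /** A worker polls the shared queue plus its own host-specific queue. */
    static final class Worker {
        final String hostQueue;
        boolean running = true;

        Worker(String hostQueue) {
            this.hostQueue = hostQueue;
        }
    }

    // LinkedHashMap keeps registration order, so the model is deterministic.
    private final Map<String, Worker> workersByHostQueue = new LinkedHashMap<>();
    final List<String> log = new ArrayList<>();

    void register(Worker worker) {
        workersByHostQueue.put(worker.hostQueue, worker);
    }

    /** Step 1: any running worker may pick the download; it reports its host queue. */
    String download() {
        for (Worker worker : workersByHostQueue.values()) {
            if (worker.running) {
                log.add("download on " + worker.hostQueue);
                return worker.hostQueue;
            }
        }
        throw new IllegalStateException("no running worker polls " + SHARED_QUEUE);
    }

    /**
     * Steps 2 and 3: only the worker that owns hostQueue can pick the task up.
     * Returns false when the task stays queued because that worker is down.
     */
    boolean dispatch(String step, String hostQueue) {
        Worker worker = workersByHostQueue.get(hostQueue);
        if (worker == null || !worker.running) {
            log.add(step + " pending on " + hostQueue);
            return false;
        }
        log.add(step + " on " + hostQueue);
        return true;
    }

    public static void main(String[] args) {
        HostRoutingSketch cluster = new HostRoutingSketch();
        Worker worker = new Worker("host-a");
        cluster.register(worker);

        String hostQueue = cluster.download();
        cluster.dispatch("process", hostQueue);
        worker.running = false; // simulate stopping the host-specific worker
        cluster.dispatch("upload", hostQueue);

        cluster.log.forEach(System.out::println);
    }
}
```

In this model, a task sent to the queue of a stopped worker is never picked up by another host, which is the situation we created in the test described next.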
While testing how it recovers from worker failures, we found something that we don't understand.
The history shows the ChildWorkflow (tied to a specific host through a task queue). We stopped the host-specific worker while the TimerStarted and ActivityTaskStarted events were still outstanding.
...
final var promise = Async.procedure(() -> transcriptionsActivity.waitForProcessing(operationName));
try {
    promise.get(10, TimeUnit.MINUTES);
} catch (TimeoutException te) {
    throw Activity.wrap(te);
}
...
If I understand correctly, what's happening behind the scenes is:
- The promise doesn't finish in 10 minutes, so it triggers the TimerFired event.
- The activity is configured with 10 attempts, so a workflow task is scheduled (sticky first) and, since the worker is down, it times out with ScheduleToStart.
- Since it was sticky, it is scheduled again on the shared task queue, which in this case is the host-specific task queue, whose worker is also down.
The questions I have are:
- Why doesn't the ActivityTaskStarted (event 23) time out when it has a startToCloseTimeout of 6 hours configured? (More than 24 hours have passed.)
- Why is the ActivityTaskStarted (event 23) shown as the last event when it happened right after the ActivityTaskScheduled (event 18)?
- Why didn't the WorkflowTaskScheduled (event 22) time out with a ScheduleToStart when it was tied to the task queue of a worker that was down?
Thanks!