I am trying to understand how Temporal handles failures of the worker nodes that run workflows and activities.
I created an activity with an artificial delay of 500 ms and executed it 30 times in a workflow. These are the ActivityOptions I used:
ActivityOptions options = ActivityOptions.newBuilder()
        .setScheduleToCloseTimeout(Duration.ofSeconds(5))
        .setTaskQueue(Constants.TASK_QUEUE)
        .build();
Since I set a timeout of 5 seconds for the activity, I expected Temporal to take 5 seconds to detect a node failure and then rerun the activity automatically on a different node, without requiring any retry configuration. Please let me know if my understanding is correct.
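The timing in this setup can be sketched in plain Java, with no Temporal SDK calls. This is only a rough model: the class `TimeoutBudget` and the method `retryBudget` are hypothetical names, the 30 activities are assumed to run one after another, and the sketch assumes two things that should be checked against the Temporal documentation: that the schedule-to-close timeout covers all attempts of an activity, and that a lost attempt is only noticed when a per-attempt timeout fires, which, if not set separately, is taken to equal the schedule-to-close value.

```java
import java.time.Duration;

public class TimeoutBudget {

    // Time left for another attempt once a lost attempt has been detected.
    // Assumption: detection happens only when the per-attempt timeout fires,
    // and the schedule-to-close timeout spans all attempts.
    static Duration retryBudget(Duration scheduleToClose, Duration attemptTimeout) {
        Duration left = scheduleToClose.minus(attemptTimeout);
        return left.isNegative() ? Duration.ZERO : left;
    }

    public static void main(String[] args) {
        Duration activityDelay = Duration.ofMillis(500);   // artificial delay from the test
        int activitiesPerWorkflow = 30;                     // executions per workflow
        Duration scheduleToClose = Duration.ofSeconds(5);   // value from ActivityOptions

        // Assumption: the activities run sequentially within one workflow.
        Duration sequentialTotal = activityDelay.multipliedBy(activitiesPerWorkflow);
        System.out.println("Sequential activity time per workflow: "
                + sequentialTotal.toMillis() + " ms");

        // Assumption: no separate per-attempt timeout, so it equals scheduleToClose.
        System.out.println("Retry budget with a 5 s attempt timeout: "
                + retryBudget(scheduleToClose, scheduleToClose).toMillis() + " ms");

        // Hypothetical shorter per-attempt timeout, for comparison.
        System.out.println("Retry budget with a 2 s attempt timeout: "
                + retryBudget(scheduleToClose, Duration.ofSeconds(2)).toMillis() + " ms");
    }
}
```

Under these assumptions, a 5-second per-attempt timeout inside a 5-second overall deadline leaves no time for a second attempt, whereas a shorter per-attempt timeout would leave some. Whether this matches Temporal's actual behavior is exactly what I would like to confirm.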
In my test setup, I started with 2 worker nodes and executed 10 activities. In the middle of the execution, I brought down 1 worker node. I found that 4 workflows completed successfully and 6 workflows failed with the following error. I am not sure what I am missing here: as per my understanding, all the workflows should have completed successfully, since one worker node was still available.