Hi everyone, I’ve been doing some learning tests to figure out whether Temporal is a good fit for us.
I’ve been testing failure handling, for example a worker shutting down in the middle of running an activity. If a second worker is running, everything works as expected: the activity retries on the second worker and completes successfully.
My issue now is the case where I only have one worker. The order of actions is as follows:
1. run worker
2. run workflow with activity that takes ~2sec
3. stop worker
4. start worker
5. wait for workflow to finish
From the logs and the Temporal UI I can see that the worker never picks up the activity again, and we end up with only 2 attempts (the second one never actually runs) and a NonRetryableFailure.
Could you post the ActivityTaskScheduled event? My guess is that you specified a ScheduleToStart timeout and the activity fails with it. A ScheduleToStart timeout is not retryable.
How do you handle the activity failure in your workflow code? By default, if an unknown exception is thrown from the workflow function, the workflow is not failed but blocked, with its workflow task being retried constantly. In this case the history is going to look exactly as in your snapshot. The workflow task retries are not written into the history, to avoid growing it with each retry; that’s why the last event you see in it is WorkflowTaskScheduled.
My activity is very simple: it logs something, then sleeps for 1 sec (enough time for me to stop the worker in the middle).
If the activity returns an error, the workflow returns the error:
Workflow options are the default ones (no timeout).
The thing is, I am getting a ScheduleToStart timeout just like you said, but after only a single attempt.
I can see in my logs and the Temporal UI that the second attempt didn’t actually run (it should have succeeded); it’s as if the worker does not pull the task.
By the way, if I have another worker running, it takes the second attempt and succeeds.
Well, there is exactly 1 attempt; the second one is only scheduled but never picked up by the worker.
These are the last two logs: the first row is printed inside the activity, and there are no other logs after that, only the timeout error (even when the timeout is set to a long period).
I’ve tried another flow that confirms my worries:
1. 2 workers are running
2. workflow starts
3. one worker is stopped in the middle, then restarted
4. second worker picks up and finishes the workflow started at #2
5. after the first worker is restarted (at #3) we start a new workflow
expected: the worker that restarted successfully will pick up the new workflow and run it
actual: nothing happens, the task times out, and it looks like we have “zombie” workers.
Yes, this is correct. The timestamp is of the last poll call.
expected: the worker that restarted successfully will pick up the new workflow and run it
actual: nothing happens, task is timed out and it looks like we have “zombie” workers.
I’m not sure how your workers are implemented, but it looks like they don’t start correctly after the restart, as they are not polling the task queues.
I am actually stopping the workers by calling worker.Stop() then worker.Start().
The workers are simple ones just like in the samples in the documentation.
Try restarting the process. I wouldn’t be surprised if Stop and Start have some issues when called from the same process multiple times. This is not a frequently executed code path.
Just wanted to say thanks for the answers. Doing the same test while stopping and restarting the whole worker process worked. It seems to be an issue with the stop/start functions, which create a worker that is “running” but not taking any tasks.
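For anyone wondering how a worker can be “running” but never take a task, here is one hypothetical way that state can arise, sketched in plain Go. This is not the Temporal SDK's code and not a confirmed diagnosis; `toyWorker`, `Poll`, and `queue` are names made up for this illustration. The assumption is that Stop closes a channel which a later restart reuses, so the poll loop exits before taking anything. A freshly constructed worker (or a restarted process) does not have this problem.

```go
package main

import "fmt"

// toyWorker is a hypothetical worker used only for illustration.
type toyWorker struct {
	stopCh chan struct{}
}

func newToyWorker() *toyWorker {
	return &toyWorker{stopCh: make(chan struct{})}
}

// Stop signals shutdown by closing stopCh. The channel is never recreated.
func (w *toyWorker) Stop() { close(w.stopCh) }

// Poll takes tasks until the queue is closed or the worker is stopped,
// and returns the tasks it handled.
func (w *toyWorker) Poll(tasks <-chan string) []string {
	var handled []string
	for {
		// Check the stop signal first so a stopped worker never takes a task.
		select {
		case <-w.stopCh:
			return handled
		default:
		}
		select {
		case <-w.stopCh:
			return handled
		case t, ok := <-tasks:
			if !ok {
				return handled
			}
			handled = append(handled, t)
		}
	}
}

// queue builds a closed, pre-filled task queue.
func queue(items ...string) <-chan string {
	ch := make(chan string, len(items))
	for _, it := range items {
		ch <- it
	}
	close(ch)
	return ch
}

func main() {
	reused := newToyWorker()
	reused.Stop()
	// "Restarting" the same instance: stopCh is still closed.
	fmt.Println(len(reused.Poll(queue("a", "b")))) // prints 0

	// A fresh instance, analogous to restarting the whole process.
	fresh := newToyWorker()
	fmt.Println(len(fresh.Poll(queue("a", "b")))) // prints 2
}
```

In this toy model the reused worker returns without handling anything, while the fresh one drains the queue, which is consistent with the observation that restarting the whole worker process fixed the test.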