Worker does not start activity after restart

Bar_Deutsch · May 19, 2021, 2:39pm

Hi everyone, I’ve been doing some learning tests to figure out if Temporal is a good fit for us.

I’ve been testing failure handling, for example the case where a worker shuts down in the middle of running an activity. If I have a second worker running, everything works as expected: the activity is retried on the second worker and completes successfully.

My issue is with the case where I have only one worker running. The order of actions is as follows:

1. run worker
2. run workflow with an activity that takes ~2sec
3. stop worker
4. start worker
5. wait for workflow to finish

From looking at the logs and the Temporal UI I can see that the worker is not picking up the activity, and we end up with only 2 attempts (the second one isn’t actually running) and a NonRetryableFailure.

after the timeout:

Then the workflow continues running without anything happening (unless we terminate it or define a timeout for it).

My questions are:

1. Is this the desired behaviour? Meaning, if I don’t have a worker running to run the activity, will it fail or stay as a “zombie”?
2. Is there any configuration that can change it?

Thanks!

maxim · May 19, 2021, 2:56pm

Could you post the ActivityTaskScheduled event? My guess is that you specified ScheduleToStart timeout and activity fails with it. ScheduleToStart is not retryable.

How do you handle the activity failure in your workflow code? By default, if an unknown exception is thrown from the workflow function, the workflow is not failed but blocked by constantly retrying workflow tasks. In this case the history is going to look exactly as in your snapshot. The workflow task retries are not written into the history, to avoid growing it with each retry, which is why the last event you see in it is WorkflowTaskScheduled.

Bar_Deutsch · May 19, 2021, 4:53pm

My activity is very simple: it logs something, then sleeps for 1 sec (enough time for me to stop the worker in the middle).
If the activity returns an error, the workflow returns that error:

err := workflow.ExecuteActivity(ctx, SleepActivity, input).Get(ctx, &result)
if err != nil {
	return err
}

activity options are:

ao := workflow.ActivityOptions{
	TaskQueue:              TaskQueue,
	ScheduleToCloseTimeout: time.Second * 100,
	StartToCloseTimeout:    time.Second * 10,
	HeartbeatTimeout:       time.Second * 10,
	WaitForCancellation:    false,
}

workflow options are the default ones (no timeout).

The thing is, I am getting a ScheduleToStart timeout just like you said, but after only a single attempt.
I can see in my logs and in the Temporal UI that the second attempt didn’t actually run (it should succeed); it’s as if the worker does not pull the task.

By the way, if I have another worker running, it takes the second attempt and succeeds.

maxim · May 19, 2021, 5:23pm

What does UI show in the summary page about that activity while the workflow is running?

Bar_Deutsch · May 19, 2021, 6:08pm

maxim · May 19, 2021, 7:48pm

It shows attempt 2 in the scheduled state. So the task is never picked up after the retry.

Bar_Deutsch · May 19, 2021, 8:41pm

Well, there is exactly 1 attempt; the second one is only scheduled but never picked up by the worker.
These are the last two log lines. The first one is printed inside the activity, and there are no other logs after that, only the timeout error (even when the timeout is set to a long period).

2021/05/19 20:07:05 INFO  After sleep. Namespace default TaskQueue your-simple-task-queue WorkerID 21342@Bars-MacBook-Pro.local@ ActivityID 5 ActivityType SleepActivity Attempt 1 WorkflowType SleepWorkflow WorkflowID eae4adf4-272c-46e2-9a20-ebb4cff18dd0 RunID 06089ec9-26e3-486b-bd58-4a656ca12bf1
2021/05/19 20:07:05 INFO  Task processing failed with error Namespace default TaskQueue your-simple-task-queue WorkerID 21342@Bars-MacBook-Pro.local@ WorkerType ActivityWorker Error worker stopping

As I wrote earlier, if we have another worker I can see attempt number 2 is successful.

maxim · May 19, 2021, 9:59pm

So my guess is that your first worker doesn’t start correctly if it fails to poll on the task queue.

maxim · May 19, 2021, 10:31pm

You can click on the task queue name in the UI to see all the workers connected to the task queue.

Bar_Deutsch · May 20, 2021, 7:56am

Thanks for the help!

This is what I’m seeing:

If the time hasn’t been updated since the workflow started, does this mean the worker isn’t actually polling the queue?

Bar_Deutsch · May 20, 2021, 10:58am

I’ve tried another flow that confirms my worries:

1. 2 workers are running
2. a workflow starts
3. one worker is stopped in the middle, then restarted
4. the second worker picks up and finishes the workflow started at #2
5. after the first worker is restarted (at #3) we start a new workflow

expected: the worker that restarted successfully will pick up the new workflow and run it
actual: nothing happens, task is timed out and it looks like we have “zombie” workers.

maxim · May 20, 2021, 3:27pm

Yes, this is correct. The timestamp is of the last poll call.

expected: the worker that restarted successfully will pick up the new workflow and run it
actual: nothing happens, task is timed out and it looks like we have “zombie” workers.

I’m not sure how your workers are implemented. But it looks like they don’t start correctly after the restart as they are not polling on the task queues.

Bar_Deutsch · May 20, 2021, 4:27pm

I am actually stopping the workers by calling worker.Stop() then worker.Start().
The workers are simple ones just like in the samples in the documentation.

maxim · May 20, 2021, 4:28pm

Try restarting the process. I wouldn’t be surprised if Stop and Start have some issues when called from the same process multiple times. This is not a very frequently executed code path.

Bar_Deutsch · May 23, 2021, 8:20am

Just wanted to say thanks for the answers. Doing the same test while stopping and restarting the whole worker process worked. It seems to be an issue with the stop/start functions, which leave a worker that is “running” but not taking any tasks.

Thanks!

maxim · May 23, 2021, 4:22pm

Were you calling stop and start on the same worker object or creating a new one after the stop?

Bar_Deutsch · May 24, 2021, 6:44am

I was calling worker.Start() after calling worker.Stop()

maxim · May 24, 2021, 3:48pm

A worker cannot be started after it was stopped, so a Start call after a Stop is ignored. Filed an issue for Start to fail in this case.
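
One way to restart a worker inside the same process is to build a fresh worker object on every restart instead of reusing the stopped one. The sketch below is stdlib-only and makes several assumptions: Lifecycle, Restarter, and NewRestarter are hypothetical names, and toyWorker is a toy model of the behaviour described in this thread (Start after Stop silently does nothing), not the SDK's implementation. The Lifecycle interface only asks for Start() error and Stop(), which the Go SDK's worker.Worker also provides.

```go
package main

import (
	"errors"
	"fmt"
)

// Lifecycle is the subset of worker methods this sketch relies on.
type Lifecycle interface {
	Start() error
	Stop()
}

// Restarter (hypothetical helper) creates a new worker for every Start,
// so a stopped worker object is never started again.
type Restarter struct {
	factory func() Lifecycle
	current Lifecycle
}

func NewRestarter(factory func() Lifecycle) *Restarter {
	return &Restarter{factory: factory}
}

// Start builds a fresh worker from the factory and starts it.
func (r *Restarter) Start() error {
	if r.current != nil {
		return errors.New("worker already started")
	}
	w := r.factory()
	if err := w.Start(); err != nil {
		return err
	}
	r.current = w
	return nil
}

// Stop stops the current worker, if any, and discards it.
func (r *Restarter) Stop() {
	if r.current == nil {
		return
	}
	r.current.Stop()
	r.current = nil
}

// toyWorker models the behaviour described in the thread:
// once stopped, a later Start is silently ignored.
type toyWorker struct {
	polling bool
	stopped bool
}

func (w *toyWorker) Start() error {
	if w.stopped {
		return nil // ignored: the object never polls again
	}
	w.polling = true
	return nil
}

func (w *toyWorker) Stop() {
	w.polling = false
	w.stopped = true
}

func main() {
	// Reusing the same object after Stop: it does not poll again.
	reused := &toyWorker{}
	_ = reused.Start()
	reused.Stop()
	_ = reused.Start()
	fmt.Println("reused object polling:", reused.polling) // prints "reused object polling: false"

	// Building a new object on every restart: the new one polls.
	var latest *toyWorker
	r := NewRestarter(func() Lifecycle {
		latest = &toyWorker{}
		return latest
	})
	_ = r.Start()
	r.Stop()
	if err := r.Start(); err != nil {
		fmt.Println("restart failed:", err)
		return
	}
	fmt.Println("fresh object polling:", latest.polling) // prints "fresh object polling: true"
}
```

With the real SDK, the factory would presumably call worker.New with the existing client and task queue name and register the workflows and activities again before returning the new worker. Restarting the whole process, as suggested earlier in the thread, achieves the same result.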
