Worker does not start activity after restart

Hi everyone, I’ve been doing some learning tests to figure out if Temporal is a good fit for us.

I’ve been testing failure handling, for example the case where a worker shuts down in the middle of running an activity. If I have a second worker running, everything works as expected: the activity retries on the second worker and completes successfully.

My issue is the case where only one worker is running. The order of actions is as follows:

  1. run worker
  2. run workflow with activity that takes ~2sec
  3. stop worker
  4. start worker
  5. wait for workflow to finish

From the logs and the Temporal UI I can see that the worker is not picking up the activity, and we end up with only 2 attempts (the second one never actually runs) and a NonRetryableFailure.

after the timeout:

Then the workflow continues running without anything happening (unless we terminate or define timeout for it).

My questions are:

  1. Is this the desired behaviour? Meaning if I don’t have a worker running to run the activity it will fail/stay as “zombie”?
  2. Is there any configuration that can change it?

Thanks!

Could you post the ActivityTaskScheduled event? My guess is that you specified ScheduleToStart timeout and activity fails with it. ScheduleToStart is not retryable.

How do you handle the activity failure in your workflow code? By default, if an unknown exception is thrown from the workflow function, the workflow is not failed but is blocked, constantly retrying workflow tasks. In this case the history is going to look exactly as in your snapshot. The workflow task retries are not written into the history, to avoid growing it with each retry; that is why the last event you see in it is WorkflowTaskScheduled.

My activity is very simple, it logs something then sleeps for 1 sec (enough time for me to stop the worker in the middle).
If the activity returns an error, the workflow returns the error:

err := workflow.ExecuteActivity(ctx, SleepActivity, input).Get(ctx, &result)
if err != nil {
	return err
}

activity options are:

ao := workflow.ActivityOptions{
	TaskQueue:              TaskQueue,
	ScheduleToCloseTimeout: time.Second * 100,
	StartToCloseTimeout:    time.Second * 10,
	HeartbeatTimeout:       time.Second * 10,
	WaitForCancellation:    false,
}

workflow options are the default ones (no timeout).

Thing is, I am getting a ScheduleToStart timeout just like you said, but after only a single attempt.
I can see in my logs and in the Temporal UI that the second attempt, which should succeed, never actually ran. It’s as if the worker does not pull the task.

By the way, if I have another worker running, it takes the second attempt and succeeds.

What does UI show in the summary page about that activity while the workflow is running?

It shows attempt 2 in the scheduled state. So the task is never picked up after the retry.

Well, there is exactly 1 attempt, the second one is only scheduled but never picked up by the worker.
These are the last two log lines. The first is printed inside the activity; there are no other logs after that, only the timeout error (even when the timeout is set to a long period).

2021/05/19 20:07:05 INFO  After sleep. Namespace default TaskQueue your-simple-task-queue WorkerID 21342@Bars-MacBook-Pro.local@ ActivityID 5 ActivityType SleepActivity Attempt 1 WorkflowType SleepWorkflow WorkflowID eae4adf4-272c-46e2-9a20-ebb4cff18dd0 RunID 06089ec9-26e3-486b-bd58-4a656ca12bf1
2021/05/19 20:07:05 INFO  Task processing failed with error Namespace default TaskQueue your-simple-task-queue WorkerID 21342@Bars-MacBook-Pro.local@ WorkerType ActivityWorker Error worker stopping

As I wrote earlier, if we have another worker I can see attempt number 2 is successful.

So my guess is that your first worker doesn’t start correctly if it fails to poll on the task queue.

You can click on the task queue name in the UI to see all the workers connected to the task queue.

Thanks for the help!

This is what I’m seeing:

If the time hasn’t updated since the workflow started, does that mean the worker isn’t actually polling the queue?

I’ve tried another flow that’s confirming my worries:

  1. 2 workers are running
  2. workflow starts
  3. one worker is stopped in the middle, then restarted
  4. second worker picks up and finishes the workflow started at #2
  5. after the first worker is restarted (at #3) we start a new workflow

expected: the worker that restarted successfully will pick up the new workflow and run it
actual: nothing happens, task is timed out and it looks like we have “zombie” workers.

Yes, this is correct. The timestamp is of the last poll call.

I’m not sure how your workers are implemented. But it looks like they don’t start correctly after the restart as they are not polling on the task queues.

I am actually stopping the workers by calling worker.Stop() then worker.Start().
The workers are simple ones just like in the samples in the documentation.

Try restarting the process. I wouldn’t be surprised if Stop and Start have some issues when called from the same process multiple times. This is not a frequently executed code path.

Just wanted to say thanks for the answers. Doing the same test while stopping and restarting the whole worker process worked. It seems to be an issue with the Stop/Start functions, which leave a worker that is “running” but not taking any tasks.

Thanks!

Were you calling stop and start on the same worker object or creating a new one after the stop?

I was calling worker.Start() after calling worker.Stop()

A worker cannot be started after it has been stopped, so a Start after the Stop is ignored. Filed an issue for Start to fail in this case.