Potential deadlock detected: workflow goroutine "root" didn't yield for over a second

Hello all,

Workflows are panicking with the message:

Potential deadlock detected: workflow goroutine “root” didn’t yield for over a second

Stack trace:

           TaskQueue: "task_progress_q"
  WorkerID: "7@7f1b9a937bfa@"
  StackTrace: "process event for task_progress_q [panic]:
  go.temporal.io/sdk/internal.(*coroutineState).call(0xc0063b8320)
  	go.temporal.io/sdk@v1.6.0/internal/internal_workflow.go:881 +0x217
  go.temporal.io/sdk/internal.(*dispatcherImpl).ExecuteUntilAllBlocked(0xc0063b82d0, 0x0, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_workflow.go:966 +0x2f2
  go.temporal.io/sdk/internal.executeDispatcher(0x35d2ea0, 0xc00504e240, 0x35ed620, 0xc0063b82d0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_workflow.go:569 +0xb3
  go.temporal.io/sdk/internal.(*syncWorkflowDefinition).OnWorkflowTaskStarted(0xc00504e200)
  	go.temporal.io/sdk@v1.6.0/internal/internal_workflow.go:542 +0x4a
  go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc0089e1c60, 0xc00267f2c0, 0x875d790100, 0x0, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_event_handlers.go:795 +0x3c4
  go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0081a78c0, 0xc0047c8ff0, 0x4ac0ce0, 0xc0058bd9d0, 0xc0081a78c0, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_task_handlers.go:876 +0x73c
  go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000d40580, 0xc0047c8ff0, 0xc0049d0780, 0x0, 0x0, 0x0, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_task_handlers.go:727 +0x739
  go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc000745930, 0xc0047c8ff0, 0x0, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_task_pollers.go:288 +0x4ae
  go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc000745930, 0x2980340, 0xc0047c8ff0, 0x1, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_task_pollers.go:259 +0x85
  go.temporal.io/sdk/internal.(*baseWorker).processTask(0xc0012c83c0, 0x297ff00, 0xc004072150)
  	go.temporal.io/sdk@v1.6.0/internal/internal_worker_base.go:342 +0xba
  created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
  	go.temporal.io/sdk@v1.6.0/internal/internal_worker_base.go:269 +0xff"
  Namespace: "default"
  RunID: "9bd60fd8-5d9c-4e91-8a5d-5a51d4e26b1e"
  context: {3}
  stack_trace: "Workflow panic: Potential deadlock detected: workflow goroutine "root" didn't yield for over a second
  goroutine 28366 [running]:
  getmega.com/pkg/temporal.(*zapWrapper).Error(0xc00000eb38, 0x3045581, 0xe, 0xc005eba6c0, 0x12, 0x12)
  	getmega.com/pkg@v0.0.0/temporal/logger.go:37 +0x105
  go.temporal.io/sdk/log.(*withLogger).Error(0xc0008f2630, 0x3045581, 0xe, 0xc0087cfd40, 0xc, 0xc)
  	go.temporal.io/sdk@v1.6.0/log/with_logger.go:71 +0xf0
  go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).applyWorkflowPanicPolicy(0xc0081a78c0, 0xc0047c8ff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_task_handlers.go:947 +0x3a9
  go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0081a78c0, 0xc0047c8ff0, 0x4ac0ce0, 0xc0058bd9d0, 0xc0081a78c0, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_task_handlers.go:925 +0x3ff
  go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000d40580, 0xc0047c8ff0, 0xc0049d0780, 0x0, 0x0, 0x0, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_task_handlers.go:727 +0x739
  go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc000745930, 0xc0047c8ff0, 0x0, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_task_pollers.go:288 +0x4ae
  go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc000745930, 0x2980340, 0xc0047c8ff0, 0x1, 0x0)
  	go.temporal.io/sdk@v1.6.0/internal/internal_task_pollers.go:259 +0x85
  go.temporal.io/sdk/internal.(*baseWorker).processTask(0xc0012c83c0, 0x297ff00, 0xc004072150)
  	go.temporal.io/sdk@v1.6.0/internal/internal_worker_base.go:342 +0xba
  created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
  	go.temporal.io/sdk@v1.6.0/internal/internal_worker_base.go:269 +0xff

The underlying workflows are long-running, e.g. around 50 days.

Workflow code:

     func Workflow(ctx workflow.Context, pTask *models.PlayerTask) (models.TaskState, error) {
     	logger := workflow.GetLogger(ctx)
     
     	actx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
     		StartToCloseTimeout: ActivityStartToCloseTimeout,
     		RetryPolicy:         defaultRetryPolicy,
     	})
     	if pTask.State.IsVisible() {
     		//sleep till active
     		if err := workflow.Sleep(ctx, pTask.ActiveTime.Sub(workflow.Now(ctx))); err != nil {
     			return "", errors.WithStack(err)
     		}
     		pTask.State = models.TaskStateActive
     		if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).CustomActivityDBUpdate0, pTask).Get(ctx, nil); err != nil {
     			return "", errors.WithStack(err)
     		}
     	}
     
     	// updating current tracking metrics for filtering the task for datapoint creation
     	currentTracking := some logic // pseudocode: the actual computation is omitted here
     	if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).CustomActivityDBUpdate1, pId, tId, currentTracking).Get(ctx, nil); err != nil {
     		return "", errors.WithStack(err)
     	}
     	if pTask.State.IsActive() {
     
     		// selector, similar to select in go
     		s := workflow.NewSelector(ctx)
     
     		// channel on which task is listening for the progress update
     		signalChan := workflow.GetSignalChannel(ctx, signalChannel)
     		var progress *models.UpdateHolder
     
     		// infinite for loop which will break only when task is expired or completed
     		for {
     
     			// listener on the progress update channel
     			s.AddReceive(signalChan, func(c workflow.ReceiveChannel, more bool) {
     				c.Receive(ctx, &progress)
     				if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).CustomActivityDBUpdate2, pId, tId).Get(ctx, &pTask); err != nil {
     					return
     				}
     				if pTask == nil {
     					return
     				}
     				tMetric := pTask.SingleActionProps.TrackingMetric
     
     				prgrs.UpdateProgress(progress, tMetric,pTask)
     
     				if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).CustomActivityDBUpdate3, pTask, prgrs).Get(ctx, nil); err != nil {
     					return
     				}
     
     				if pTask.State.IsComplete() {
     
     					if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).CustomActivityDBUpdate4, pTask).Get(ctx, nil); err != nil {
     						return
     					}
     					if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).CustomActivityDBUpdate2, pId, tId).Get(ctx, &pTask); err != nil {
     						return
     					}
     					if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).CustomActivityDBUpdate0, pTask).Get(ctx, nil); err != nil {
     						return
     					}
     				}
     
     			}).AddFuture(workflow.NewTimer(ctx, pTask.ExpireTime.Sub(workflow.Now(ctx))), func(f workflow.Future) {
     				pTask.State = models.TaskStateExpired
     				if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).CustomActivityDBUpdate0, pTask).Get(ctx, nil); err != nil {
     					return
     				}
     			})
     
     			// blocked until a update signal received or task expired
     			s.Select(ctx)
     
     			if pTask.State.IsComplete() || pTask.State.IsExpired() {
     				break
     			}
     		}
     
     	}
     	return pTask.State, nil
     }

Logic:

The workflow tracks a task's progress for a particular user based on their activities. Progress is recorded via signals, and if the allotted time is exceeded, the task expires.

The underlying workflows are still running as expected for the user tasks, but they occasionally log this error.

What could be the reason behind this?

Version: 1.8.2

Thanks in advance,
Junaid

Does your Workflow code invoke blocking function(s) that are not part of the Temporal SDK APIs?

This indicates that the workflow thread didn't yield control through a Temporal API call for over a second. The most common reasons for this are listed below (a short illustrative sketch follows the list):

  • Spinning loop in the workflow code
  • External blocking API calls in the workflow code or in the DataConverter
  • Using non-Temporal APIs for synchronization
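
For illustration only (this is not code from the thread, and the function names are invented), a minimal sketch of the difference between blocking the workflow goroutine natively and yielding through the SDK:

    package sample

    import (
    	"time"

    	"go.temporal.io/sdk/workflow"
    )

    // badWorkflow never yields to the Temporal dispatcher: time.Sleep blocks the
    // goroutine outside the SDK, so the deadlock detector panics after roughly a
    // second. A tight `for {}` loop with no SDK call inside has the same effect.
    func badWorkflow(ctx workflow.Context) error {
    	time.Sleep(5 * time.Second) // blocking call that is not a Temporal API
    	return nil
    }

    // goodWorkflow yields through the SDK: workflow.Sleep parks the goroutine via
    // the dispatcher, so the detector is never triggered.
    func goodWorkflow(ctx workflow.Context) error {
    	return workflow.Sleep(ctx, 5*time.Second)
    }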

Thanks for the reply, @maxim and @tihomir.

We are not using any blocking third-party APIs other than Temporal SDK.

I suspect your first point could be the problem… we use a for loop to update progress using signals.

Let's say we need 10 signals to complete the task (a channel receiver added on the selector) and a 10-minute expiry (a future added on the selector with a workflow timer).

So it loops around 10 times and completes the task once 10 signals are received, or expires after 10 minutes.

s := workflow.NewSelector(ctx)

// channel on which task is listening for the progress update
signalChan := workflow.GetSignalChannel(ctx, signalChannel)

for {
  // listener on the progress update channel; multiple progress signals are needed for task completion
  s.AddReceive(signalChan, func(c workflow.ReceiveChannel, more bool) {
    c.Receive(ctx, &progress)
    if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).UpdateTaskProgress, pTask, progress).Get(ctx, nil); err != nil {
      return
    }
  // wait for the task to expire 
  }).AddFuture(workflow.NewTimer(ctx, pTask.ExpireTime.Sub(workflow.Now(ctx))), func(f workflow.Future) {
    pTask.State = models.TaskStateExpired
    if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).UpdateTaskState, pTask).Get(ctx, nil); err != nil {
      return
    }
  })

  // blocked until a update signal received or task expired
  s.Select(ctx)
  if pTask.State.IsComplete() || pTask.State.IsExpired() {
    break
  }
}

This is happening only for a few workflow instances.

Please help me clarify these doubts:

  • If this loop is the issue, how can we avoid it?

  • Most importantly, will this affect the expected behavior of the workflows, or will it skip some workflow processing? We can see that the workflow comes back and gets blocked on the Selector as expected. From our initial analysis, no workflows show unexpected behavior other than not stopping.

  • Will the panic occurrences be reduced if we increase the computational power of the workers?

Regards,
Junaid

My guess is that the problem is that you reuse the same Selector instance and keep adding the signalChannel branch to it. The newly added branch matches immediately, so the for loop keeps spinning.

So either move the s.Add… calls out of the loop or create a new Selector on every iteration (a rough sketch of the first option follows).
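
For illustration (a sketch only, reusing the variable names from the snippets above and eliding the activity calls), the first option registers both branches once, before the loop:

    s := workflow.NewSelector(ctx)
    signalChan := workflow.GetSignalChannel(ctx, signalChannel)

    // Register both branches a single time, so the Selector does not accumulate
    // a duplicate signal branch on every iteration of the loop.
    s.AddReceive(signalChan, func(c workflow.ReceiveChannel, more bool) {
    	c.Receive(ctx, &progress)
    	// ... run the progress-update activities here ...
    }).AddFuture(workflow.NewTimer(ctx, pTask.ExpireTime.Sub(workflow.Now(ctx))), func(f workflow.Future) {
    	pTask.State = models.TaskStateExpired
    	// ... run the expiry activity here ...
    })

    for {
    	// Only Select inside the loop; the branches registered above are reused
    	// each time a signal arrives or the timer fires.
    	s.Select(ctx)
    	if pTask.State.IsComplete() || pTask.State.IsExpired() {
    		break
    	}
    }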

I doubt that is the problem; from the logs I can see the branches are not matching immediately:

 1   2021-06-18 09:08:08.000 IST  "NewTimer"
 2   Debug  2021-06-18 09:09:58.000 IST  "ExecuteActivity"
 3   Debug  2021-06-18 09:09:58.000 IST  "ExecuteActivity"
 4   Debug  2021-06-18 09:09:58.000 IST  "NewTimer"
 5   Debug  2021-06-18 09:21:41.000 IST  "ExecuteActivity"
 6   Debug  2021-06-18 09:21:46.000 IST  "ExecuteActivity"
 7   Debug  2021-06-18 09:21:46.000 IST  "NewTimer"
 8   Error  2021-06-18 09:29:49.000 IST  "Workflow panic"
 9   Warning 2021-06-18 09:29:49.000 IST  "Failed to process workflow task."
10   Debug  2021-06-18 09:29:59.000 IST  "ExecuteActivity"
11   Debug  2021-06-18 09:30:03.000 IST  "ExecuteActivity"
12   Debug  2021-06-18 09:30:05.000 IST  "NewTimer"
13   Debug  2021-06-18 09:33:17.000 IST  "ExecuteActivity"
14   Debug  2021-06-18 09:33:17.000 IST  "ExecuteActivity"
15   Debug  2021-06-18 09:33:17.000 IST  "NewTimer"

7 Debug 2021-06-18 09:21:46.000 IST "NewTimer"
8 Error 2021-06-18 09:29:49.000 IST "Workflow panic"

At log line 8, we can see it had been waiting for the signal since 2021-06-18 09:21:46.000 IST,

and when it received the signal around 2021-06-18 09:29:49.000 IST, it panicked.

Two activities run for each signal that is processed.

We tried using a new Selector on every iteration… that also does not seem to work.

We will analyze some patterns and narrow down the possibilities…

One way to troubleshoot this problem is to set a breakpoint in the deadlock detector code and see which workflow goroutine is not returning control.

In the Java SDK, the deadlock detector message contains a thread dump of all of the workflow threads, which makes such troubleshooting trivial. Unfortunately, Go doesn't expose any APIs to print goroutine stack traces from outside.
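
As a general Go note (not something suggested in this thread, just a standard-library facility): a coarse whole-process goroutine dump can still be captured around the point where the panic is logged, even though it cannot single out the workflow goroutine the way the Java SDK's thread dump does. A minimal sketch:

    package sample

    import (
    	"os"
    	"runtime"
    )

    // dumpAllGoroutines writes the stack traces of every goroutine in the process
    // to stderr. Calling it when the "Workflow panic" error is observed at least
    // shows what the whole process was doing at that moment.
    func dumpAllGoroutines() {
    	buf := make([]byte, 1<<20) // 1 MiB buffer; a truncated dump is still useful
    	n := runtime.Stack(buf, true)
    	os.Stderr.Write(buf[:n])
    }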