Workflow Handler stopped

Hello,

We are having some issues with our workers. The environment is used for tests, so the load is very limited. We need some guidance on how to check what happened.

On Friday we noticed that some Workflows were “stuck” in activities that shouldn’t fail (the activities do not depend on external APIs yet, so they shouldn’t fail or time out). During the weekend, the task was finally processed, but it took 13h to complete:

As seen in the above image, the activity had 14 attempts before completing.

The number of attempts makes sense given the 1h scheduleToStartTimeout and the 1h startToCloseTimeout, but the reason for being “stuck” is a mystery.

This morning we continued searching for an answer and noticed that new workflows are not being processed by the worker.

It seems that the worker is not handling Workflow Tasks.

We haven’t tried restarting the worker yet, as I want to understand why this is happening.

Using tctl, we have the following for the queue:

Thanks for the help,
Pedro Almeida

These look like two separate issues.

I don’t think we can help with the activity timeout. You have to understand what causes it. I would recommend looking into the activity worker logs to see why the activity never completes. Apparently, its threads are stuck on something, so a thread dump might help to troubleshoot.

As for the second issue:
How do you initialize your worker? Do you have a single worker for both activities and workflows?

Hello Maxim,

We create the worker this way:

    worker.SetStickyWorkflowCacheSize(2500)

    w := worker.New(c.Client, "niko-worker", worker.Options{
        MaxConcurrentActivityExecutionSize:      0,
        WorkerActivitiesPerSecond:               50000,
        MaxConcurrentLocalActivityExecutionSize: 0,
        WorkerLocalActivitiesPerSecond:          0,
        TaskQueueActivitiesPerSecond:            0,
        MaxConcurrentActivityTaskPollers:        0,
        MaxConcurrentWorkflowTaskExecutionSize:  500,
        MaxConcurrentWorkflowTaskPollers:        0,
        EnableLoggingInReplay:                   false,
        DisableStickyExecution:                  false,
        StickyScheduleToStartTimeout:            0,
        BackgroundActivityContext:               nil,
        NonDeterministicWorkflowPolicy:          0,
        WorkerStopTimeout:                       0,
        EnableSessionWorker:                     false,
        MaxConcurrentSessionExecutionSize:       0,
        WorkflowInterceptorChainFactories:       nil,
    })

    w.RegisterWorkflow(workflowA)
    ... ( register all workflows )
    w.RegisterActivity(activityA)
    ... ( register all activities )

This is the only worker.

Thanks again for the help.

My guess is that setting MaxConcurrentWorkflowTaskPollers to 0 disables the poller.

I would also be careful with MaxConcurrentActivityExecutionSize, which might choke the activity execution pool.
I would recommend omitting properties that you don’t plan to change.

Thanks for the feedback @maxim,

We restarted the worker with all default options. So far we don’t see any improvement.

What is the meaning of the PENDING_ACTIVITY_STATE_STARTED status?

Is the activity task being processed by the worker or is the activity task waiting for the worker to start processing it?

PENDING_ACTIVITY_STATE_STARTED means that the activity was picked up by a worker and is being processed.

I would recommend a thread dump of the worker to see what its threads are doing.

Apparently the worker wasn’t doing anything, but the issue disappeared for new workflows after changing the activity timeout parameters (the heartbeat was missing, and ScheduleToStart was removed and ScheduleToClose set instead; StartToClose was already set).
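For anyone finding this later, the timeout change described above might look roughly like the following configuration fragment with the Go SDK. This is a sketch, not the exact code from this thread: the function name `withActivityTimeouts` is hypothetical, and the ScheduleToClose and heartbeat durations are placeholder assumptions (only the 1h StartToClose comes from the earlier posts).

```go
package app

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// withActivityTimeouts returns a workflow context whose activities use
// ScheduleToClose, StartToClose and Heartbeat timeouts, with no
// ScheduleToStart timeout set.
func withActivityTimeouts(ctx workflow.Context) workflow.Context {
	return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		// Placeholder: upper bound for the activity including all retries.
		ScheduleToCloseTimeout: 6 * time.Hour,
		// From the thread: each single attempt may run for up to 1h.
		StartToCloseTimeout: time.Hour,
		// Placeholder: lets the server notice a dead or stuck worker
		// well before the StartToClose timeout expires.
		HeartbeatTimeout: 30 * time.Second,
	})
}
```

A heartbeat timeout only helps if the activity itself reports progress, for example by calling activity.RecordHeartbeat periodically from within the activity function.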

Thanks for all the help @maxim
