WorkflowTaskTimedOut (ScheduleToStart) and delay in execution

Hi,

We are seeing an issue in a workflow where we use a Selector for our processing:

for {
	// Listener on the progress update channel; multiple progress signals
	// are needed before the task completes.
	s.AddReceive(signalChan, func(c workflow.ReceiveChannel, more bool) {
		c.Receive(ctx, &progress)
		if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).UpdateTaskProgress, pTask, progress).Get(ctx, nil); err != nil {
			return
		}
	// Wait for the task to expire.
	}).AddFuture(workflow.NewTimer(ctx, pTask.ExpireTime.Sub(workflow.Now(ctx))), func(f workflow.Future) {
		pTask.State = models.TaskStateExpired
		if err := workflow.ExecuteActivity(actx, new(progress_updater.Service).UpdateTaskState, pTask).Get(ctx, nil); err != nil {
			return
		}
	})
	// s is a workflow.Selector created above this snippet; Select runs the
	// matching handler on each iteration.
	s.Select(ctx)
}

We have the timer firing at 12 AM (pTask.ExpireTime), so ideally the expiry should be processed exactly at 12 AM (all workflows fire the timer then). But we are seeing a staggered trend in task pickup by workers across workflows: some of them complete at 12 AM, while others are delayed by around 5 to 50 minutes.

Distribution for reference:

Example history for a delayed workflow (it took around 41 minutes extra to complete):

79  TimerFired              Sep 3rd 12:00:02 am
80  WorkflowTaskScheduled   Sep 3rd 12:00:02 am
81  WorkflowTaskTimedOut    Sep 3rd 12:00:07 am (ScheduleToStart)
82  WorkflowTaskScheduled   Sep 3rd 12:00:07 am
83  WorkflowTaskStarted     Sep 3rd 12:16:19 am
84  WorkflowTaskCompleted   Sep 3rd 12:16:19 am
85  ActivityTaskScheduled   Sep 3rd 12:16:19 am
86  ActivityTaskStarted     Sep 3rd 12:28:13 am
87  ActivityTaskCompleted   Sep 3rd 12:28:13 am
88  WorkflowTaskScheduled   Sep 3rd 12:28:13 am
89  WorkflowTaskTimedOut    Sep 3rd 12:28:18 am (ScheduleToStart)
90  WorkflowTaskScheduled   Sep 3rd 12:28:18 am
91  WorkflowTaskStarted     Sep 3rd 12:41:06 am

All activities use the default workflow TaskQueue, and the timer fires at 12 AM for all workflows.

  • What could be the reason for this?

  • Should we create a separate TaskQueue per workflow for the activities (assuming the queue is getting flooded and that is why the ScheduleToStart timeout is happening)?

  • What is the preferred worker config to improve this?

    We tried the defaults, and currently we have:

     options := worker.Options{
		MaxConcurrentActivityExecutionSize:     500,
		WorkerActivitiesPerSecond:              1000,
		WorkerLocalActivitiesPerSecond:         1000,
		TaskQueueActivitiesPerSecond:           1000,
		MaxConcurrentWorkflowTaskExecutionSize: 500,
		MaxConcurrentActivityTaskPollers:       4,
	}

We have on the order of 5k workflows.

Please help with your valuable feedback.

Thanks in advance,
Junaid

  • Should we create a separate TaskQueue per workflow for the activities (assuming the queue is getting flooded and that is why the ScheduleToStart timeout is happening)?

You are seeing ScheduleToStart timeouts for workflow tasks. These happen when the worker is overloaded with workflow tasks. The history shows that the timer itself fired at 12:00:02, but the workflow task timed out and then only started 16 minutes later. Check the value of the attempt field on event 83 (WorkflowTaskStarted); if it is above 1, then the workflow task failed multiple times.

My guess is that your workers are not provisioned to handle the traffic generated by the timers firing at the same time.
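
For reference, a minimal sketch of reading the attempt (and task queue) of each WorkflowTaskScheduled event with the Go SDK; the same information is visible in the Web UI or via tctl workflow show. The workflow/run IDs are placeholders, and client.Dial assumes a recent SDK version (older versions use client.NewClient):

package main

import (
	"context"
	"fmt"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// "workflowID" and "runID" are placeholders for the delayed execution.
	iter := c.GetWorkflowHistory(context.Background(), "workflowID", "runID",
		false, enumspb.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT)
	for iter.HasNext() {
		event, err := iter.Next()
		if err != nil {
			log.Fatal(err)
		}
		if event.GetEventType() == enumspb.EVENT_TYPE_WORKFLOW_TASK_SCHEDULED {
			a := event.GetWorkflowTaskScheduledEventAttributes()
			fmt.Printf("event %d: queue=%s kind=%s attempt=%d\n",
				event.GetEventId(), a.GetTaskQueue().GetName(),
				a.GetTaskQueue().GetKind(), a.GetAttempt())
		}
	}
}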

Thanks @maxim

WorkflowTaskStarted does not print an attempt field (but it is present for WorkflowTaskScheduled).

But event 80 has:


taskQueue.name         d0d903738647:7d9d399b-2720-4969-8fe5-f97d632f1806
taskQueue.kind         Sticky
startToCloseTimeout    10s
attempt                1

Event 82 has:

taskQueue.name         task_progress_q
taskQueue.kind         Normal
startToCloseTimeout    10s
attempt                1

I can see that the task queue name is random for the first task scheduled.

  • Reducing the MaximumInterval of the RetryPolicy in the WorkflowOptions and ActivityOptions will reduce the impact, right?

  • We will check the scalability issue you mentioned. Do you have any suggestions specific to Temporal for scalability, given that this is not push-based? We are using the flex env on GCP.

I can see that the task queue name is random for the first task scheduled.

This is an artifact of workflow caching. See this post that explains it in detail.
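
To make this concrete: event 80 was scheduled on a worker-specific sticky queue (kind Sticky), and the 12:00:02 to 12:00:07 gap before WorkflowTaskTimedOut appears to match the Go SDK's default sticky schedule-to-start timeout (5s), after which the server reschedules the task on the normal queue (event 82). A rough sketch of the related knobs, with illustrative values only, not a tuning recommendation:

package main

import (
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func newStickyTunedWorker(c client.Client) worker.Worker {
	// Size of the in-process sticky workflow cache shared by all workers in
	// this process (default 10000 in recent SDK versions); must be called
	// before any worker starts.
	worker.SetStickyWorkflowCacheSize(10000)

	return worker.New(c, "task_progress_q", worker.Options{
		// Default is 5s. If the cached worker cannot poll its sticky queue
		// within this window, the workflow task times out (ScheduleToStart)
		// and is rescheduled on the normal task queue.
		StickyScheduleToStartTimeout: 5 * time.Second,
	})
}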

  • Reducing the MaximumInterval of the RetryPolicy in the WorkflowOptions and ActivityOptions will reduce the impact, right?

No. RetryPolicy has nothing to do with the inability of workers to process a backlog of workflow tasks quickly.

We will check the scalability issue you mentioned. Do you have any suggestions specific to Temporal for scalability, given that this is not push-based? We are using the flex env on GCP.

Add more worker processes.

Sure, I will increase the worker capacity.

By RetryPolicy, what I meant is that it backs off exponentially, so the task is only picked up after a long time even though the worker has capacity. Can that span be reduced by altering the retry policy?

The RetryPolicy is only used to calculate a retry interval for activity and workflow failures. Your issue is not related to workflow and activity failures, so changing the RetryPolicy doesn’t really help.
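
For clarity, a minimal sketch of where a RetryPolicy sits in the Go SDK (the values are purely illustrative): it only controls the backoff between retries of a failed activity or workflow, not how quickly a worker picks up a task waiting in a backlog.

package main

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// withRetryingActivityOptions returns a context whose activities retry with
// exponential backoff after a failure. Shortening MaximumInterval changes how
// soon a failed activity is retried; it does nothing for tasks that are simply
// queued behind other work.
func withRetryingActivityOptions(ctx workflow.Context) workflow.Context {
	return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    time.Minute,
			MaximumAttempts:    5,
		},
	})
}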

@maxim We have checked our servers for scalability issues (no noticeable issues there). We understand that altering the worker options changes the behavior, and we are running performance tests with different worker options to find the values that suit us.

Meanwhile, it would be helpful if you could comment on our worker options.

The worker should be able to pick up tasks as fast as possible (given that the workers have the capacity to handle the traffic).

We are using:

options := worker.Options{
		MaxConcurrentActivityExecutionSize:      2500,
		MaxConcurrentLocalActivityExecutionSize: 2500,
		MaxConcurrentWorkflowTaskExecutionSize:  3000,
		MaxConcurrentActivityTaskPollers:        16,
		MaxConcurrentWorkflowTaskPollers:        16,
	}
  • Will this throttle the worker?

  • If we want to process 5000 workflows with 5 activities each, and all activities execute in about 50 ms, what would be the best values? We want to process everything as fast as possible, given that the worker has the ability to scale, so that moving from 5k to 50k workflows is not a problem.

What are the maximum values we can provide, e.g.:

MaxConcurrentActivityExecutionSize?
MaxConcurrentWorkflowTaskPollers?
.....

Is there any upper limit of any sort for these variables?

Thanks in advance,
Junaid

  MaxConcurrentActivityExecutionSize:      2500,
  MaxConcurrentLocalActivityExecutionSize: 2500,

These are too large. They define the sizes of thread pools used to execute activities, so you are practically allocating up to 5k threads just for activity execution.

MaxConcurrentWorkflowTaskExecutionSize: 3000,

This also defines the maximum size of the thread pool used to execute workflow tasks. Such a large pool is not really needed. This value should not exceed the value of WorkerFactoryOptions.maxWorkflowThreadCount, which defines how many threads workflows can use, including all cached workflows. You might try bumping maxWorkflowThreadCount from the default of 600, but not by too much.

MaxConcurrentActivityTaskPollers: 16,
MaxConcurrentWorkflowTaskPollers: 16,

These look OK.
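
Putting that feedback together, a hedged sketch of a more moderate configuration (the numbers are illustrative starting points to be validated by the performance tests mentioned earlier, not recommended values):

package main

import (
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func newModerateWorker(c client.Client, taskQueue string) worker.Worker {
	return worker.New(c, taskQueue, worker.Options{
		// Rough sizing: 5k workflows x 5 activities at ~50 ms each is about
		// 25k short activity executions. A few hundred concurrent slots per
		// process goes a long way for work that short, and adding more worker
		// processes (as suggested above) scales capacity horizontally.
		MaxConcurrentActivityExecutionSize:      500,
		MaxConcurrentLocalActivityExecutionSize: 500,
		MaxConcurrentWorkflowTaskExecutionSize:  500,
		MaxConcurrentActivityTaskPollers:        16,
		MaxConcurrentWorkflowTaskPollers:        16,
	})
}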

Thanks @maxim.

So, as per my understanding of your reply, these worker options are not throttling the workers, right?

And what is the equivalent of maxWorkflowThreadCount in the Go SDK?

So, as per my understanding of your reply, these worker options are not throttling the workers, right?

Allocating 8k threads (2500 + 2500 + 3000) certainly can slow down workers.