Is there a way to specify a ScheduleToStart timeout for a workflow?

Is there a way to specify a ScheduleToStart timeout for a workflow (not an Activity, but a workflow)?
And if not, why?

By "start", do you mean when the workflow execution is created and started by the Temporal service, or when your worker first starts processing your workflow code?

When you send a create execution request from your client, the execution is created by the server right away (if your configured workflow ID reuse policy allows it).

If the first workflow task is sitting on the task queue and there are no workers to pick it up and start processing it, then all other requests to start workflow executions on that task queue would have the same issue. In those cases, stopping the workflow execution and starting a new one might not make much difference.

Can you describe your use case for this, please?

By ScheduleToStart I mean the time between when the workflow execution is created and when a worker starts processing it.
The use case is exactly the one you are describing: if for some reason we do not have any workers available to start processing the workflow, we’d like to know that.
It is different from the workflow execution timeout, because in our case the actual run time can be quite long (e.g. 1 hour), but we don’t want to wait the full hour just to find out that the workflow never even had a chance to execute.

You could use metrics for this and monitor across your namespaces and task queues:

You could use the service_pending_requests server metric and alert on it. It is emitted when there is a poller polling on the frontend service and waiting for a task. A value of 0 would mean there are no pollers, so your workers could be down.

Another server metric is sync match rate:

sum(rate(poll_success_sync{}[1m])) - sum(rate(poll_success{}[1m]))

If this goes above 0, it is an indication that you might not have enough workers available, and it could also mean workers are down.

Other things to alert on are the SDK metrics workflow_task_schedule_to_start_latency and activity_schedule_to_start_latency. If they become very high for a particular task queue that your workflows are running on, it is a good indication that your workers are having issues or are down.
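In practice this check would usually live in an alert rule in your monitoring system. Purely to illustrate the logic, here is a minimal plain-Java sketch that flags task queues whose observed schedule-to-start latency is high. The class name, method names, the p95 choice, and the threshold are all assumptions made for illustration; none of them come from a Temporal SDK.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of any Temporal SDK: flags task queues whose
// sampled workflow-task schedule-to-start latency exceeds a threshold.
class ScheduleToStartAlert {

    // Returns the nearest-rank percentile (0 < pct <= 100) of the samples.
    static Duration percentile(List<Duration> samples, double pct) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        List<Duration> sorted = new ArrayList<>(samples);
        Collections.sort(sorted); // Duration implements Comparable
        int rank = (int) Math.ceil(pct / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    // Returns the names of task queues whose p95 latency is above the
    // threshold, in alphabetical order. Queues without samples are skipped.
    static List<String> queuesToAlertOn(Map<String, List<Duration>> latenciesByQueue,
                                        Duration threshold) {
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, List<Duration>> e : new TreeMap<>(latenciesByQueue).entrySet()) {
            List<Duration> samples = e.getValue();
            if (!samples.isEmpty() && percentile(samples, 95).compareTo(threshold) > 0) {
                alerts.add(e.getKey());
            }
        }
        return alerts;
    }
}
```

A queue whose samples are all in the millisecond range would not be flagged against a 10 second threshold, while a queue whose tasks wait tens of seconds before a worker picks them up would be.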

Could you explain why there is no ScheduleToStart timeout for Workflows?

Activities have such a timeout, but not workflows.
Is there a specific reason for this asymmetry?

You don’t need a ScheduleToStart timeout for activities in 99.9% of use cases either. The only reasonable use for it is when routing activities to specific hosts. Workflows are never linked to specific hosts, so such a timeout is not really useful. Metrics can be used to monitor worker downtime if needed.

@maxim, for completeness: can setWorkflowHostLocalTaskQueueScheduleToStartTimeout be used to configure the Workflow’s ScheduleToStartTimeout?

No. The workflow schedule-to-start timeout is always equal to WorkflowRunTimeout. The WorkflowHostLocalTaskQueueScheduleToStartTimeout defines how long a workflow task can stay in a host-specific task queue before being redispatched to the main workflow task queue.

"WorkflowHostLocalTaskQueueScheduleToStartTimeout defines how long a host specific task queue can stay there before being redispatched to the main workflow task queue."

  • In case an application has a single task queue, does the above timeout have any effect on a scheduled but not started workflow? Would it be safe to conclude that the above timeout is significant only when there is more than one task queue?

A workflow has a task queue specified when it is started. The same task queue name is used when the corresponding workflow worker starts. But each workflow worker also always listens on an additional task queue with a worker-specific name, which is called “host local”. This allows routing workflow tasks for cached workflows to specific processes. The above timeout applies only to tasks scheduled to “host local” task queues.

  • In case an application has a single task queue, does the above timeout have any effect on a scheduled but not started workflow? Would it be safe to conclude that the above timeout is significant only when there is more than one task queue?

No. This timeout doesn’t apply to the situations described in the questions.
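To make the routing described above concrete, here is a toy model in plain Java. It is only a sketch of the behavior as explained in this thread, not how the Temporal server is implemented, and every name in it (HostLocalRoutingModel, initialQueue, queueAfterWaiting) is hypothetical. It assumes that a workflow task goes to a host-local queue only when some worker already has the workflow cached, so the first task of a new execution starts on the main queue.

```java
import java.time.Duration;

// Toy model of host-local task queue routing. Hypothetical names; this is
// an illustration of the explanation above, not Temporal's implementation.
class HostLocalRoutingModel {

    enum Queue { MAIN, HOST_LOCAL }

    // Picks the queue a workflow task is initially scheduled on.
    // cachedOnWorker is null when no worker has the workflow cached,
    // for example for the first workflow task of a new execution.
    static Queue initialQueue(String cachedOnWorker) {
        return cachedOnWorker == null ? Queue.MAIN : Queue.HOST_LOCAL;
    }

    // Returns where the task sits after waiting `waited` without being
    // picked up by a worker.
    static Queue queueAfterWaiting(Queue initial, Duration waited,
                                   Duration hostLocalScheduleToStartTimeout) {
        if (initial == Queue.HOST_LOCAL
                && waited.compareTo(hostLocalScheduleToStartTimeout) >= 0) {
            // Redispatched so that any worker polling the main queue can take it.
            return Queue.MAIN;
        }
        // Tasks on the main queue are unaffected by the host-local timeout.
        return initial;
    }
}
```

In this model a task that starts on the main queue stays there no matter how long it waits, which matches the answer above: the host-local timeout has no effect on a workflow that was scheduled but never started.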

Thanks Maxim! As always super helpful!