Limits the maximum time that an Activity Task can sit in a Task Queue. Mainly to identify whether a Worker is down or for Task routing. This is rarely needed!
I also note that the default for this value is infinite.
Now, I’ve run into a situation in which my system was misconfigured and none of the active workers had registered the activity type for a scheduled task. The task was scheduled and sitting in the queue. I expected the scheduleToStartTimeout of 10 seconds to fire in this situation, since the activity had been scheduled but not started for more than 10 seconds. It did not: the task remains in the queue, potentially forever, or more likely until some upstream timeout is triggered.
Can someone explain the reasoning behind the current design of this timeout?
I think there is a misunderstanding here. To be clear: if an activity type is not registered but an activity of that type was scheduled, it will still be started as long as any worker is listening on the task queue.
This is how the protocol between the Temporal worker and server works, and it is important for safely rolling out new Workflow/Activity types. Workers poll for work by task queue, not by activity type, so a worker listening on a task queue can receive any task assigned to that queue. Once a worker picks up a task, the server considers that task started. If the worker does not know how to handle that activity type, it fails that activity attempt, but from the server's perspective the task was started. This behaviour matters when rolling out new types of activities/workflows that old workers may not understand.
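To make the distinction concrete, here is a toy model in plain Python. It is not the Temporal SDK or the server implementation: every class, function, and status name in it (ToyServer, ToyWorker, simulate, the status strings, the "main-queue" and "SendEmail" values) is made up for illustration, and retries are not modeled. It only sketches the two points above: polling is keyed by task queue, and a task counts as started the moment any worker picks it up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActivityTask:
    activity_type: str
    scheduled_at: float
    schedule_to_start_timeout: float
    started_at: Optional[float] = None
    status: str = "SCHEDULED"


@dataclass
class ToyServer:
    """Hypothetical stand-in for the server: tasks are keyed by task queue only."""
    queues: dict = field(default_factory=dict)

    def schedule(self, queue, task):
        self.queues.setdefault(queue, []).append(task)

    def poll(self, queue, now):
        # Hand the next task to whichever worker polled this queue.
        # The activity type is never consulted here.
        pending = self.queues.get(queue, [])
        if not pending:
            return None
        task = pending.pop(0)
        task.started_at = now
        task.status = "STARTED"  # "started" from the server's point of view
        return task

    def expire(self, now):
        # Schedule-to-start is only checked for tasks still waiting in a queue.
        for pending in self.queues.values():
            for task in list(pending):
                if now - task.scheduled_at > task.schedule_to_start_timeout:
                    task.status = "TIMED_OUT_SCHEDULE_TO_START"
                    pending.remove(task)


class ToyWorker:
    def __init__(self, queue, handlers):
        self.queue = queue
        self.handlers = handlers  # activity type -> callable

    def run_once(self, server, now):
        task = server.poll(self.queue, now)
        if task is None:
            return None
        handler = self.handlers.get(task.activity_type)
        if handler is None:
            # The worker fails the attempt, but the task was already picked up.
            task.status = "ATTEMPT_FAILED_UNKNOWN_TYPE"
        else:
            handler()
            task.status = "COMPLETED"
        return task


def simulate(worker_listening):
    server = ToyServer()
    task = ActivityTask("SendEmail", scheduled_at=0.0, schedule_to_start_timeout=10.0)
    server.schedule("main-queue", task)
    if worker_listening:
        # This worker listens on the queue but has no handler for "SendEmail".
        ToyWorker("main-queue", handlers={}).run_once(server, now=1.0)
    server.expire(now=30.0)
    return task.status


if __name__ == "__main__":
    print(simulate(worker_listening=True))
    print(simulate(worker_listening=False))
```

In this sketch, the schedule-to-start timeout only fires in the second call, where no worker is polling the queue at all. In the first call the task leaves the queue after one second, so the timeout never gets a chance to apply even though no registered handler ever ran.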
I think it would be helpful if the docs included this information because while this technical definition of “started” makes sense with the explanation above, it would not match most users’ mental model of what it means for an activity to be “started”.
And a follow-up question then, of course, is: how do I detect this situation if ScheduleToStart is not the way to do it?
I know, but which one? Would it be activity_execution_failed? Presumably the worker has picked up and started the activity, and since the activity type is not registered, the activity_execution_failed metric is incremented?