Dear team!
I feel I lack an understanding of the main flows in the core Temporal microservices. One of the important flows is sync/async matching.
From my understanding, the matching engine tries, for performance reasons, to match producers and consumers as soon as possible, on the fly. But if a worker comes to the matcher with a new task and the same worker is ready to process that task, will a sync match be done every time? What is the flow for an async match then? Or is the flow different: a worker comes with a completed task, history takes the next task and gives it to a waiting poller, and this is a sync match. Correct?
Activity and workflow tasks are matched exactly the same way. There are two types of task queues: workflow and activity and they are completely independent. When a worker (defined in an SDK) is started with a “foo” task queue name it listens on two queues (assuming that both activities and workflows are registered with it), one for activities and another for workflows with the same “foo” name.
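To make the "two independent queues with the same name" point concrete, here is a minimal in-memory sketch. The `TaskQueues` class and its methods are hypothetical names for illustration only, not Temporal APIs; the only idea taken from the explanation above is that a queue is identified by its name and its type together.

```python
from collections import defaultdict, deque


class TaskQueues:
    """Toy model: a task queue is identified by (name, kind), not by name alone."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def add_task(self, name, kind, task):
        # kind is "workflow" or "activity"; each pair gets its own queue.
        self._queues[(name, kind)].append(task)

    def poll(self, name, kind):
        # Returns the next task for that exact (name, kind) pair, or None.
        queue = self._queues[(name, kind)]
        return queue.popleft() if queue else None


if __name__ == "__main__":
    queues = TaskQueues()
    queues.add_task("foo", "workflow", "wf-task-1")
    # An activity poller on "foo" does not see the workflow task.
    print(queues.poll("foo", "activity"))
    print(queues.poll("foo", "workflow"))
```

A worker started on "foo" with both workflows and activities registered would, in this model, poll both `("foo", "workflow")` and `("foo", "activity")`.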
There are 5 high-level components involved in delivering a task to an external worker.
- An application Worker which is implemented using Temporal SDK and hosts activity and workflow code.
- Frontend Service Role
- History Service Role
- Matching Service Role
- Persistence that stores activity task queues
Normal Match
When a history service needs to dispatch a task to an application worker, it looks at the task queue name and uses a consistent hash function to find the matching host that owns that task queue. Then it calls the add task operation on the matching service.
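As a rough illustration of routing by consistent hash, the sketch below maps a task queue name to an owning host. It is a generic hash ring, not Temporal's actual membership code; `HashRing`, the host names, and the choice of SHA-256 with virtual nodes are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash derived from SHA-256 (an arbitrary choice for this sketch).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent hash ring with virtual nodes."""

    def __init__(self, hosts, vnodes=64):
        # Each host is placed on the ring at several points to even out load.
        self._points = sorted(
            (_hash(f"{host}#{i}"), host) for host in hosts for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def owner(self, task_queue: str) -> str:
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect_right(self._keys, _hash(task_queue)) % len(self._keys)
        return self._points[idx][1]


if __name__ == "__main__":
    ring = HashRing(["matching-1", "matching-2", "matching-3"])
    print(ring.owner("foo"))
```

The useful property is that every caller computes the same owner for the same task queue name, and removing a host only moves the queues that host owned.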
The matching service checks if there are any waiting long polls. If there are none, it persists the task in the DB. Then it acknowledges the task acceptance to the history service, which acks it to its internal transfer queue.
Later, when a poll request comes from an external worker, it is routed to the appropriate matching service host by a frontend. Then a task is loaded from the DB and returned to the caller. The diagram is simplified, as the Load Task operation loads batches of tasks and caches them in matching host memory as an optimization.
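The normal match path, including the batch-and-cache optimization, can be sketched as a toy model. Everything here is hypothetical (`NormalMatchHost`, `batch_size`, the `db_reads` counter); a plain deque stands in for the DB, and the real service is of course more involved.

```python
from collections import deque


class NormalMatchHost:
    """Toy model of the normal (async) match path on one matching host."""

    def __init__(self, batch_size=3):
        self.db = deque()        # stand-in for persisted tasks
        self.cache = deque()     # tasks already loaded into host memory
        self.batch_size = batch_size
        self.db_reads = 0        # how many times a batch was loaded from the DB

    def add_task(self, task):
        # No poller is waiting in this scenario, so the task is persisted
        # and then acknowledged to the caller (the history service).
        self.db.append(task)
        return "ack"

    def _load_batch(self):
        self.db_reads += 1
        for _ in range(min(self.batch_size, len(self.db))):
            self.cache.append(self.db.popleft())

    def poll(self):
        # Serve from memory when possible; go to the DB only when the cache is empty.
        if not self.cache:
            self._load_batch()
        return self.cache.popleft() if self.cache else None


if __name__ == "__main__":
    host = NormalMatchHost(batch_size=3)
    for i in range(5):
        host.add_task(f"task-{i}")
    print([host.poll() for _ in range(5)], "DB reads:", host.db_reads)
```

With a batch size of 3, five polls need only two DB reads in this model, which is the point of the optimization.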
Sync Match
In the sync match case, a matching service host receives one or more poll requests when there are no tasks stored in the DB. In this case, it stores the poll request in an internal poll queue. If no task is added for the duration of the long poll timeout (one minute by default), the poll request is returned empty to the worker, which repeats it.
If history calls add task while a poll is waiting, the task is not saved in the DB; it is immediately matched with a poll request, which returns it to the waiting worker. In this case, the add task operation is acknowledged to the history service immediately after the poll request returns.
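Putting the two paths side by side, a minimal single-threaded sketch of the decision might look like this. `Poll`, `MatchingHost`, and the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow) are illustrative assumptions, not the server's real types.

```python
from collections import deque


class Poll:
    """A parked long-poll request. task stays None if it returns empty."""

    def __init__(self, deadline):
        self.deadline = deadline
        self.task = None
        self.done = False


class MatchingHost:
    def __init__(self, long_poll_timeout=60.0):
        self.timeout = long_poll_timeout
        self.waiting = deque()   # internal poll queue (FIFO)
        self.db = deque()        # stand-in for persisted tasks

    def poll(self, now):
        request = Poll(now + self.timeout)
        if self.db:
            # Backlog exists: normal match, served from persisted tasks.
            request.task, request.done = self.db.popleft(), True
        else:
            # Nothing to hand out: park the poll until a task arrives or it times out.
            self.waiting.append(request)
        return request

    def expire(self, now):
        # Polls that waited the full timeout are returned empty to the worker.
        while self.waiting and self.waiting[0].deadline <= now:
            self.waiting.popleft().done = True

    def add_task(self, task, now):
        self.expire(now)
        if self.waiting:
            # Sync match: hand the task straight to a waiting poll, no DB write.
            request = self.waiting.popleft()
            request.task, request.done = task, True
            return "sync"
        self.db.append(task)
        return "persisted"


if __name__ == "__main__":
    host = MatchingHost()
    waiting_poll = host.poll(now=0)
    print(host.add_task("t1", now=10), waiting_poll.task)
```

In this model a poll that waits the full timeout comes back with `done` set and `task` still `None`, which corresponds to the empty response the worker then retries.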
@maxim Thank you very much for the detailed explanation, which is greatly supported by the diagrams. One idea: it would be nice in the future to link flow articles like this one to the metrics descriptions, in this case the metrics that show the rate and latency of Sync and Normal (Async) match.
Thanks a lot!
@maxim, one question related to this:
My workers are set up to listen to both the workflow and the activity task queue. What could be the reason for lower addworkflowtask latency compared to addactivitytask in async match? Is there any lever I can try to make them even?
I also notice that activity schedule-to-start latency is higher than its workflow counterpart. What can I try to ease the situation?
Thanks!
A worker polls for tasks only when it has available capacity. Probably not enough threads were given to activity executions. Try increasing WorkerOptions.MaxConcurrentActivityExecutionSize.
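To see why execution capacity shows up as schedule-to-start latency, here is a toy calculation. It assumes all tasks are scheduled at time 0, each takes the same `duration`, and the worker only picks up a new task when one of its `slots` is free; `slots` stands in for MaxConcurrentActivityExecutionSize, and the function name is made up for this sketch.

```python
import heapq


def schedule_to_start_latencies(num_tasks, slots, duration):
    """Toy model: latency each task waits before a worker slot picks it up."""
    free_at = [0.0] * slots          # time at which each execution slot becomes free
    heapq.heapify(free_at)
    latencies = []
    for _ in range(num_tasks):
        start = heapq.heappop(free_at)   # the worker polls only once a slot is free
        latencies.append(start)          # every task was scheduled at t=0
        heapq.heappush(free_at, start + duration)
    return latencies


if __name__ == "__main__":
    print(max(schedule_to_start_latencies(num_tasks=8, slots=2, duration=1.0)))
    print(max(schedule_to_start_latencies(num_tasks=8, slots=8, duration=1.0)))
```

With too few slots the tasks queue up behind running executions; with enough slots every task starts immediately in this model.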
Thanks for the guide.
I do see activity/workflow schedule-to-start latency improve a bit when I boost the number of task pollers. Increasing MaxConcurrentActivityExecutionSize also improves async match latency, but what could be the reason that async match latency still ends up high? Should these settings be increased?
worker:
  MaxConcurrentActivityTaskPollers: 48
  MaxConcurrentWorkflowTaskPollers: 16
  MaxConcurrentActivityExecutionSize: 600
  TaskQueueActivitiesPerSecond: 600 # 4 activities * 150 workflows
One minute is the long poll interval. It means that there are more pollers than tasks, so some pollers end up waiting for a minute without getting any. It is not really a problem. On the contrary, it indicates that your workers have enough capacity to process your requests.