Hi Everyone,
I have implemented a workflow for test purposes. The main workflow implementation is fairly simple: it just spawns child workflows, waits for them to complete, and then aggregates the results (the actual work is negligible here). The child workflows are not terribly complex either, consisting of 4-5 activities; however, they can take hours or even days to complete, since they are coupled to real-world activities and events (represented as signals to the child workflows). So far so good.
The real-world usage of this workflow would mean fanning out potentially hundreds of thousands of child workflows. From what I have read in the documentation and forums, this is not an issue for Temporal itself, but reading this article (Managing Long-Running Workflows with Temporal | Temporal) made me realize that spawning that many child workflows would cause the parent to hit the event history limit, so I would ultimately need to incorporate continue-as-new into my parent workflow implementation.
I have two questions:
Q1: Is it idiomatic in Temporal to implement a workflow like the one described above?
Q2: The main part of the implementation of the parent workflow looks like the following:
```python
# Spawn one child workflow per location, keyed by the location id.
location_insight_futures = [
    asyncio.create_task(
        workflow.execute_child_workflow(
            LocationReport.run,
            args=[location, params.report_id],
            id=location.id,
        )
    )
    for location in params.locations
]
# Wait for all children (ALL_COMPLETED by default), then aggregate.
completed, location_insight_futures = await workflow.wait(location_insight_futures)
location_insights = [fut.result() for fut in completed]
```
Do I need to worry about hitting the event history size limit (assuming, e.g., 500,000 spawned child workflows)? If yes, what is the best strategy for implementing continue-as-new? The best I can come up with consists of two elements:
- Break up the fan-out list comprehension into smaller batches and check whether continue-as-new is suggested between spawning child workflows.
- Change workflow.wait from the default ALL_COMPLETED to FIRST_COMPLETED, and also check for continue-as-new before waiting again.

While the second would make some sense, since result aggregation could overlap with child execution, the first really mixes underlying infra concerns (working around limits) into mundane business logic. A rough sketch combining both elements follows.
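This is only a sketch of what I have in mind, not a tested implementation. FanoutParams, LocationFanout, MAX_IN_FLIGHT and the insights field are placeholders for my actual types, and I am assuming that workflow.wait accepts return_when the same way asyncio.wait does and that workflow.info().is_continue_as_new_suggested() is the right signal to act on. Since child workflow handles cannot be carried across continue-as-new, the sketch drains all in-flight children before rolling over:

```python
import asyncio
from dataclasses import dataclass, field

from temporalio import workflow

MAX_IN_FLIGHT = 1000  # placeholder cap on concurrently pending children


@dataclass
class FanoutParams:
    locations: list   # Location objects, as in my current code
    report_id: str
    insights: list = field(default_factory=list)  # results carried across runs


@workflow.defn
class LocationFanout:
    @workflow.run
    async def run(self, params: FanoutParams) -> list:
        insights = list(params.insights)
        remaining = list(params.locations)
        pending: set = set()

        while remaining or pending:
            # Element 1: spawn in batches instead of all at once.
            while remaining and len(pending) < MAX_IN_FLIGHT:
                location = remaining.pop()
                pending.add(asyncio.create_task(
                    workflow.execute_child_workflow(
                        LocationReport.run,
                        args=[location, params.report_id],
                        id=location.id,
                    )
                ))

            # Element 2: FIRST_COMPLETED, so aggregation overlaps execution.
            done, pending = await workflow.wait(
                pending, return_when=asyncio.FIRST_COMPLETED)
            insights.extend(fut.result() for fut in done)

            # Roll over when the server suggests it; drain in-flight children
            # first, since their handles cannot survive the rollover.
            if remaining and workflow.info().is_continue_as_new_suggested():
                if pending:
                    done, pending = await workflow.wait(pending)
                    insights.extend(fut.result() for fut in done)
                workflow.continue_as_new(
                    args=[FanoutParams(remaining, params.report_id, insights)])

        return insights
```

The downside is exactly my point about the first element: the batching loop exists purely to work around history limits, not to express business logic.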
Any thoughts on this?
Thanks in advance
Andras
Update: with some experimenting I already see a limitation biting this use case: the number of outstanding child workflows cannot exceed 2000. That is a big problem for this implementation approach.
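If I understand correctly, this is a server-side default (I believe the limit.numPendingChildExecutions.error dynamic config value), so raising it would be an operational decision rather than something the workflow controls. As a workaround I am considering throttling the fan-out with a semaphore so that at most a fixed number of children are pending at once. A minimal sketch, reusing the placeholder names from the sketch above (and still needing the continue-as-new handling there, since history grows with every child regardless):

```python
import asyncio

from temporalio import workflow

MAX_IN_FLIGHT = 1000  # stay safely below the 2000 pending-children cap


@workflow.defn
class LocationFanout:
    @workflow.run
    async def run(self, params: FanoutParams) -> list:
        # Each child holds a semaphore slot while it is pending, so no
        # more than MAX_IN_FLIGHT children exist on the server at once.
        semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

        async def run_one(location):
            async with semaphore:
                return await workflow.execute_child_workflow(
                    LocationReport.run,
                    args=[location, params.report_id],
                    id=location.id,
                )

        tasks = [asyncio.create_task(run_one(loc)) for loc in params.locations]
        return list(await asyncio.gather(*tasks))
```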