Long-running workflow with significant fan-out of child workflows

Hi Everyone,

I have implemented a workflow for test purposes. The main workflow implementation is fairly simple: it just spawns child workflows, waits for them to complete, and then aggregates the results (the actual work is negligible here). Child workflows are not terribly complex either (consisting of 4-5 activities), but they can take hours or even days to complete, since they are coupled to real-world activities and events (which are represented as signals to the child workflows). So far so good.

The real-world usage of this workflow would mean fanning out potentially hundreds of thousands of child workflows. As far as I can tell from the documentation and forums, this is not an issue for Temporal itself, but reading this article (Managing Long-Running Workflows with Temporal | Temporal) made me wonder whether spawning that many child workflows would cause the parent to hit the event history limit, ultimately forcing me to incorporate continue-as-new into my parent workflow implementation.

I have two questions:
Q1: Is it idiomatic in Temporal to implement a workflow as described above?
Q2: The main part of the implementation of the parent workflow looks like the following:

        location_insight_futures = [
            asyncio.create_task(
                workflow.execute_child_workflow(
                    LocationReport.run, args=[location, params.report_id], id=location.id
                )
            )
            for location in params.locations
        ]
        (completed, location_insight_futures) = await workflow.wait(location_insight_futures)
        location_insights = [fut.result() for fut in completed]

Do I need to worry about hitting the event history size limit (assuming e.g. 500,000 spawned child workflows)? If so, what is the best strategy for implementing continue-as-new? The best approach I can come up with consists of two elements:

  1. Break up the fan-out list comprehension into smaller batches and check whether continue-as-new is suggested between spawning batches of child workflows.
  2. Change workflow.wait from the default ALL_COMPLETED to FIRST_COMPLETED, and also check for continue-as-new before waiting again.

While 2. makes some sense, since result aggregation can overlap with child execution, 1. really mixes underlying infrastructure concerns (working around limits) into mundane business logic.
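For what it's worth, here is a plain-asyncio sketch of the control flow I have in mind for combining both elements. It is not Temporal code: `child`, `continue_as_new_suggested`, and `BATCH_SIZE` are stand-ins for `workflow.execute_child_workflow`, `workflow.info().is_continue_as_new_suggested()`, and a batch size I would have to pick; in a real workflow the wait would be `workflow.wait` rather than `asyncio.wait`.

```python
import asyncio

BATCH_SIZE = 100  # assumed cap on in-flight children per round


async def child(n: int) -> int:
    # Stand-in for a child workflow execution; just returns a result.
    await asyncio.sleep(0)
    return n * 2


def continue_as_new_suggested() -> bool:
    # Stand-in for workflow.info().is_continue_as_new_suggested().
    return False


async def fan_out(items: list[int]) -> list[int]:
    results: list[int] = []
    pending: set[asyncio.Task] = set()
    remaining = list(items)
    while remaining or pending:
        # Element 1: top up the in-flight set in batches instead of
        # spawning everything up front.
        while remaining and len(pending) < BATCH_SIZE:
            pending.add(asyncio.create_task(child(remaining.pop(0))))
        # Element 2: drain with FIRST_COMPLETED so aggregation overlaps
        # child execution, re-checking continue-as-new between rounds.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        results.extend(t.result() for t in done)
        if continue_as_new_suggested():
            # In Temporal this is where the parent would carry its
            # remaining/pending state into workflow.continue_as_new(...).
            break
    return results
```

Running `asyncio.run(fan_out(list(range(10))))` collects the doubled values as children finish.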

Any thoughts on this?

Thanks in advance
Andras

update: With some experimenting I already see a limitation biting this use case: the number of outstanding child workflows cannot exceed 2,000. That's a big problem for this implementation approach.

You can create a tree of children. A parent spawns 1k children, each of them in turn spawns 1k children, and you get 1 million children total.
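The recursive structure can be sketched like this (plain asyncio stand-ins, not Temporal API: `leaf_work` stands for the actual child workflow, `tree_fan_out` for an intermediate child, and `MAX_CHILDREN` is kept tiny here for illustration where you would use something like 1,000):

```python
import asyncio

MAX_CHILDREN = 3  # e.g. 1000 in practice; small here for illustration


async def leaf_work(item: int) -> int:
    # Stand-in for the real child workflow doing the actual work.
    await asyncio.sleep(0)
    return item * 2


async def tree_fan_out(items: list[int]) -> list[int]:
    # If the list fits under the per-parent limit, run leaves directly;
    # otherwise split into MAX_CHILDREN slices and delegate each slice
    # to an intermediate child, recursively.
    if len(items) <= MAX_CHILDREN:
        return list(await asyncio.gather(*(leaf_work(i) for i in items)))
    step = -(-len(items) // MAX_CHILDREN)  # ceiling division
    slices = [items[i:i + step] for i in range(0, len(items), step)]
    nested = await asyncio.gather(*(tree_fan_out(s) for s in slices))
    return [r for sub in nested for r in sub]
```

Each level multiplies the fan-out by MAX_CHILDREN, so two levels of 1,000 already cover a million leaves while no single parent ever has more than 1,000 pending children.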

Hi Maxim,

Thanks for your response. I have implemented a tail-recursive variation of your idea, and it works perfectly. There is a minor concern about mixing this logic with the business logic, but the technical issue is solved for the time being.

Thanks for your help!

Best
Andras
