Batch Processing vs Multiple Workflows

We have a simple workflow (shown in a diagram in the original post) that processes a file through a pipeline of Activities ending with "Save" and "Update Status" steps.

The workflow starts out with a file that has potentially millions of records in it that need to be processed. I can see a few options for processing this:

  1. Create a single workflow with an Activity at each stage of processing. Each Activity takes a reference to a file, processes the file, and returns a reference to the newly created processed file, which is then passed on to the next Activity, and so on. Each Activity can use heartbeat details to resume from a particular point if there are any issues during processing (see the sketch after this list).
  2. The original file is opened and each object is processed in a loop within the workflow. Each Activity would then only process a single object, so the object could probably be passed through as an argument rather than referenced in a file.
  3. The original file is opened and a separate workflow is spawned for each object in the file. Each Activity would then only process a single object, so the object could probably be passed through as an argument rather than referenced in a file. Each child workflow would need to be tracked so that the parent workflow completes once all children are complete.
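
For illustration, here is a minimal Go sketch of Option 1, assuming hypothetical activity names ("TransformFile", "SaveFile", "UpdateStatus") registered on a worker, where each Activity takes a file reference and returns a reference to the file it produced:

```go
package batch

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// Option 1 sketch: a single workflow chains file-level activities.
// Each activity receives a reference to a file (e.g. an S3 key) and
// returns a reference to the file it produced. The activity names are
// hypothetical placeholders.
func FilePipelineWorkflow(ctx workflow.Context, inputFileRef string) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 2 * time.Hour, // long enough to process a large file
		HeartbeatTimeout:    time.Minute,   // detect a stuck or crashed worker quickly
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var transformedRef string
	if err := workflow.ExecuteActivity(ctx, "TransformFile", inputFileRef).Get(ctx, &transformedRef); err != nil {
		return err
	}

	var savedRef string
	if err := workflow.ExecuteActivity(ctx, "SaveFile", transformedRef).Get(ctx, &savedRef); err != nil {
		return err
	}

	return workflow.ExecuteActivity(ctx, "UpdateStatus", savedRef).Get(ctx, nil)
}
```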

Option 2 has the potential to create millions of events within the single workflow, so I think the internal limit of 100k events per workflow would render this option unviable.

Option 1 & Option 3 look like they could both work.

Option 1 Pros & Cons

  1. Pro: Does not require triggering multiple workflows
  2. Con: Cannot process multiple objects in parallel

Option 3 Pros & Cons

  1. Pro: Can process multiple objects in parallel
  2. Con: Have to create a lot of workflows.
  3. Con: Cannot take advantage of any bulk operations, e.g. bulk saving of objects.

In this type of scenario, is there any advice on which approach should be taken, or pitfalls that one approach may have compared to the other?

Option 1 Pros & Cons

  1. Con: Cannot process multiple objects in parallel

You can run multiple parallel activities for different parts of the file. For example, for a large file stored in S3 you could download parts of it independently to different hosts.
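
One possible sketch of this fan-out, assuming a hypothetical "ProcessRange" activity that downloads and processes a single byte range of the object on whichever worker picks it up:

```go
package batch

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// Sketch of fanning out parallel activities over parts of a large file.
// "ProcessRange" is a hypothetical activity handling one byte range of the object.
func ParallelPartsWorkflow(ctx workflow.Context, objectKey string, objectSize, partSize int64) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
		HeartbeatTimeout:    time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var futures []workflow.Future
	for offset := int64(0); offset < objectSize; offset += partSize {
		end := offset + partSize
		if end > objectSize {
			end = objectSize
		}
		futures = append(futures, workflow.ExecuteActivity(ctx, "ProcessRange", objectKey, offset, end))
	}
	// Wait for every part; the first error fails the workflow (subject to activity retries).
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```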

Option 3 Pros & Cons
3. Con: Cannot take advantage of any bulk operations, e.g. bulk saving of objects.

There are ways to buffer events across multiple activity invocations. For example, accumulate results from many activities on a worker and complete all of these activities asynchronously after a bulk operation is done.
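
A rough sketch of this asynchronous-completion pattern in Go, with hypothetical names (BulkSaver, bulkWrite): each activity invocation buffers its record and returns activity.ErrResultPending, and the buffered activities are completed via client.CompleteActivity once the bulk write finishes:

```go
package batch

import (
	"context"
	"sync"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/client"
)

// Each SaveRecord invocation buffers its record together with its task token and
// returns activity.ErrResultPending, so the activity stays open. Once enough
// records are buffered, flush performs one bulk write and then completes every
// pending activity. The names and bulkWrite are hypothetical.
type BulkSaver struct {
	mu      sync.Mutex
	client  client.Client
	records []Record
	tokens  [][]byte
}

type Record struct{ ID, Payload string }

func (b *BulkSaver) SaveRecord(ctx context.Context, rec Record) error {
	info := activity.GetInfo(ctx)

	b.mu.Lock()
	b.records = append(b.records, rec)
	b.tokens = append(b.tokens, info.TaskToken)
	full := len(b.records) >= 1000
	b.mu.Unlock()

	if full {
		go b.flush(context.Background()) // run the bulk save outside this invocation
	}
	// Tell the SDK the result will be provided later via CompleteActivity.
	return activity.ErrResultPending
}

func (b *BulkSaver) flush(ctx context.Context) {
	b.mu.Lock()
	records, tokens := b.records, b.tokens
	b.records, b.tokens = nil, nil
	b.mu.Unlock()

	err := bulkWrite(ctx, records) // hypothetical bulk save, e.g. one bulk request to the data store
	for _, token := range tokens {
		// Complete (or fail) every buffered activity with the outcome of the bulk write.
		_ = b.client.CompleteActivity(ctx, token, nil, err)
	}
}

func bulkWrite(ctx context.Context, records []Record) error { return nil } // placeholder
```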

Scenarios

Option 1

This is the simplest approach if processing each record of the file is simple and short-lived. An activity implementation can process multiple records in parallel if it helps to speed things up.
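
For example, the activity could fan record processing out to a bounded pool of goroutines; readRecords and processRecord below are hypothetical placeholders for the real per-record work:

```go
package batch

import (
	"context"
	"sync"

	"go.temporal.io/sdk/activity"
)

// Sketch of an Option 1 activity that processes the records of a file with
// bounded internal concurrency. readRecords and processRecord are hypothetical.
func ProcessFileActivity(ctx context.Context, fileRef string) error {
	records, err := readRecords(fileRef)
	if err != nil {
		return err
	}

	sem := make(chan struct{}, 16) // at most 16 records in flight at once
	var wg sync.WaitGroup
	var mu sync.Mutex
	var firstErr error

	for i, rec := range records {
		sem <- struct{}{}
		wg.Add(1)
		go func(rec string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := processRecord(ctx, rec); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(rec)

		if i%100 == 0 {
			activity.RecordHeartbeat(ctx, i) // report progress periodically
		}
	}
	wg.Wait()
	return firstErr
}

func readRecords(fileRef string) ([]string, error)        { return nil, nil } // placeholder
func processRecord(ctx context.Context, rec string) error { return nil }      // placeholder
```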

Option 2

This approach is still useful if the size of the file is bounded, as it is the simplest one.

There is a variation of this approach that works with files of unlimited size, which I call the iterator workflow. The idea is to process a part of the file and then call continue-as-new to continue processing. This way, each run of the workflow, which processes a range of records, is bounded in size. This approach also works if each record requires a child workflow for processing.
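
A minimal sketch of the iterator workflow, assuming a hypothetical "ProcessRecords" activity that handles one range of records and reports whether more remain:

```go
package batch

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

const recordsPerRun = 1000 // keep each run's history well within the limits

// Iterator workflow sketch: process one range of records, then continue-as-new
// with the next offset so the history never grows unbounded.
func IteratorWorkflow(ctx workflow.Context, fileRef string, offset int) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
		HeartbeatTimeout:    time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var hasMore bool
	if err := workflow.ExecuteActivity(ctx, "ProcessRecords", fileRef, offset, recordsPerRun).Get(ctx, &hasMore); err != nil {
		return err
	}
	if !hasMore {
		return nil // reached the end of the file
	}
	// Start a fresh run with an empty history, carrying the new offset forward.
	return workflow.NewContinueAsNewError(ctx, IteratorWorkflow, fileRef, offset+recordsPerRun)
}
```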

Option 3

The child-workflow-per-record option is needed if each record requires independent orchestration which can take an unpredictable amount of time. If the number of records in the file is large, the options are either to use the iterator workflow approach described in Option 2 or to use hierarchical workflows. For example, a parent workflow with 1,000 children, each of which has 1,000 children of its own, allows starting 1 million workflows without hitting any workflow size limits.
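
A sketch of such a two-level hierarchy, with a hypothetical per-record workflow (RecordWorkflow) at the leaves; the parent passes file ranges rather than the records themselves to keep inputs small:

```go
package batch

import "go.temporal.io/sdk/workflow"

// Hierarchical fan-out sketch: the parent starts a child per partition of the
// file, and each partition child starts a workflow per record in its range.
// With ~1,000 children at each level this covers ~1 million records without
// any single workflow history growing too large.
func ParentWorkflow(ctx workflow.Context, fileRef string, totalRecords, perPartition int) error {
	var futures []workflow.ChildWorkflowFuture
	for start := 0; start < totalRecords; start += perPartition {
		end := start + perPartition
		if end > totalRecords {
			end = totalRecords
		}
		futures = append(futures, workflow.ExecuteChildWorkflow(ctx, PartitionWorkflow, fileRef, start, end))
	}
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}

func PartitionWorkflow(ctx workflow.Context, fileRef string, start, end int) error {
	var futures []workflow.ChildWorkflowFuture
	for i := start; i < end; i++ {
		futures = append(futures, workflow.ExecuteChildWorkflow(ctx, RecordWorkflow, fileRef, i))
	}
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}

// Placeholder for the real per-record orchestration.
func RecordWorkflow(ctx workflow.Context, fileRef string, index int) error { return nil }
```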

Thanks Maxim for the feedback.

In regards to Option 1, what is your definition of short-lived?

In the sample workflow diagram, I think that most of the Activities would complete processing a file within a couple of minutes, with some maybe taking up to 30 minutes. The “Save” Activity would be writing objects to something like Elastic, so it would take longer depending on the size of the file. I assume the “Save” Activity could also process batches in parallel to speed things up.

If we do go with more of a streaming approach, I like the idea of the iterator workflow. I’ve seen the ContinueAsNew option in some of the examples. If we did want to take advantage of bulk saving, would the steps be something similar to the following (a code sketch follows the lists below):

  1. Have a loop which gets a batch of x objects to process, keeping track of the current batch count.
  2. Create x asynchronous child workflows, where each child workflow would be similar to the workflow diagram except with the “Save” & “Update Status” Activities removed.
  3. Wait for all child workflows to complete.
  4. Get back the results from the x child workflows and then pass them to the “Save” Activity.
  5. Based on the number of iterations of the loop, call “Continue As New”.
  6. Once all objects are processed, call the “Update Status” Activity.

This would:

  1. Avoid any issues with event or history limits
  2. Allow objects to be processed in parallel
  3. Allow bulk saving of objects
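
Something like this rough Go sketch of the proposed steps, assuming hypothetical “GetBatch”, “Save” and “UpdateStatus” activities and a hypothetical ProcessObject child workflow (the child mirrors the diagram minus the “Save” and “Update Status” steps):

```go
package batch

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

const batchesPerRun = 10 // continue-as-new after this many batches to bound the history

// Sketch of the proposed steps; all names are hypothetical placeholders.
func BulkIteratorWorkflow(ctx workflow.Context, fileRef string, offset, batchSize int) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
		HeartbeatTimeout:    time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	for i := 0; i < batchesPerRun; i++ {
		// 1. Get the next batch of objects.
		var batch []Object
		if err := workflow.ExecuteActivity(ctx, "GetBatch", fileRef, offset, batchSize).Get(ctx, &batch); err != nil {
			return err
		}
		if len(batch) == 0 {
			// 6. All objects processed: update the status and finish.
			return workflow.ExecuteActivity(ctx, "UpdateStatus", fileRef).Get(ctx, nil)
		}

		// 2. Start one child workflow per object in the batch.
		var futures []workflow.ChildWorkflowFuture
		for _, obj := range batch {
			futures = append(futures, workflow.ExecuteChildWorkflow(ctx, ProcessObject, obj))
		}

		// 3 & 4. Wait for all children and collect their results.
		results := make([]Result, 0, len(futures))
		for _, f := range futures {
			var r Result
			if err := f.Get(ctx, &r); err != nil {
				return err
			}
			results = append(results, r)
		}

		// 4. Bulk-save the whole batch in one activity call (keep the batch a reasonable size).
		if err := workflow.ExecuteActivity(ctx, "Save", results).Get(ctx, nil); err != nil {
			return err
		}
		offset += len(batch)
	}

	// 5. Bound the history by continuing as new with the updated offset.
	return workflow.NewContinueAsNewError(ctx, BulkIteratorWorkflow, fileRef, offset, batchSize)
}

type Object struct{ ID string }
type Result struct{ ID string }

// Placeholder for the per-object child workflow.
func ProcessObject(ctx workflow.Context, obj Object) (Result, error) { return Result{ID: obj.ID}, nil }
```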

In regards to Option 1, what is your definition of short-lived?

By short-lived I meant that the file can be processed sequentially, row by row. If processing each row takes a long time or requires some sort of state machine, a simple scanning solution is not going to work. From the Temporal point of view, an activity can run as long as needed. For such long-running activities, heartbeating is important to detect failures in a timely manner.
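
A sketch of a long-running, heartbeating activity that resumes from its last recorded row on retry; processRow is a hypothetical helper, and the heartbeat timeout itself is set in the workflow’s ActivityOptions:

```go
package batch

import (
	"context"

	"go.temporal.io/sdk/activity"
)

// The HeartbeatTimeout configured by the calling workflow lets the server
// detect a dead worker quickly, and on retry the activity resumes from the
// last heartbeated row instead of starting over.
func ScanFileActivity(ctx context.Context, fileRef string, totalRows int) error {
	startRow := 0
	if activity.HasHeartbeatDetails(ctx) {
		// A previous attempt recorded progress; pick up where it left off.
		if err := activity.GetHeartbeatDetails(ctx, &startRow); err != nil {
			startRow = 0
		}
	}

	for row := startRow; row < totalRows; row++ {
		if err := processRow(ctx, fileRef, row); err != nil {
			return err
		}
		if row%100 == 0 {
			activity.RecordHeartbeat(ctx, row) // record progress and keep the activity alive
		}
	}
	return nil
}

func processRow(ctx context.Context, fileRef string, row int) error { return nil } // placeholder
```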

The proposed design looks good to me. Make sure that the “bulk saving” activity input is of reasonable size.

Thanks Maxim.