We have a simple workflow that looks similar to this:
The workflow starts with a file that has potentially millions of records that need to be processed. I can see a few options for processing this:
- Create a single workflow with an Activity at each stage. Each Activity takes a reference to a file, processes the file, and returns a reference to the newly created processed file, which is then passed on to the next Activity, and so on. Each Activity can record heartbeat details so that, if anything fails during processing, a retry can resume from the last recorded point.
- The original file is opened and each object is processed in a loop. Each Activity would then process only a single object, so the object could probably be passed as an argument rather than referenced via a file.
- The original file is opened and a separate child workflow is spawned for each object in the file. Each Activity would again process only a single object, which could probably be passed as an argument rather than referenced via a file. The parent workflow would need to track each child workflow so that it completes only once all children are complete.
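For option 1, the resume-from-heartbeat idea can be sketched in plain Python (no workflow SDK here; `process_file` and the `last_checkpoint` parameter are hypothetical names used only for illustration). The Activity would periodically record the index it has reached, and a retried attempt would start from just past the last recorded index:

```python
def process_file(records, transform, last_checkpoint=None):
    """Process records, skipping any already handled before a failure.

    last_checkpoint is the index recorded by the most recent heartbeat
    (None on the first attempt)."""
    start = 0 if last_checkpoint is None else last_checkpoint + 1
    output = []
    for i in range(start, len(records)):
        output.append(transform(records[i]))
        # In a real Activity, progress would be heartbeated here,
        # e.g. recording i as the heartbeat details.
    return output

# First attempt processes everything; a retry after failing past index 1
# resumes at index 2 instead of reprocessing from the start.
full = process_file([1, 2, 3, 4], lambda x: x * 2)
resumed = process_file([1, 2, 3, 4], lambda x: x * 2, last_checkpoint=1)
```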
Option 2 has the potential to create millions of events within a single workflow, so I think the internal limit of 100k events per workflow would render this option unviable.
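As a rough back-of-envelope check on option 2 (the events-per-Activity figure below is an assumption for illustration; the exact count varies by engine and configuration, but each Activity invocation adds several history events such as scheduled/started/completed):

```python
EVENTS_PER_ACTIVITY = 5   # assumption, not an exact figure
records = 1_000_000
stages = 1                # even a single Activity call per record

total_events = records * stages * EVENTS_PER_ACTIVITY
print(total_events)       # millions of events, far beyond a per-workflow limit
```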
Options 1 and 3 both look like they could work.
Option 1 Pros & Cons
- Pro: Does not trigger multiple workflows.
- Con: Cannot process multiple objects in parallel.
Option 3 Pros & Cons
- Pro: Can process multiple objects in parallel.
- Con: Have to create a lot of workflows.
- Con: Cannot take advantage of any bulk operations, e.g. bulk saving of objects.
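The parent/child tracking in option 3 is essentially a fan-out/join. A minimal analogy in plain Python using asyncio (the real version would use the workflow engine's child-workflow API; the function names here are illustrative only, not engine API):

```python
import asyncio

async def process_object(obj):
    # Stand-in for a child workflow that processes one record.
    return obj * 2

async def parent(objects):
    # Spawn one "child" per object and complete only once all have finished.
    tasks = [asyncio.create_task(process_object(o)) for o in objects]
    return await asyncio.gather(*tasks)

results = asyncio.run(parent([1, 2, 3]))
```

A commonly suggested variation, if the per-child overhead or the loss of bulk operations becomes a problem, is to fan out over batches of objects rather than single objects, so each child can still use bulk saves.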
In this type of scenario, is there any advice on which approach should be taken, or pitfalls that one approach may have over the other?