We’re working on a Temporal setup with a parent workflow named CrawlerWorkflow and a child workflow named ProcessingFileWorkflow. The CrawlerWorkflow is responsible for crawling and orchestrating the processing of millions of files, while the ProcessingFileWorkflow handles the actual processing of each file.
The child workflow (ProcessingFileWorkflow) includes several resource-intensive activities, such as:
Parsing file content (including OCR in some cases)
Document classification
Keyword extraction
The CrawlerWorkflow currently uses buffered channels and child workflows to manage concurrent processing. We’ve set a limit on the number of concurrent workflows and are using selectors to manage futures. However, we’re concerned about scalability and performance, particularly given the volume of files and the heavy activities within the ProcessingFileWorkflow.
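For concreteness, the orchestration currently has roughly this shape (a minimal sketch; the child workflow is invoked by name, and the cap and the in-memory file list are illustrative, not our actual code):

```go
package crawler

import "go.temporal.io/sdk/workflow"

// CrawlerWorkflow caps the number of ProcessingFileWorkflow children in flight.
func CrawlerWorkflow(ctx workflow.Context, files []string) error {
	const maxParallel = 10 // illustrative cap

	selector := workflow.NewSelector(ctx)
	inFlight := 0

	for _, file := range files {
		// Wait for a slot when the cap is reached; each Select resolves one future.
		for inFlight >= maxParallel {
			selector.Select(ctx)
			inFlight--
		}
		future := workflow.ExecuteChildWorkflow(ctx, "ProcessingFileWorkflow", file)
		inFlight++
		selector.AddFuture(future, func(f workflow.Future) {
			// Error handling and result collection omitted for brevity.
			_ = f.Get(ctx, nil)
		})
	}

	// Drain the remaining children.
	for inFlight > 0 {
		selector.Select(ctx)
		inFlight--
	}
	return nil
}
```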
Key details:
Targeting millions of files across multiple data sources.
The ProcessingFileWorkflow includes heavy activities like OCR, classification, and keyword extraction, which could potentially become bottlenecks.
We need to ensure that these workflows are executed efficiently and that the system can handle retries and failures gracefully.
We are running into limitations on the number of child workflows.
Questions:
Are there any best practices or design patterns in Temporal that we should consider for this scale of file processing, especially with these heavy activities in the child workflow?
How should we manage the concurrency and rate-limiting of child workflows effectively, considering the resource-intensive nature of some tasks?
How can we implement this without running into the limits on the number of child workflows?
The first question to answer is how much parallelism you want in the application. You need to process millions of files, but do you need to process all of them in parallel, or can you work through them in smaller chunks? For your use case, processing them with a limited number of parallel ProcessingFileWorkflow instances sounds like the way to go.
There are two separate ways to limit resource usage: limiting the global number of ProcessingFileWorkflow instances or limiting the number of specific heavy activities on each host. You probably want to do both.
Look at the batch-sliding-window sample, which demonstrates processing a large number of items using a sliding window. There is also a simpler approach using an iterator pattern; that sample is in Java, as it hasn’t been ported to Go yet. To limit specific activities per host, look at worker.Options.MaxConcurrentActivityExecutionSize.
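As a minimal sketch of the per-host limit (the task queue name and the numbers are illustrative, not recommendations):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	w := worker.New(c, "file-processing", worker.Options{
		// Caps how many activities run concurrently on this host, bounding
		// the load from OCR, classification, and keyword extraction.
		MaxConcurrentActivityExecutionSize: 10,
		// Optionally also rate-limit activities across the whole task queue.
		TaskQueueActivitiesPerSecond: 100,
	})

	// Workflows and activities would be registered here before starting.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```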
We need to process them in batches because we are indexing a high volume of data sources.
Yes, we are limiting both of them.
Can we do something like the child-workflow-continue-as-new sample, but for the parent workflow, so we don’t run into the history size limit in the parent workflow?
Note: some of the activities may take more than 30 minutes to finish, and they call another service’s API. How do we make sure this won’t lead to a deadlock?
Look at the samples I linked. They use continue-as-new to work around the history size limit.
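A minimal sketch of the idea, assuming a hypothetical ListFiles activity that pages through the crawl results (names, page size, and timeouts are illustrative):

```go
package crawler

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// CrawlerWorkflow processes one bounded page of files per run and then
// continues as new, so its history never grows past that page.
func CrawlerWorkflow(ctx workflow.Context, cursor string) error {
	const filesPerRun = 500 // illustrative page size

	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	// Hypothetical activity that lists the next page of files to process.
	var page struct {
		Files      []string
		NextCursor string
	}
	if err := workflow.ExecuteActivity(ctx, "ListFiles", cursor, filesPerRun).Get(ctx, &page); err != nil {
		return err
	}

	for _, file := range page.Files {
		// In the real workflow these run through the bounded-parallelism selector;
		// sequential here for brevity.
		if err := workflow.ExecuteChildWorkflow(ctx, "ProcessingFileWorkflow", file).Get(ctx, nil); err != nil {
			return err
		}
	}

	if page.NextCursor == "" {
		return nil // everything processed
	}
	// Start a fresh run with an empty history, carrying only the cursor forward.
	return workflow.NewContinueAsNewError(ctx, CrawlerWorkflow, page.NextCursor)
}
```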
I don’t understand why you think that a long-running activity can cause a deadlock. Temporal doesn’t allow calling external APIs from workflow code directly; violating this might cause the deadlock detector to fire.
Are you calling an API that takes 30 minutes synchronously?
Yes, because we need the output from the call to do the next activity.
Sorry, this doesn’t make sense to me. What protocol do you use to make a 30-minute synchronous call? And what does it have to do with the sequencing of activities?
Note: we are calling the API from the activity, not the workflow.
One of our gRPC services is responsible for parsing document content; a document can be a fully scanned file with more than 1,000 pages to OCR.
This can take 2-30 minutes. The parsed content is stored in a MinIO bucket and the ObjectName is returned, so the remaining activities can read the content from MinIO and perform classification and keyword extraction.
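In code, the pipeline we have today is roughly this (activity names and timeout values are illustrative):

```go
package processing

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// ProcessingFileWorkflow parses a file, then classifies it and extracts
// keywords using the MinIO ObjectName returned by the parse step.
func ProcessingFileWorkflow(ctx workflow.Context, fileURL string) error {
	// Generous timeout for the occasional 1,000+ page OCR job.
	parseCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Hour,
	})

	var objectName string
	if err := workflow.ExecuteActivity(parseCtx, "ParseContent", fileURL).Get(ctx, &objectName); err != nil {
		return err
	}

	// The remaining activities read the parsed content from MinIO by ObjectName.
	actCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
	})
	if err := workflow.ExecuteActivity(actCtx, "ClassifyDocument", objectName).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(actCtx, "ExtractKeywords", objectName).Get(ctx, nil)
}
```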
But on second thought, I think we can implement another workflow and trigger it whenever content parsing finishes, so it can run the remaining activities.
Sorry, I think I wasn’t very clear with my question.
AFAIK you cannot make a 30-minute synchronous gRPC call. What is the protocol between you and the parsing service that supports such long-running operations?
Can you host a Temporal activity inside that service? Temporal activities support long running operations out of the box through heartbeating.
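A minimal sketch of such an activity, assuming a placeholder parseAndStoreInMinio helper standing in for the real OCR and MinIO upload logic:

```go
package parsing

import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
)

// ParseContent is an activity hosted inside the parsing service. It heartbeats
// while the long parse/OCR work runs so the Temporal server can detect a dead
// worker quickly and retry elsewhere, instead of waiting out a long timeout.
func ParseContent(ctx context.Context, fileURL string) (string, error) {
	done := make(chan struct{})
	defer close(done)

	// Heartbeat in the background; this requires a HeartbeatTimeout to be set
	// in the workflow's ActivityOptions to have any effect.
	go func() {
		ticker := time.NewTicker(20 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				activity.RecordHeartbeat(ctx, "parsing in progress")
			}
		}
	}()

	return parseAndStoreInMinio(ctx, fileURL)
}

// parseAndStoreInMinio is a placeholder for the real OCR + MinIO upload logic;
// it would return the ObjectName of the stored parsed content.
func parseAndStoreInMinio(ctx context.Context, fileURL string) (string, error) {
	_ = fileURL
	return "parsed/object-name", nil
}
```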
We have only a few files of this kind: out of 1 million files, fewer than 20 may take this long. Most files take less than 1 minute to parse.