We’re working on a Temporal setup with a parent workflow named CrawlerWorkflow and a child workflow named ProcessingFileWorkflow. The CrawlerWorkflow is responsible for crawling and orchestrating the processing of millions of files, while the ProcessingFileWorkflow handles the actual processing of each file.
The child workflow (ProcessingFileWorkflow) includes several resource-intensive activities, such as:
Parsing file content (including OCR in some cases)
Document classification
Keyword extraction
The CrawlerWorkflow currently uses buffered channels and child workflows to manage concurrent processing. We’ve set a limit on the number of concurrent workflows and are using selectors to manage futures. However, we’re concerned about scalability and performance, particularly given the volume of files and the heavy activities within the ProcessingFileWorkflow.
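For concreteness, the orchestration currently has roughly this shape (a minimal sketch; the child workflow is invoked by name, and the cap and the in-memory file list are illustrative, not our actual code):

```go
package crawler

import "go.temporal.io/sdk/workflow"

// CrawlerWorkflow caps the number of ProcessingFileWorkflow children in flight.
func CrawlerWorkflow(ctx workflow.Context, files []string) error {
	const maxParallel = 10 // illustrative cap

	selector := workflow.NewSelector(ctx)
	inFlight := 0

	for _, file := range files {
		// Wait for a slot when the cap is reached; each Select resolves one future.
		for inFlight >= maxParallel {
			selector.Select(ctx)
			inFlight--
		}
		future := workflow.ExecuteChildWorkflow(ctx, "ProcessingFileWorkflow", file)
		inFlight++
		selector.AddFuture(future, func(f workflow.Future) {
			// Error handling and result collection omitted for brevity.
			_ = f.Get(ctx, nil)
		})
	}

	// Drain the remaining children.
	for inFlight > 0 {
		selector.Select(ctx)
		inFlight--
	}
	return nil
}
```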
Key details:
Targeting millions of files across multiple data sources.
The ProcessingFileWorkflow includes heavy activities like OCR, classification, and keyword extraction, which could potentially become bottlenecks.
We need to ensure that these workflows are executed efficiently and that the system can handle retries and failures gracefully.
We are running into limitations on the number of child workflows.
Questions:
Are there any best practices or design patterns in Temporal that we should consider for this scale of file processing, especially with these heavy activities in the child workflow?
How should we manage the concurrency and rate-limiting of child workflows effectively, considering the resource-intensive nature of some tasks?
How can we implement this without running into the limits on the number of child workflows?
The first question to answer is how much parallelism you want in the application. You need to process millions of files, but do you need to process all of them in parallel, or can you work through them in smaller chunks? For your use case, processing them with a limited number of parallel ProcessingFileWorkflow instances sounds like the way to go.
There are two separate ways to limit resource usage: limiting the global number of ProcessingFileWorkflow instances or limiting the number of specific heavy activities on each host. You probably want to do both.
Look at the batch-sliding-window sample, which demonstrates processing a large number of items using a sliding window. There is also a simpler approach using an iterator pattern; that sample is in Java, as it hasn’t been ported to Go yet. To limit specific activities per host, look at worker.Options.MaxConcurrentActivityExecutionSize.
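As a minimal sketch of the per-host limit (the task queue name and the numbers are illustrative, not recommendations):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	w := worker.New(c, "file-processing", worker.Options{
		// Caps how many activities run concurrently on this host, bounding
		// the load from OCR, classification, and keyword extraction.
		MaxConcurrentActivityExecutionSize: 10,
		// Optionally also rate-limit activities across the whole task queue.
		TaskQueueActivitiesPerSecond: 100,
	})

	// Workflows and activities would be registered here before starting.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```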
We need to process them in batches because we are indexing a high volume of data sources.
Yes, we are limiting both of them.
Can we do something like the child-workflow-continue-as-new sample, but for the parent workflow, so we don’t run into the history size limit in the parent workflow?
Note: some of the activities may take more than 30 minutes to finish, and they call another service’s API. How do we make sure this won’t lead to a deadlock?
Look at the samples I linked. They use continue-as-new to work around the history size limit.
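A minimal sketch of the idea, assuming a hypothetical ListFiles activity that pages through the crawl results (names, page size, and timeouts are illustrative):

```go
package crawler

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// CrawlerWorkflow processes one bounded page of files per run and then
// continues as new, so its history never grows past that page.
func CrawlerWorkflow(ctx workflow.Context, cursor string) error {
	const filesPerRun = 500 // illustrative page size

	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	// Hypothetical activity that lists the next page of files to process.
	var page struct {
		Files      []string
		NextCursor string
	}
	if err := workflow.ExecuteActivity(ctx, "ListFiles", cursor, filesPerRun).Get(ctx, &page); err != nil {
		return err
	}

	for _, file := range page.Files {
		// In the real workflow these run through the bounded-parallelism selector;
		// sequential here for brevity.
		if err := workflow.ExecuteChildWorkflow(ctx, "ProcessingFileWorkflow", file).Get(ctx, nil); err != nil {
			return err
		}
	}

	if page.NextCursor == "" {
		return nil // everything processed
	}
	// Start a fresh run with an empty history, carrying only the cursor forward.
	return workflow.NewContinueAsNewError(ctx, CrawlerWorkflow, page.NextCursor)
}
```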
I don’t understand why you think that a long-running activity can cause a deadlock. Temporal doesn’t allow calling external APIs from workflow code directly; violating this might cause the deadlock detector to fire.
Are you calling an API that takes 30 minutes synchronously?
Yes, because we need the output from the call to do the next activity.
Sorry, this doesn’t make sense to me. What protocol do you use to make a 30-minute synchronous call? And what does it have to do with the sequencing of activities?
Note: we are calling the API from the activity, not the workflow.
One of our gRPC services is responsible for parsing document content; a document can be a fully scanned file with more than 1,000 pages to OCR.
This can take 2-30 minutes. The parsed content is stored in a MinIO bucket and the ObjectName is returned, so the remaining activities can read the content from MinIO and perform classification and keyword extraction.
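In code, the pipeline we have today is roughly this (activity names and timeout values are illustrative):

```go
package processing

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// ProcessingFileWorkflow parses a file, then classifies it and extracts
// keywords using the MinIO ObjectName returned by the parse step.
func ProcessingFileWorkflow(ctx workflow.Context, fileURL string) error {
	// Generous timeout for the occasional 1,000+ page OCR job.
	parseCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Hour,
	})

	var objectName string
	if err := workflow.ExecuteActivity(parseCtx, "ParseContent", fileURL).Get(ctx, &objectName); err != nil {
		return err
	}

	// The remaining activities read the parsed content from MinIO by ObjectName.
	actCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
	})
	if err := workflow.ExecuteActivity(actCtx, "ClassifyDocument", objectName).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(actCtx, "ExtractKeywords", objectName).Get(ctx, nil)
}
```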
But on second thought, I think we can implement another workflow and trigger it whenever content parsing finishes, so it can run the remaining activities.
Sorry, I think I wasn’t very clear with my question.
AFAIK you cannot make a 30-minute synchronous gRPC call. What is the protocol between you and the parsing service that supports such long-running operations?
Can you host a Temporal activity inside that service? Temporal activities support long running operations out of the box through heartbeating.
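A minimal sketch of such an activity, assuming a placeholder parseAndStoreInMinio helper standing in for the real OCR and MinIO upload logic:

```go
package parsing

import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
)

// ParseContent is an activity hosted inside the parsing service. It heartbeats
// while the long parse/OCR work runs so the Temporal server can detect a dead
// worker quickly and retry elsewhere, instead of waiting out a long timeout.
func ParseContent(ctx context.Context, fileURL string) (string, error) {
	done := make(chan struct{})
	defer close(done)

	// Heartbeat in the background; this requires a HeartbeatTimeout to be set
	// in the workflow's ActivityOptions to have any effect.
	go func() {
		ticker := time.NewTicker(20 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				activity.RecordHeartbeat(ctx, "parsing in progress")
			}
		}
	}()

	return parseAndStoreInMinio(ctx, fileURL)
}

// parseAndStoreInMinio is a placeholder for the real OCR + MinIO upload logic;
// it would return the ObjectName of the stored parsed content.
func parseAndStoreInMinio(ctx context.Context, fileURL string) (string, error) {
	_ = fileURL
	return "parsed/object-name", nil
}
```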
We have only a few files of this kind: out of 1 million files, fewer than 20 may take this long. Most files take less than 1 minute to parse.