Efficiently Stream and Process Product Data Using a Sliding Window Approach

I have a file containing data for multiple products, and my objective is to group all relevant data for each product together to process it efficiently. Instead of downloading the entire file, I want to stream the data directly to avoid loading it all into memory.

My plan is to use a sliding window approach, processing the data in small, manageable segments as it streams in, ensuring that each product’s data is handled appropriately.

Let’s say I have a file:
Json0
Json1
Json2
Json3
Json4
Json5
Json6
Json7
Json8
Json9

I have a batch size of 2. Now I want to stream this file and produce 2 records from it by grouping lines according to some condition, i.e.
Group-1 → Json0, Json1, Json2 (grouping first 3 lines)
Group-2 → Json3, Json4 (grouping next 2 lines)
Now the sliding window will process these two groups.

Then, in the next turn, the stream continues and the next two records are generated, i.e.
Group-3 → Json5, Json6, Json7 (grouping next 3 lines)
Group-4 → Json8, Json9 (grouping next 2 lines)
Now the sliding window should process these groups, and so on…
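
To make the intent concrete, here is a rough Python sketch of what I mean; the `same_group` predicate and the `productId` field are placeholders for whatever condition actually defines a group:

```python
import json
from typing import Callable, Iterator, List

def stream_groups(lines: Iterator[str],
                  same_group: Callable[[dict, dict], bool]) -> Iterator[List[dict]]:
    """Yield one group of records at a time; only the current group is in memory."""
    group: List[dict] = []
    for line in lines:
        record = json.loads(line)
        if group and not same_group(group[-1], record):
            yield group  # group boundary reached
            group = []
        group.append(record)
    if group:
        yield group  # flush the final group

def process_window(window: List[List[dict]]) -> None:
    # Placeholder for whatever processing each window of groups needs.
    for group in window:
        print(f"processing group of {len(group)} records")

def sliding_window(groups: Iterator[List[dict]], window_size: int = 2) -> None:
    """Collect `window_size` groups, process them, then move to the next window.

    Note: as described above, the windows do not overlap.
    """
    window: List[List[dict]] = []
    for group in groups:
        window.append(group)
        if len(window) == window_size:
            process_window(window)
            window = []
    if window:
        process_window(window)  # flush a final partial window

with open("products.jsonl") as f:
    groups = stream_groups(f, lambda a, b: a.get("productId") == b.get("productId"))
    sliding_window(groups, window_size=2)
```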

I’m looking for recommendations on the best approach to achieve this, balancing efficiency and memory usage while processing the data in real time.

Use some existing big data framework for this. Temporal doesn’t have built-in support for such a use case.
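
For example, a minimal Spark Structured Streaming sketch; the input path, schema, and `productId` grouping key are assumptions, not part of the original recommendation:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list
from pyspark.sql.types import StringType, StructType

spark = SparkSession.builder.appName("product-grouping").getOrCreate()

schema = (StructType()
          .add("productId", StringType())
          .add("payload", StringType()))

# Stream newline-delimited JSON files as they appear in the directory.
records = spark.readStream.schema(schema).json("s3://my-bucket/products/")

def handle_batch(batch_df, batch_id):
    # Each micro-batch plays the role of the window in the question.
    # Grouping within a micro-batch assumes a product's records arrive together.
    grouped = batch_df.groupBy("productId").agg(collect_list("payload").alias("data"))
    grouped.show(truncate=False)

query = records.writeStream.foreachBatch(handle_batch).start()
query.awaitTermination()
```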

So there is no recommended way for this scenario, i.e., to stream the file, group the data, and process it in batches using a sliding window, for example with child workflows?

Also, if we use big data to handle the streaming part, what is the best way to handle the scenario using big data streaming and Temporal?

Temporal is useful when processing records in a file requires external API calls that can fail or take a long time. It looks like “processing,” in your case, is just aggregating data that is found in the file.

Use Temporal as the control plane for the big data pipeline.
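
A minimal sketch of that control-plane pattern with the Temporal Python SDK; `submit_spark_job` and `wait_for_job` are hypothetical activities wrapping whatever client your cluster exposes:

```python
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def submit_spark_job(file_url: str) -> str:
    """Submit the streaming/aggregation job to the cluster (hypothetical client)."""
    # e.g. return spark_client.submit(file_url).job_id
    raise NotImplementedError

@activity.defn
async def wait_for_job(job_id: str) -> None:
    """Poll the cluster until the job finishes; raising lets Temporal retry."""
    # e.g. call activity.heartbeat(job_id) while polling
    raise NotImplementedError

@workflow.defn
class ProductAggregationWorkflow:
    @workflow.run
    async def run(self, file_url: str) -> None:
        # Temporal durably tracks each step, retries failures, and survives
        # worker restarts; the heavy data movement stays in the big data engine.
        job_id = await workflow.execute_activity(
            submit_spark_job,
            file_url,
            start_to_close_timeout=timedelta(minutes=5),
        )
        await workflow.execute_activity(
            wait_for_job,
            job_id,
            start_to_close_timeout=timedelta(hours=2),
            heartbeat_timeout=timedelta(minutes=2),
        )
```

The big data framework does the heavy streaming and grouping, while Temporal durably tracks the job, retries failures, and gives you visibility into the pipeline.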