I have a file containing data for multiple products, and my objective is to group all relevant data for each product together to process it efficiently. Instead of downloading the entire file, I want to stream the data directly to avoid loading it all into memory.
My plan is to use a sliding window approach, processing the data in small, manageable segments as it streams in, ensuring that each product’s data is handled appropriately.
Let’s say I have a file with these lines:
Json0
Json1
Json2
Json3
Json4
Json5
Json6
Json7
Json8
Json9
My batch size is 2. I want to stream this file and produce 2 records at a time, where each record is a group of consecutive lines formed according to some conditions, e.g.:
Group-1 → Json0, Json1, Json2 (grouping first 3 lines)
Group-2 → Json3, Json4 (grouping next 2 lines)
The sliding window will then process these two groups.
Then, in the next turn, the file will be streamed further to generate the next two records, i.e.
Group-3 → Json5, Json6, Json7 (grouping next 3 lines)
Group-4 → Json8, Json9 (grouping next 2 lines)
The sliding window should then process these groups, and so on…
I’m looking for recommendations on the best approach to achieve this, balancing efficiency and memory usage while processing the data in real time.
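To make the idea concrete, here is a rough Python sketch of what I have in mind, using generators so that only the current batch of groups is ever held in memory. The `same_group` predicate is a hypothetical stand-in for my real product-matching condition:

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List

def group_records(lines: Iterable[str],
                  same_group: Callable[[dict, dict], bool]) -> Iterator[List[dict]]:
    """Parse JSON lines lazily and group consecutive records while
    `same_group(previous, current)` holds."""
    group: List[dict] = []
    for line in lines:
        record = json.loads(line)
        if group and not same_group(group[-1], record):
            yield group          # group boundary reached: emit and start anew
            group = [record]
        else:
            group.append(record)
    if group:
        yield group              # flush the final group at end of stream

def batched(groups: Iterator[List[dict]],
            batch_size: int) -> Iterator[List[List[dict]]]:
    """Yield `batch_size` groups at a time — the window the processor sees."""
    while True:
        batch = list(islice(groups, batch_size))
        if not batch:
            return
        yield batch
```

Iterating a file object line by line (`for line in f`) streams it without loading everything, so something like `for batch in batched(group_records(f, same_product), 2): process(batch)` would hand the processor two groups per turn, matching the Group-1/Group-2 then Group-3/Group-4 walkthrough above. Would this kind of generator pipeline be a reasonable approach, or is there a better pattern for this?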