I have a file containing data for multiple products, and my objective is to group all relevant data for each product together to process it efficiently. Instead of downloading the entire file, I want to stream the data directly to avoid loading it all into memory.
My plan is to use a sliding window approach, processing the data in small, manageable segments as it streams in, ensuring that each product’s data is handled appropriately.
Let’s say I have a file with these lines:
Json0
Json1
Json2
Json3
Json4
Json5
Json6
Json7
Json8
Json9
My batch size is 2. I want to stream this file and produce 2 records at a time, where each record is a group of consecutive lines formed according to some conditions, e.g.:
Group-1 → Json0, Json1, Json2 (grouping first 3 lines)
Group-2 → Json3, Json4 (grouping next 2 lines)
The sliding window will then process these two groups.
Then, in the next turn, the file will be streamed further to generate the next two records, i.e.
Group-3 → Json5, Json6, Json7 (grouping next 3 lines)
Group-4 → Json8, Json9 (grouping next 2 lines)
The sliding window should then process these groups, and so on…
I’m looking for recommendations on the best approach to achieve this, balancing efficiency and memory usage while processing the data in real time.
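To make the idea concrete, here is a rough Python sketch of what I have in mind, using generators so that only the current batch of groups is ever held in memory. The `same_group` predicate is a hypothetical stand-in for my real product-matching condition:

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List

def group_records(lines: Iterable[str],
                  same_group: Callable[[dict, dict], bool]) -> Iterator[List[dict]]:
    """Parse JSON lines lazily and group consecutive records while
    `same_group(previous, current)` holds."""
    group: List[dict] = []
    for line in lines:
        record = json.loads(line)
        if group and not same_group(group[-1], record):
            yield group          # group boundary reached: emit and start anew
            group = [record]
        else:
            group.append(record)
    if group:
        yield group              # flush the final group at end of stream

def batched(groups: Iterator[List[dict]],
            batch_size: int) -> Iterator[List[List[dict]]]:
    """Yield `batch_size` groups at a time — the window the processor sees."""
    while True:
        batch = list(islice(groups, batch_size))
        if not batch:
            return
        yield batch
```

Iterating a file object line by line (`for line in f`) streams it without loading everything, so something like `for batch in batched(group_records(f, same_product), 2): process(batch)` would hand the processor two groups per turn, matching the Group-1/Group-2 then Group-3/Group-4 walkthrough above. Would this kind of generator pipeline be a reasonable approach, or is there a better pattern for this?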