Best way to stream data between activities in Temporal

Hi there,

I am using Temporal to perform activities in sequence. It works fine when activity A receives a batch of data and returns a transformed batch to activity B.
The problem is when the batch is too large for one worker to handle, for example a batch with a million records. Is there any way to do stream processing on Temporal without hitting the history limit? And is this use case a good fit for Temporal's design?

Example use case: I have an external data source that sends 1 million records to our Temporal worker by triggering a new workflow, and in that workflow we need to perform some data transformation.

  • Current approach: we pass all 1 million records into each activity, which is not a good idea.
  • Another approach: we store the data in S3 and pass the key into each activity. This can handle a large amount of data, but it involves a lot of S3 I/O, which takes time (see the sketch after this list).
  • Expected approach: it would be great if we could pass records into activities one by one, like map-reduce.
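
For reference, a rough Go SDK sketch of the S3-key approach; the activity names and the staging step are just placeholders, not our real code:

```go
package app

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// ExtractActivity and TransformActivity are placeholders: the real ones
// would read and write the actual records in S3 and only return keys.
func ExtractActivity(ctx context.Context, inputKey string) (string, error) {
	// download from S3, stage the records, upload the staged copy (omitted)
	return inputKey + "-staged", nil
}

func TransformActivity(ctx context.Context, stagedKey string) (string, error) {
	// download the staged records, transform them, upload the result (omitted)
	return stagedKey + "-transformed", nil
}

// TransformWorkflow passes only S3 keys between activities, so the workflow
// history stays small no matter how big the batch is.
func TransformWorkflow(ctx workflow.Context, inputKey string) (string, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
	})

	var stagedKey string
	if err := workflow.ExecuteActivity(ctx, ExtractActivity, inputKey).Get(ctx, &stagedKey); err != nil {
		return "", err
	}

	var resultKey string
	if err := workflow.ExecuteActivity(ctx, TransformActivity, stagedKey).Get(ctx, &resultKey); err != nil {
		return "", err
	}
	return resultKey, nil
}
```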

If anyone has had the same use case, please share your approach to handling this situation. Many thanks!

You can start both activities in the same process and connect them through an in memory stream.
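
A minimal sketch of that idea in the Go SDK, assuming both activities end up running in the same worker process (the package-level channel and the helper functions are placeholders):

```go
package app

import "context"

// recordStream lives in process memory, so this only connects the two
// activities if both execute inside the same worker process. A real
// implementation would also need one stream per workflow run rather than
// a single package-level channel.
var recordStream = make(chan string, 1000)

// loadRecords and transform are stand-ins for the real reader and transformation.
func loadRecords(sourceKey string) []string { return []string{"record-1", "record-2"} }
func transform(record string) string        { return record + "-transformed" }

// ProduceActivity streams records from the source into the channel.
func ProduceActivity(ctx context.Context, sourceKey string) error {
	defer close(recordStream)
	for _, record := range loadRecords(sourceKey) {
		recordStream <- record
	}
	return nil
}

// ConsumeActivity transforms records as they arrive, without ever holding
// the whole batch in memory.
func ConsumeActivity(ctx context.Context, resultKey string) error {
	for record := range recordStream {
		_ = transform(record)
	}
	return nil
}
```

The workflow would start both activities as futures so they run concurrently, and route them to the same worker (see the worker-specific task queue note further down).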

How large is a batch of data, in MB or GB?

How is the data delivered to Temporal? Does the external data source send the data to Temporal, such as through a workflow signal or update, or does an activity retrieve the data?

Thanks, but it seems we cannot spread the workload across multiple workers if we follow this approach :thinking:

The biggest batch can be 10GB.

How is the data delivered to Temporal? Does the external data source send the data to Temporal, such as through a workflow signal or update, or does an activity retrieve the data?

It is currently sent when triggering the new Temporal workflow, but of course, since we are still experimenting with Temporal, we have not sent that huge batch yet and are trying to find another way to stream data into a Temporal workflow. We could use signals, but they seem to be limited by the workflow history size, and we are also not sure signals are suitable for streaming data.

No, you can’t route large amounts of data through the Temporal server, it’s not designed for that. You need to store the data somewhere, such as in S3 as you mentioned, and then in the signal you can pass a reference to the data (such as a filename, or an S3 bucket and key).
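
As a rough illustration, assuming a signal named "batch-ready" and a placeholder activity, a workflow could receive only references and use continue-as-new to keep the history bounded:

```go
package app

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// BatchReference is an assumed signal payload: a pointer to data in S3,
// not the data itself.
type BatchReference struct {
	Bucket string
	Key    string
}

// ProcessBatchActivity is a placeholder: it would read the referenced
// object from S3, transform it, and write the result back.
func ProcessBatchActivity(ctx context.Context, ref BatchReference) error {
	return nil
}

// StreamingWorkflow receives "batch-ready" signals carrying references and
// runs one activity per referenced batch.
func StreamingWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
	})
	sigCh := workflow.GetSignalChannel(ctx, "batch-ready")

	// Bound the number of signals handled per run, then continue-as-new
	// so the workflow history never grows past the limit.
	for i := 0; i < 1000; i++ {
		var ref BatchReference
		sigCh.Receive(ctx, &ref)
		if err := workflow.ExecuteActivity(ctx, ProcessBatchActivity, ref).Get(ctx, nil); err != nil {
			return err
		}
	}
	return workflow.NewContinueAsNewError(ctx, StreamingWorkflow)
}
```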

You mentioned that the batch might be too large for a single worker process to handle? So you’d need to split the batch into chunks, and have multiple workers process the chunks in parallel? Something like MapReduce, perhaps?

Temporal doesn’t implement MapReduce itself. You could process the batch using a MapReduce system, and use a workflow to control the MapReduce system through its API.
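
That said, the fan-out/fan-in part can be expressed directly in a workflow by running one activity per chunk. A rough sketch, where SplitActivity and TransformChunkActivity are hypothetical:

```go
package app

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// SplitActivity and TransformChunkActivity are placeholders: split the big
// S3 object into chunk objects, and transform a single chunk, respectively.
func SplitActivity(ctx context.Context, inputKey string) ([]string, error) {
	return []string{inputKey + "-chunk-0", inputKey + "-chunk-1"}, nil
}

func TransformChunkActivity(ctx context.Context, chunkKey string) (string, error) {
	return chunkKey + "-transformed", nil
}

// ChunkedTransformWorkflow fans out one activity per chunk, so any available
// worker can pick up a chunk, then fans the result keys back in.
func ChunkedTransformWorkflow(ctx workflow.Context, inputKey string) ([]string, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
	})

	// "Map" step: split the batch in S3 into smaller chunk objects.
	var chunkKeys []string
	if err := workflow.ExecuteActivity(ctx, SplitActivity, inputKey).Get(ctx, &chunkKeys); err != nil {
		return nil, err
	}

	// Fan out: one activity per chunk, processed in parallel across workers.
	futures := make([]workflow.Future, 0, len(chunkKeys))
	for _, chunkKey := range chunkKeys {
		futures = append(futures, workflow.ExecuteActivity(ctx, TransformChunkActivity, chunkKey))
	}

	// Fan in ("reduce" step): collect the result keys.
	resultKeys := make([]string, 0, len(futures))
	for _, f := range futures {
		var resultKey string
		if err := f.Get(ctx, &resultKey); err != nil {
			return nil, err
		}
		resultKeys = append(resultKeys, resultKey)
	}
	return resultKeys, nil
}
```

Each activity adds only a handful of history events, so a few thousand chunks per run is fine, but one activity per record for a million records would blow past the history limit; that's why you need chunking (or an external MapReduce system) rather than per-record activities.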

Hi @Hung_Nguyen

For this:

You can start both activities in the same process and connect them through an in memory stream.

Thanks, but it seems we cannot spread the workload across multiple workers if we follow this approach :thinking:

You can use the worker-specific task queue approach.
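
A rough sketch of that pattern in the Go SDK, reusing the producer/consumer activities from the in-memory stream sketch above (the queue name and activity names are placeholders): a first activity on the shared task queue reports which worker picked it up, and the remaining activities are scheduled on that worker's own queue:

```go
package app

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// workerTaskQueue would be a unique per-process queue name (for example a
// UUID generated at worker start-up); each worker process runs a second
// worker that polls this queue in addition to the shared one.
var workerTaskQueue = "worker-specific-queue-example"

// PickWorkerActivity runs on the shared task queue and simply reports the
// worker-specific task queue of the process that executed it.
func PickWorkerActivity(ctx context.Context) (string, error) {
	return workerTaskQueue, nil
}

// PinnedWorkflow routes the follow-up activities to that one worker, so they
// can share process memory (e.g. the in-memory stream sketched above) while
// different workflow runs still spread across different workers.
func PinnedWorkflow(ctx workflow.Context, inputKey string) error {
	sharedCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	var hostQueue string
	if err := workflow.ExecuteActivity(sharedCtx, PickWorkerActivity).Get(sharedCtx, &hostQueue); err != nil {
		return err
	}

	pinnedCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		TaskQueue:           hostQueue,
		StartToCloseTimeout: 30 * time.Minute,
	})

	// Run the producer and consumer concurrently, both pinned to the same
	// worker process so they can share the in-memory stream.
	produce := workflow.ExecuteActivity(pinnedCtx, ProduceActivity, inputKey)
	consume := workflow.ExecuteActivity(pinnedCtx, ConsumeActivity, "result-key")
	if err := produce.Get(pinnedCtx, nil); err != nil {
		return err
	}
	return consume.Get(pinnedCtx, nil)
}
```

The Go SDK's sessions feature can automate this kind of worker pinning as well.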

Antonio