I am using Temporal to run activities in sequence. That works fine when activity A receives a batch of data and returns a transformed batch to activity B.
But the problem is when the batch is too large for one worker to handle, for example a batch with a million records. Is there any way to do stream processing on Temporal without hitting the history limit? And is this use case a good fit for Temporal's design?
Example use case: an external data source sends 1 million records to our Temporal worker by triggering a new workflow, and in that workflow we need to perform some data transformation.
Current approach: we pass all 1 million records into each activity, which is not a good idea.
Another approach: we store the data in S3 and pass the key into each activity (rough sketch below). This can handle a large amount of data, but it adds a lot of S3 I/O, which takes time.
Expected approach: it would be great if we could pass records into activities one by one, like MapReduce.
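For reference, this is roughly what our S3-key approach looks like with the Python SDK (untested sketch; the activity and workflow names are made up):

```python
from datetime import timedelta
from temporalio import activity, workflow


@activity.defn
async def transform_records(s3_key: str) -> str:
    # Download the batch from S3, transform it, upload the result,
    # and return the key of the transformed object.
    ...


@activity.defn
async def load_records(s3_key: str) -> None:
    # Download the transformed batch from S3 and load it into the destination.
    ...


@workflow.defn
class BatchTransformWorkflow:
    @workflow.run
    async def run(self, s3_key: str) -> None:
        # Only the S3 key travels through Temporal, never the records themselves.
        transformed_key = await workflow.execute_activity(
            transform_records,
            s3_key,
            start_to_close_timeout=timedelta(minutes=30),
        )
        await workflow.execute_activity(
            load_records,
            transformed_key,
            start_to_close_timeout=timedelta(minutes=30),
        )
```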
If anyone has had the same use case, please share how you handle this situation. Many thanks!
How is the data delivered to Temporal? Does the external data source send the data to Temporal, such as through a workflow signal or update, or does an activity retrieve the data?
It is currently sent when triggering the new Temporal workflow, but of course, since we are still experimenting with Temporal, we did not send that huge batch and are trying to find another way to stream data into a Temporal workflow. We could use signals, but they seem to be limited by the workflow history size, and I'm not sure signals are suitable for streaming data?
No, you can’t route large amounts of data through the Temporal server; it’s not designed for that. You need to store the data somewhere, such as in S3 as you mentioned, and then in the signal you can pass a reference to the data (such as a filename, or an S3 bucket and key).
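For example, an untested sketch with the Python SDK, where the signal carries only an S3 key and the workflow calls continue-as-new to keep its history bounded (the names and the per-run cap are made up):

```python
from datetime import timedelta
from typing import List, Optional

from temporalio import activity, workflow


@activity.defn
async def process_batch(s3_key: str) -> None:
    # Fetch the batch from S3 by key, transform it, and store the result back.
    ...


@workflow.defn
class StreamIngestWorkflow:
    def __init__(self) -> None:
        self._pending: List[str] = []

    @workflow.signal
    def batch_ready(self, s3_key: str) -> None:
        # The external source signals only a reference to the data, never the data itself.
        self._pending.append(s3_key)

    @workflow.run
    async def run(self, carried_over: Optional[List[str]] = None) -> None:
        # Keys carried over from the previous run go to the front of the queue.
        self._pending = list(carried_over or []) + self._pending
        processed = 0
        while processed < 500:  # arbitrary per-run cap to keep history small
            await workflow.wait_condition(lambda: bool(self._pending))
            key = self._pending.pop(0)
            await workflow.execute_activity(
                process_batch,
                key,
                start_to_close_timeout=timedelta(minutes=30),
            )
            processed += 1
        # Roll over to a fresh run, carrying any unprocessed keys forward.
        workflow.continue_as_new(self._pending)
```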
You mentioned that the batch might be too large for a single worker process to handle? So you’d need to split the batch into chunks, and have multiple workers process the chunks in parallel? Something like MapReduce, perhaps?
Temporal doesn’t implement MapReduce itself. You could process the batch using a MapReduce system, and use a workflow to control the MapReduce system through its API.
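If you'd rather keep the transformation inside Temporal, one pattern is to have an activity split the S3 object into chunk keys and then fan the chunks out to activities running in parallel on whatever workers are available. A rough, untested sketch (all names invented):

```python
import asyncio
from datetime import timedelta
from typing import List

from temporalio import activity, workflow


@activity.defn
async def split_into_chunks(s3_key: str) -> List[str]:
    # Split the large S3 object into smaller objects and return their keys.
    ...


@activity.defn
async def transform_chunk(chunk_key: str) -> str:
    # Transform one chunk and return the key of the transformed output.
    ...


@workflow.defn
class ChunkedTransformWorkflow:
    @workflow.run
    async def run(self, s3_key: str) -> List[str]:
        chunk_keys = await workflow.execute_activity(
            split_into_chunks,
            s3_key,
            start_to_close_timeout=timedelta(minutes=30),
        )
        # Fan out: each chunk becomes its own activity, so any available worker
        # can pick it up; results come back as references, not raw records.
        return list(
            await asyncio.gather(
                *(
                    workflow.execute_activity(
                        transform_chunk,
                        key,
                        start_to_close_timeout=timedelta(minutes=30),
                    )
                    for key in chunk_keys
                )
            )
        )
```

Note that the number of chunks still has to stay well within the event history limit, so for a million records you'd keep the chunks fairly coarse or fan out further via child workflows.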