I am building an ETL workflow in which the first activity downloads a list of URLs. For each URL an activity will be spawned to fetch the content of the page it references.
The issue I am facing is that the activity which downloads the URLs returns a list of 22k strings, which exceeds the maximum payload size that can be returned by an activity.
So my question is: would it be possible to use an interceptor to slice this output into chunks before it gets returned to the workflow? If not, what other ways are there to stay under this limit without saving the list to a file?
I don’t think you can slice an activity response into chunks as you’re describing: Temporal will register a single activity completion event for an activity, but it sounds like you’d like multiple.
For ETL use cases, I tend to recommend keeping data that you’re working on in a blob store like S3 (or some other datastore), then passing pointers to this data around in your workflow & activities. Doing so will allow you to work with arbitrary data sizes, and you’ll more easily avoid workflow history size limits.
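As a rough sketch of what I mean (the bucket name, key scheme, and placeholder list below are assumptions on my part, not anything Temporal provides):

```python
import json
import uuid
from typing import List

import boto3  # assumption: the data lives in S3; any blob store works the same way
from temporalio import activity

BUCKET = "my-etl-bucket"  # hypothetical bucket name


@activity.defn
async def download_url_list() -> str:
    """Build the full URL list, stash it in the blob store, return only a pointer."""
    urls: List[str] = []  # replace with however you currently build the 22k-item list
    key = f"url-lists/{uuid.uuid4()}.json"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=json.dumps(urls).encode())
    return key  # the workflow history only ever stores this small string


@activity.defn
async def process_url_list(list_key: str) -> None:
    """Resolve the pointer back into the URL list inside a downstream activity."""
    obj = boto3.client("s3").get_object(Bucket=BUCKET, Key=list_key)
    urls = json.loads(obj["Body"].read())
    # fetch each page here and write the results back to the blob store
```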
If you need these items in the workflow itself, for instance to spawn a child workflow per item, then consider writing an activity that accepts a page token (as you would for pagination) and calling it repeatedly from the workflow, with each call returning one subset of the data, until the returned token is None.
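Something along these lines (a rough sketch; the page size, the offset-style token, and how the activity sources its pages are all assumptions, so adapt them to your upstream source):

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional

from temporalio import activity, workflow

PAGE_SIZE = 500  # hypothetical; keep each page well under the payload limit


@dataclass
class UrlPage:
    urls: List[str]            # one page of URLs
    next_token: Optional[str]  # None once the last page has been returned


@activity.defn
async def fetch_url_page(token: Optional[str]) -> UrlPage:
    # Assumption: the activity can cheaply re-derive or cache the full list and
    # slice into it with an offset token. A DB or API cursor works the same way.
    all_urls: List[str] = []  # replace with however you build the 22k-item list
    offset = int(token) if token else 0
    page = all_urls[offset : offset + PAGE_SIZE]
    next_offset = offset + PAGE_SIZE
    return UrlPage(
        urls=page,
        next_token=str(next_offset) if next_offset < len(all_urls) else None,
    )


@workflow.defn
class EtlWorkflow:
    @workflow.run
    async def run(self) -> None:
        token: Optional[str] = None
        while True:
            page: UrlPage = await workflow.execute_activity(
                fetch_url_page,
                token,
                start_to_close_timeout=timedelta(minutes=1),
            )
            for url in page.urls:
                pass  # e.g. start a child workflow or per-URL work here
            if page.next_token is None:
                break
            token = page.next_token
```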
Hey Rob! Thanks for the reply!
I have taken the suggestion and am looking into how to paginate the activity.
Another option that I explored was for the activity to return a reference to an iterator object that will return the links one by one. Would it be possible to have such an object stored in the context?
“The context” being the workflow context? You don’t really put objects into the workflow context directly, and what you’re describing is essentially what I proposed initially, except that instead of returning pages you’d be returning a single item at a time. Personally, I wouldn’t do this, because it would mean 22,000 activity calls: that’s a lot! Not only will it make your workflow slower and its history bigger, it’ll also be pretty expensive. I’d recommend batching instead, for those reasons.
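To put rough numbers on it: with batching (and the same hypothetical blob-store setup as above), one activity call can handle a whole batch of URLs and return only small object keys, so 22k URLs at a batch size of 500 is on the order of 44 activity calls instead of 22,000. A sketch of the batch activity, which you could call from the pagination loop above in place of the per-URL step:

```python
import hashlib
from typing import List

import aiohttp  # assumption: an async HTTP client is available on the worker
import boto3
from temporalio import activity

BUCKET = "my-etl-bucket"  # same hypothetical bucket as above


@activity.defn
async def fetch_page_batch(urls: List[str]) -> List[str]:
    """Fetch every page in one batch and return only small blob-store keys."""
    s3 = boto3.client("s3")
    keys: List[str] = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as resp:
                body = await resp.text()
            key = f"pages/{hashlib.sha256(url.encode()).hexdigest()}.html"
            s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode())
            keys.append(key)
    return keys
```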