I found out about Temporal just a few days ago, so my understanding is still very limited and its concepts have not yet settled in my head.
I wanted to ask if someone could check/enhance my thinking for the following use case, so I can identify the appropriate rabbit holes to go down next.
We are building/refactoring a tool (in Go) that processes images for customers. Each customer starts off with a large-ish list of URLs of images that they need us to process.
Things that should happen as fast as possible after being initiated manually by a customer:
- Download the list that contains the image URLs (up to 100k)
- Download all images (1-2 MB each) in that list and process every one of them (relatively simple processing, e.g. converting to JPG)
- The processed images should be saved in S3 (or similar) with a public URL.
- We call the batch API (of an external service) with a list of the processed image URLs as an argument. The service behind that API then fetches those images, and we are done.
Characteristics of the external batch API (used in the last step above):
- The batch API is asynchronous. Its status can be polled, or a webhook can notify the app/workflow when an API call is finished.
- Here is the twist: only one batch API call can be active at any given time per customer. But it is no problem to split the list of image URLs into multiple lists and make multiple consecutive calls.
- A batch API call can take quite some time to complete (roughly linear in the number of URLs passed as an argument).
My current considerations regarding performance
It seems to make sense to
- decouple downloading and processing of the images from pushing them to the external batch API, and
- download and process the images concurrently / in parallel (while batch API calls still need to happen sequentially).
My idea was
- Start the first batch API call once X (e.g. 100) processed images are available in S3
- While the batch API call is in flight, generate as many additional processed images as possible
- When the batch API call finishes, collect all the images in S3 that have been processed since the previous call started and make another batch API call with that list
- Repeat until finished
My understanding is that I would then end up with a well-optimized process with a single variable X that I can tweak. Can you spot a problem with that approach?
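To check my own understanding, here is a stripped-down sketch of that rolling-batch loop in plain Go. `runBatches` and `callBatch` are hypothetical names; `callBatch` stands in for "submit one batch call and block until the async call completes", and `firstBatchSize` is the X from above:

```go
package main

import "fmt"

// runBatches drains processed-image URLs from `processed` and submits them to
// the external service in strictly sequential batch calls: wait for the first
// X, call, then grab whatever accumulated while the call was in flight.
func runBatches(processed <-chan string, firstBatchSize int, callBatch func([]string)) int {
	batches := 0
	buf := make([]string, 0, firstBatchSize)

	// Wait until the first X processed images are available.
	for url := range processed {
		buf = append(buf, url)
		if len(buf) >= firstBatchSize {
			break
		}
	}
	for len(buf) > 0 {
		callBatch(buf) // only one call in flight at a time
		batches++
		// While the call was in flight, more images were (in the real
		// pipeline) being processed; grab everything available right now.
		// A real implementation should copy buf before reusing it, and
		// wait rather than exit if the producer is merely slow.
		buf = buf[:0]
	drain:
		for {
			select {
			case url, ok := <-processed:
				if !ok {
					break drain
				}
				buf = append(buf, url)
			default:
				break drain
			}
		}
	}
	return batches
}

func main() {
	ch := make(chan string, 250)
	for i := 0; i < 250; i++ {
		ch <- fmt.Sprintf("s3://bucket/img-%d.jpg", i)
	}
	close(ch)
	var sizes []int
	n := runBatches(ch, 100, func(urls []string) { sizes = append(sizes, len(urls)) })
	fmt.Println(n, sizes)
}
```

With 250 images and X = 100 this produces two calls (100, then the remaining 150), which is the behavior I am after; whether this loop belongs in a workflow or an activity is exactly what I am unsure about.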
Regarding the first “task” that (as quickly as possible) spits out X processed images:
- How should I go about downloading and processing the images?
- Ideally, processing can already start after the first image has been downloaded, as downloading is network-bound while processing is bound by CPU/memory.
- Additionally, processing the images should not be very concurrent (basically no more concurrent than the number of CPU cores available).
- But I see no problem in continuing to download the original images while processing is going on?
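This is roughly what I have in mind for that download+process stage, sketched with plain goroutines and channels (no Temporal yet). `downloadAndProcess`, `download`, and `process` are placeholder names of mine; the point is many download workers feeding a processing pool capped at `runtime.NumCPU()`:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// downloadAndProcess downloads images with many concurrent workers (network-
// bound) and processes them with at most runtime.NumCPU() workers (CPU-bound).
// The channel between the two stages is what lets processing start as soon as
// the first download lands, while downloads keep running.
func downloadAndProcess(urls []string, downloadWorkers int,
	download func(string) []byte, process func([]byte) string) []string {

	raw := make(chan []byte, downloadWorkers)

	// Stage 1: download with downloadWorkers concurrent workers.
	jobs := make(chan string)
	var dwg sync.WaitGroup
	for i := 0; i < downloadWorkers; i++ {
		dwg.Add(1)
		go func() {
			defer dwg.Done()
			for u := range jobs {
				raw <- download(u)
			}
		}()
	}
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
		dwg.Wait()
		close(raw)
	}()

	// Stage 2: process with no more workers than CPU cores.
	out := make(chan string, len(urls))
	var pwg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		pwg.Add(1)
		go func() {
			defer pwg.Done()
			for b := range raw {
				out <- process(b)
			}
		}()
	}
	pwg.Wait()
	close(out)

	var results []string
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	urls := []string{"a", "b", "c", "d", "e"}
	got := downloadAndProcess(urls, 3,
		func(u string) []byte { return []byte(u) },
		func(b []byte) string { return string(b) + ".jpg" },
	)
	fmt.Println(len(got))
}
```

What I cannot judge yet is how this maps onto Temporal: whether each download/process should be its own activity, or whether one activity should own a whole chunk of this pipeline.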
Regarding the batch API
- How should I think about asynchronously kicking off a batch API call while the download+process task keeps humming along?
- From what I have read so far, the iterator workflow pattern and/or continue-as-new might be part of the solution for those rolling/sequential batch API calls?
I would appreciate any feedback in general on the approach as well.
Thank you for reading all that. I very much appreciate your help.