Use case: Image processing pipeline (with parallel and sequential parts) – Help wanted to check thinking

Hi there,

I found out about Temporal just a few days ago, so my understanding is still very limited and its concepts have not yet settled in my head.
I wanted to ask if someone could check/enhance my thinking for the following use case, so I can identify the appropriate rabbit holes to go down next :slightly_smiling_face:.

Use case

We are building/refactoring a tool (using golang) that processes images for customers. Each customer starts off with a large-ish list of URLs of images that she/he needs us to process.

Things that should happen as fast as possible after being initiated manually by a customer (a rough sketch of the data flow follows this list):

  1. Download the list that contains the image URLs (up to 100k)
  2. Download all images (1-2 MB each) in that list and process every one of them (relatively simple processing, e.g. converting to JPG)
  3. Save the processed images to S3 (or elsewhere) under a public URL.
  4. Call the batch API (of an external service) with the list of processed image URLs as an argument. The service behind that API then fetches those images, and we are done.
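
In Go terms, I picture the data flow roughly like this (the names are only for illustration):

```go
package pipeline

// Pipeline captures the data flow I have in mind; everything here is a sketch.
type Pipeline interface {
	// DownloadList fetches the customer's list of image URLs (up to 100k).
	DownloadList(listURL string) ([]string, error)
	// ProcessImage downloads one original (1-2 MB), converts it (e.g. to JPG),
	// uploads the result to S3 and returns the public URL.
	ProcessImage(imageURL string) (publicURL string, err error)
	// SubmitBatch hands a chunk of processed-image URLs to the external batch
	// API and returns an ID that can be checked for completion.
	SubmitBatch(publicURLs []string) (batchID string, err error)
	// PollBatch reports whether the given batch call has finished.
	PollBatch(batchID string) (done bool, err error)
}
```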

Characteristics of the external batch API (used in step 4 above)

  • The batch API is asynchronous. Status can be polled, or a webhook can be used to notify the app/workflow when an API call is finished.
  • :face_with_monocle: Here is the twist: only one batch API call can be active at any given time for each customer. But it is no problem to split the list of image URLs into multiple lists and use multiple consecutive calls.
  • A batch API call can take quite some time to complete (the duration increases roughly linearly with the number of URLs passed in).

My current considerations regarding performance

It seems to make sense to

  • decouple downloading and processing the images from pushing them to the external batch API, and
  • download and process the images concurrently/in parallel (while the batch API calls still need to happen sequentially).

My idea was

  1. Start the first batch API call after X (e.g. 100) processed images are available in S3
  2. While the batch API call is in flight, generate as many additional processed images as possible
  3. When the batch API call is finished, take all images in S3 that have been processed since the previous call started and make another batch API call with that list.
  4. Repeat until finished

My understanding is that I would then end up with a well-optimized process that has a single variable X that I can tweak. Can you spot a problem with that approach?

Questions

Regarding the first "task" that (as quickly as possible) spits out X processed images (a rough sketch of what I have in mind follows this list).

  • How should I go about downloading and processing the images?
  • Ideally, processing can already start after the first image has been downloaded, as downloading is mostly network bound while processing is bound by CPU/memory.
  • Additionally, processing the images should happen with limited concurrency (basically no more concurrent workers than available CPU cores).
  • But I see no problem in continuing to download the original images while processing is going on?
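
To make the bullets above concrete, here is roughly what I picture inside one unit of work. downloadImage, convertToJPG and uploadToS3 are placeholders, and the pool sizes are guesses:

```go
package pipeline

import (
	"context"
	"runtime"
	"sync"
)

// Placeholders for the actual I/O and image work.
func downloadImage(ctx context.Context, url string) ([]byte, error) { return nil, nil }
func convertToJPG(raw []byte) ([]byte, error)                       { return nil, nil }
func uploadToS3(ctx context.Context, jpg []byte) (string, error)    { return "", nil }

// processImages downloads, converts and uploads a set of URLs. Downloads get
// more goroutines because they are network bound; conversion is capped at the
// number of CPU cores. Failed images are simply skipped in this sketch.
func processImages(ctx context.Context, urls []string) []string {
	downloaded := make(chan []byte, runtime.NumCPU())

	// Download stage: a bounded number of goroutines feed the converters.
	go func() {
		defer close(downloaded)
		var wg sync.WaitGroup
		sem := make(chan struct{}, 4*runtime.NumCPU()) // downloads are cheap on CPU
		for _, u := range urls {
			u := u
			wg.Add(1)
			sem <- struct{}{}
			go func() {
				defer wg.Done()
				defer func() { <-sem }()
				if raw, err := downloadImage(ctx, u); err == nil {
					downloaded <- raw
				}
			}()
		}
		wg.Wait()
	}()

	// Processing stage: bounded by CPU cores so conversion never oversubscribes.
	var (
		mu         sync.Mutex
		publicURLs []string
		wg         sync.WaitGroup
	)
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for raw := range downloaded {
				jpg, err := convertToJPG(raw)
				if err != nil {
					continue
				}
				if url, err := uploadToS3(ctx, jpg); err == nil {
					mu.Lock()
					publicURLs = append(publicURLs, url)
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return publicURLs
}
```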

Regarding the batch API

  • How should I think about asynchronously kicking off a batch API call while the download+process task keeps humming along?
  • From what I have read so far, the iterator workflow pattern and/or continue-as-new might be part of the solution for those rolling/sequential batch API calls?

I would appreciate any feedback in general on the approach as well.

Thank you for reading all that. I very much appreciate your help.

Cheers
Dom

A few thoughts about your use case. Temporal is not built to pass large payloads as activity inputs and outputs. So you want to use the local host disk to cache the URLs and use the session feature to run activities on a specific host.
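
On the worker side, enabling the session feature looks roughly like this (assuming a recent Go SDK; the task queue name and the concurrency limit are placeholders):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the Temporal server (default local address).
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// Enable the session feature so activities can be pinned to this host.
	w := worker.New(c, "image-pipeline", worker.Options{
		EnableSessionWorker: true,
		// How many sessions this host is willing to run at the same time.
		MaxConcurrentSessionExecutionSize: 2,
	})

	// Register the workflow and activities here (omitted).

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited", err)
	}
}
```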

Assuming that you want to process images for a single customer on a single host (the design can easily be extended to parallel processing), I would use the following workflow design for your use case (a rough Go sketch of the whole workflow follows below):

  1. Create a session to ensure all activities run on the same host. The session feature also lets you specify parallelism, i.e. how many sessions are allowed to run simultaneously on each host.
  2. An activity downloads the list of URLs to a host and returns the file name.
  3. A separate goroutine executes the following loop
    1. An activity downloads images from that list. The activity can use more than one goroutine to perform the downloads. It also processes each image and saves the result to S3. The activity heartbeats periodically so that a failure is detected promptly.
    2. After 100 (the batch size) successfully uploaded images, the activity completes, returning the list of image names. If the list is too large, a filename that contains the list can be returned instead. The list is then pushed to the channel as a single batch entry.
    3. The channel is closed after the last batch.
  4. In the main workflow function, a loop reads from that channel until it is closed and, for each batch entry:
    1. Invokes an activity that makes the batch API call.
    2. Invokes an activity that waits for the batch to complete by polling the external API (a rough sketch of such a polling activity follows this list).
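
For step 4.2, the polling activity can heartbeat while it waits so that a worker failure is noticed quickly. A rough sketch, where the external status call is an assumption:

```go
package pipeline

import (
	"context"
	"errors"
	"time"

	"go.temporal.io/sdk/activity"
)

// externalBatchStatus is a placeholder for however the external service
// reports progress; the call and its shape are assumptions.
func externalBatchStatus(ctx context.Context, batchID string) (done bool, err error) {
	return false, errors.New("not implemented")
}

// WaitForBatchCompletion polls the external batch API until the batch is done,
// heartbeating so Temporal notices promptly if the worker dies.
func WaitForBatchCompletion(ctx context.Context, batchID string) error {
	for {
		done, err := externalBatchStatus(ctx, batchID)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		activity.RecordHeartbeat(ctx, batchID)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(30 * time.Second): // poll interval is illustrative
		}
	}
}
```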

If the session fails, the whole process is repeated on a different host, starting from the last unprocessed batch. See the fileprocessing sample, which demonstrates the session feature.
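
Putting the pieces together, the main workflow function could look roughly like the sketch below. All activity names, timeouts and the batch struct are illustrative, not a drop-in implementation:

```go
package pipeline

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// batchSize is the tunable X from the question.
const batchSize = 100

// BatchResult is what one ProcessNextBatch activity run hands back to the workflow.
type BatchResult struct {
	ImageURLs []string // public S3 URLs of the processed images in this batch
	Done      bool     // true once the URL list is exhausted
}

// Activity stubs so the sketch compiles; the real implementations run on the
// worker. ProcessNextBatch should heartbeat and upload to S3 as described
// above, and WaitForBatchCompletion is the polling activity sketched earlier.
func DownloadURLList(ctx context.Context, listURL string) (string, error) { return "", nil }
func ProcessNextBatch(ctx context.Context, listFile string, offset int) (BatchResult, error) {
	return BatchResult{}, nil
}
func StartBatchCall(ctx context.Context, imageURLs []string) (string, error) { return "", nil }
func WaitForBatchCompletion(ctx context.Context, batchID string) error       { return nil }

func ImagePipelineWorkflow(ctx workflow.Context, listURL string) error {
	// Timeouts are illustrative only.
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Hour,
		HeartbeatTimeout:    time.Minute,
	})

	// 1. Pin the disk-dependent activities to one host via a session.
	sessionCtx, err := workflow.CreateSession(ctx, &workflow.SessionOptions{
		CreationTimeout:  5 * time.Minute,
		ExecutionTimeout: 12 * time.Hour,
	})
	if err != nil {
		return err
	}
	defer workflow.CompleteSession(sessionCtx)

	// 2. Download the customer's URL list to the session host's disk.
	var listFile string
	if err := workflow.ExecuteActivity(sessionCtx, DownloadURLList, listURL).Get(ctx, &listFile); err != nil {
		return err
	}

	// 3. Producer goroutine: download+process images batch by batch and push
	// each batch of S3 URLs into a workflow channel. The buffer lets the
	// producer run ahead while the (strictly sequential) batch API calls are
	// in flight.
	batches := workflow.NewBufferedChannel(ctx, 1000)
	var producerErr error
	workflow.Go(ctx, func(gctx workflow.Context) {
		defer batches.Close()
		for offset := 0; ; offset += batchSize {
			var result BatchResult
			if err := workflow.ExecuteActivity(sessionCtx, ProcessNextBatch, listFile, offset).Get(gctx, &result); err != nil {
				producerErr = err
				return
			}
			if len(result.ImageURLs) > 0 {
				batches.Send(gctx, result.ImageURLs)
			}
			if result.Done {
				return
			}
		}
	})

	// 4. Consumer: one external batch API call at a time, each followed by polling.
	var imageURLs []string
	for batches.Receive(ctx, &imageURLs) {
		var batchID string
		if err := workflow.ExecuteActivity(ctx, StartBatchCall, imageURLs).Get(ctx, &batchID); err != nil {
			return err
		}
		if err := workflow.ExecuteActivity(ctx, WaitForBatchCompletion, batchID).Get(ctx, nil); err != nil {
			return err
		}
	}

	// A production version would recreate the session and resume from the last
	// unprocessed batch instead of failing the workflow here.
	return producerErr
}
```

The buffered channel is what lets the producer keep downloading and processing while a batch API call is in flight, while the consumer loop guarantees that only one external call is active at a time.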

If the workflow history becomes too large, the workflow should call continue-as-new. The new run should try to recreate a session on the same host; if session recreation fails, it can continue the whole process on a different host.
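
A minimal continue-as-new sketch, assuming the workflow carries its progress as an argument (the state type and the per-run cap are made up):

```go
package pipeline

import "go.temporal.io/sdk/workflow"

// PipelineProgress is whatever state the next run needs to pick up where the
// current run left off; the fields are illustrative.
type PipelineProgress struct {
	ListURL    string // original URL list (re-downloaded if the host changes)
	NextOffset int    // index of the first image not yet submitted to the batch API
}

// maxBatchesPerRun caps how much event history one run accumulates.
const maxBatchesPerRun = 500

func ImagePipelineWorkflowWithCAN(ctx workflow.Context, progress PipelineProgress) error {
	for batchesThisRun := 0; ; batchesThisRun++ {
		finished, err := runOneBatchCycle(ctx, &progress)
		if err != nil {
			return err
		}
		if finished {
			return nil
		}
		if batchesThisRun >= maxBatchesPerRun {
			// History is getting large: hand the remaining work to a fresh run.
			// The new run recreates a session, ideally on the same host.
			return workflow.NewContinueAsNewError(ctx, ImagePipelineWorkflowWithCAN, progress)
		}
	}
}

// runOneBatchCycle stands in for one download/process + batch API cycle from
// the main sketch; it advances progress.NextOffset and reports completion.
func runOneBatchCycle(ctx workflow.Context, p *PipelineProgress) (bool, error) {
	// ... ExecuteActivity calls as in the workflow sketch above ...
	return true, nil
}
```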
