Workflow with different machine types executing the same activities

Hi,

We’re in the process of moving most of our workflows to Temporal.

In the past we used AWS Batch for long-running video rendering jobs (a full export can take over an hour) and AWS Lambda for short-running video rendering jobs (5s of video per job).
We moved the export jobs to Temporal but kept AWS Batch as the “runner” that actually does the exporting (the main activity): one Temporal activity starts the Batch job and another waits for it to finish. We didn’t do the rendering on the Temporal worker itself because we weren’t sure whether mixing heavy render jobs with lightweight activities like sending an email would be smart, and we hadn’t fully worked out a good scaling strategy.
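For context, the “waits for it to finish” activity is essentially a polling loop against the Batch API. A minimal sketch of that loop, assuming a generic `get_job_status` callable and placeholder status names (in the real activity this would wrap something like boto3’s describe_jobs; none of these names come from Temporal or AWS):

```python
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}

def wait_for_job(get_job_status, job_id, poll_interval=5.0, max_wait=7200.0, sleep=time.sleep):
    """Poll until the external job reaches a terminal status, or give up.

    get_job_status: callable(job_id) -> status string.
    sleep is injectable so the loop can be tested without real delays.
    """
    waited = 0.0
    while True:
        status = get_job_status(job_id)
        if status in TERMINAL_STATUSES:
            return status
        if waited >= max_wait:
            raise TimeoutError(f"job {job_id} still {status} after {max_wait}s")
        sleep(poll_interval)
        waited += poll_interval
```

In a real Temporal setup you would also heartbeat from inside this loop so a stuck poller can be detected and retried.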
We considered using separate queues (light/heavy) for this, but decided against it at the time because we had used Batch before and this seemed simpler.

The other workflow generates a lot of small video chunks in real time; we want this to be as fast as possible, which is why we picked Lambda. The user is waiting for the job to finish while the video loads. Right now we render a chunk in about 4–8s, depending on whether the Lambda container is “warm”.
We’d also like to move this workflow to Temporal (from Step Functions). Using the same trick as above, I’d just have Temporal call a Lambda function and wait for it to finish.

But thinking about this got me interested in other solutions than “just send it to lambda”.

I think the best “full Temporal” approach would be getting rid of Lambda and Batch altogether, but my main fear is auto-scaling. The job volume is very irregular: if 10 people export a full video at the same time it gets very busy (we’re still small), and bigger spikes than that are possible and need to be handled. Downscaling to near zero would also be ideal to keep costs down. For full export jobs, taking a minute to spin up a new machine isn’t the end of the world, but for the small render jobs we really need to respond within 10s.
If no machines are available, waiting ~1 minute for a scale-up would give a very bad experience.

An alternative idea I had was a combination of always-on fast render machines, plus a fallback to Lambda/Batch when those machines aren’t available.
I thought this could maybe be achieved by running two different types of workers: one type would execute the job itself, the other would pass it to Lambda/Batch. The output would be the same.

Any advice would be appreciated. Basically, I’d like to experiment with dedicated render machines, but given the speed of autoscaling I might need to fall back to something like Lambda when they aren’t available.

Thanks,
Michiel

I thought this could maybe be achieved by running two different types of workers: one type would execute the job itself, the other would pass it to Lambda/Batch. The output would be the same.

You could achieve this by running the two sets of workers on different task queues. When scheduling an activity on the self-hosted workers’ queue, specify a short ScheduleToStart timeout, say 5 seconds. This causes an activity timeout if no worker picks the task up within 5 seconds. When the workflow receives the ScheduleToStart timeout, it dispatches the activity to the other task queue, whose workers call the Lambda.
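The control flow of that fallback looks roughly like this. This is a plain-Python sketch standing in for workflow code, not real Temporal SDK calls: the queue names, the `dispatch` callable, and the `ScheduleToStartTimeout` exception are all placeholders for the SDK’s execute-activity call and its schedule-to-start timeout error.

```python
class ScheduleToStartTimeout(Exception):
    """Stand-in for the SDK error raised when no worker claims the task in time."""

SCHEDULE_TO_START_TIMEOUT = 5.0  # seconds a task may sit unclaimed on the fast queue

def render(dispatch, job):
    """Try the always-on render workers first; fall back to the Lambda-backed queue.

    dispatch: callable(task_queue, job, schedule_to_start_timeout=None) that
    raises ScheduleToStartTimeout if no worker on that queue picks the task
    up within the timeout. In a real workflow this would be two
    execute-activity calls targeting different task queues.
    """
    try:
        return dispatch("render-workers", job,
                        schedule_to_start_timeout=SCHEDULE_TO_START_TIMEOUT)
    except ScheduleToStartTimeout:
        # No self-hosted worker was free in time; hand the job to the
        # workers that forward it to Lambda instead.
        return dispatch("lambda-forwarders", job)
```

One caveat worth checking in the docs: retry policies can retry an activity on the same queue after a ScheduleToStart timeout, so you would typically disable retries (or catch the timeout explicitly) for the first dispatch so the workflow gets to make the fallback decision itself.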
