Our first Temporal Spooky Story™ comes from Gabriel Harris-Rouquette, Senior Software Engineer at Merit (aka "Lead Awesome-ifier of Kafka"), about the terror of trying to manage long-running, unpredictable workloads with Kafka:
There once was a workflow implemented with Kafka, but one consumer had an unbounded amount of work to do for a single message. Instead of pre-calculating and batching the work, it was decided to use consumer pauses, per the javadoc:
For use cases where message processing time varies unpredictably, neither of these options may be sufficient. The recommended way to handle these cases is to move message processing to another thread, which allows the consumer to continue calling `poll` while the processor is still working. Some care must be taken to ensure that committed offsets do not get ahead of the actual position. Typically, you must disable automatic commits and manually commit processed offsets for records only after the thread has finished handling them (depending on the delivery semantics you need). Note also that you will need to `pause` the partition so that no new records are received from `poll` until after the thread has finished handling those previously returned.
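To make that concrete, here is a minimal sketch of the pause-and-poll pattern the javadoc describes, assuming a hypothetical `jobs` topic with string records (error handling and a rebalance listener are omitted for brevity):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CompletableFuture;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PausingConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "long-running-jobs");
        props.put("enable.auto.commit", "false"); // commit manually, only after work completes
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("jobs"));
            CompletableFuture<Void> inFlight = null;

            while (true) {
                // While paused, poll() returns no records but still heartbeats,
                // which is the only thing preventing a consumer group rebalance.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

                if (!records.isEmpty()) {
                    // Hand the records to another thread, then pause so the
                    // next poll() doesn't fetch more work.
                    consumer.pause(consumer.assignment());
                    inFlight = CompletableFuture.runAsync(() -> process(records));
                }

                if (inFlight != null && inFlight.isDone()) {
                    consumer.commitSync(); // offsets advance only after the work is done
                    consumer.resume(consumer.assignment());
                    inFlight = null;
                }
            }
        }
    }

    private static void process(ConsumerRecords<String, String> records) {
        for (ConsumerRecord<String, String> record : records) {
            // ... an unbounded, variable amount of work per record ...
        }
    }
}
```

Note how much of the safety here is implicit: if the processing thread stalls long enough, or a rebalance revokes the paused partitions, the bookkeeping quietly falls apart.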
Anyways, this led to frightening results over time. Pain points overall:
- A large job with a variable, unpredictable processing time per Kafka message had limited visibility into progress and recoverability, and relied on somewhat dangerous operations like pausing a consumer to avoid consumer group rebalancing.
- Visibility into the job was minimal; we could only tell that a consumer had started working on a message by observing its side effects.
- Failure to process the message meant restarting the whole job from the beginning: if it failed halfway through processing a million objects, it restarted at object 0.
- It was difficult to set up overlap prevention or "duplicate processing" detection: since the messages were just queued, we couldn't say "hey, we're already doing this work," so potentially duplicate long-running jobs would just get queued, even though an existing running job would already process the new side effects partway through.
Since migrating to Temporal, we've been able to batch process in a durable fashion, without having to implement the plumbing of writing our own state machines.
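For a sense of what that looks like, here is a minimal sketch of a durable batch workflow using Temporal's Java SDK; the interfaces, page size, and timeouts below are illustrative assumptions, not Merit's actual implementation:

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@WorkflowInterface
interface BatchWorkflow {
    @WorkflowMethod
    void processBatch(String batchId, int totalObjects);
}

@ActivityInterface
interface BatchActivities {
    // Processes one page of the batch; heartbeating inside the activity lets
    // a retry resume mid-page instead of redoing completed work.
    void processPage(String batchId, int offset, int pageSize);
}

class BatchWorkflowImpl implements BatchWorkflow {
    private static final int PAGE_SIZE = 1_000; // illustrative

    private final BatchActivities activities = Workflow.newActivityStub(
            BatchActivities.class,
            ActivityOptions.newBuilder()
                    .setStartToCloseTimeout(Duration.ofMinutes(30))
                    .setHeartbeatTimeout(Duration.ofMinutes(1))
                    .setRetryOptions(RetryOptions.newBuilder()
                            .setMaximumAttempts(5)
                            .build())
                    .build());

    @Override
    public void processBatch(String batchId, int totalObjects) {
        // Each completed activity is recorded in the workflow's event history,
        // so a crash resumes at the next page rather than back at object 0.
        for (int offset = 0; offset < totalObjects; offset += PAGE_SIZE) {
            activities.processPage(batchId, offset,
                    Math.min(PAGE_SIZE, totalObjects - offset));
        }
    }
}
```

Starting the workflow with a deterministic workflow ID (the batch ID, say) also addresses the duplicate-job pain point above: Temporal refuses to start a second workflow with the same ID while one is already running.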
Thanks so much, @gabizou! I for one am now truly terrified about all the things that can go wrong when attempting to manage batches without Temporal.
If you or someone you know is interested in learning more about using Temporal for batch processing workflows, here are some resources you can check out!