Share your Temporal Spooky Stories! 🎃

October marks the launch of Spooky Stories, a series of terrifying tales from life before Temporal! :scream: (You can read more about the program in the linked announcement!)

In your pre-Temporal life, have you faced…

  • Lost data or transactions (or worse, large sums of money) due to outages or downtime?
  • Lost orders, double-charging orders, or other online/mobile order horror stories?
  • Brittle and complex legacy apps that are Jenga Towers of horror – too risky to change?
  • Failed batch jobs that needed a safe way to run and finish?
  • Another problem sure to inspire fear and terror in others?

If so, please contribute it here, and/or join the party in #contributing on Temporal Community Slack!

Our first Temporal Spooky Story™ comes from Gabriel Harris-Rouquette, Senior Software Engineer at Merit (aka “Lead Awesome-ifier of Kafka” :wink:), about the terror of trying to manage long-running, unpredictable workloads with Kafka: :scream:


There once was a workflow implemented with Kafka, but a specific consumer of a message had an unbounded amount of work. Instead of pre-calculating and batching the work, it was decided to use consumer pauses, per the javadoc:

For use cases where message processing time varies unpredictably, neither of these options may be sufficient. The recommended way to handle these cases is to move message processing to another thread, which allows the consumer to continue calling poll while the processor is still working. Some care must be taken to ensure that committed offsets do not get ahead of the actual position. Typically, you must disable automatic commits and manually commit processed offsets for records only after the thread has finished handling them (depending on the delivery semantics you need). Note also that you will need to pause the partition so that no new records are received from poll until after thread has finished handling those previously returned.

Anyways, this led to frightening results over time. Pain points overall:

  • A large job that took a variable amount of time to process a single Kafka message in an asynchronous workflow had limited visibility into progress and limited recoverability, and it performed somewhat dangerous operations, like pausing a consumer to avoid consumer group rebalancing.
  • Visibility into the job was minimal: we could only tell that a consumer had started working on the message from its side effects.
  • Failure to process the message meant restarting the whole job from the beginning: if it failed halfway through processing a million objects, it’d restart at 0.
  • Difficulty setting up overlap prevention (avoiding “duplicate processing”): since the messages were just queued, we couldn’t say “hey, we’re already doing this work”, so potentially duplicated long-running jobs would just get queued, even though an existing job that was halfway through would already process the new side effects.

Since we’ve migrated to Temporal, we’ve been able to batch process in a durable fashion, without having to implement the plumbing of writing our own state machines.
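For anyone wondering what "writing our own state machines" can look like in practice, here is a rough sketch of hand-rolled checkpointing for a batch job. It is not Gabriel's code and not Temporal's API; `CheckpointStore`, `run_batch`, and the JSON-file layout are all hypothetical names made up for illustration. It resumes from the last saved offset instead of restarting at 0, and skips a job ID that has already completed.

```python
import json
from pathlib import Path


class CheckpointStore:
    """Hypothetical file-backed store: one JSON file of progress per job ID."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, job_id):
        return self.directory / f"{job_id}.json"

    def load(self, job_id):
        path = self._path(job_id)
        if not path.exists():
            return None
        return json.loads(path.read_text())

    def save(self, job_id, state):
        self._path(job_id).write_text(json.dumps(state))


def run_batch(store, job_id, items, handle, checkpoint_every=100):
    """Process items, resuming from the last checkpoint saved for job_id.

    Returns "skipped" if this job ID already completed, else "completed".
    Items handled after the last checkpoint are handled again on resume,
    so `handle` must tolerate being called twice for the same item.
    """
    state = store.load(job_id) or {"next_index": 0, "done": False}
    if state["done"]:
        # Crude duplicate prevention: only catches jobs that already finished,
        # not a second copy running at the same time.
        return "skipped"
    for index in range(state["next_index"], len(items)):
        handle(items[index])
        state["next_index"] = index + 1
        if state["next_index"] % checkpoint_every == 0:
            store.save(job_id, state)
    state["done"] = True
    store.save(job_id, state)
    return "completed"
```

Even this small version leaves gaps (concurrent duplicates, partial writes, visibility into progress), which is the kind of plumbing the story says the team no longer has to maintain.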


Thanks so much, @gabizou ! I for one am now truly terrified about all the things that can go wrong when attempting to manage batches without Temporal. :sweat_smile:


If you or someone you know is interested in learning more about using Temporal for batch processing workflows, here are some resources you can check out!


Another Spooky Story for you all! :smiley:

Please join us in a week, Wednesday, October 9 @ 11:00am Pacific / 2:00pm Eastern (what is that in my timezone?) for… Chilling Temporal Anti-Patterns.

During this session, our Staff Solutions Architect @Josh_Smith will regale us with tales of common mistakes that we see from folks with prior distributed systems experience as they’re making their initial foray into the world of Temporal. Join this session to learn more about Josh’s “Top 10” anti-patterns…and how to choose a less-perilous path for yourself!

Sign up here: Webinar | Chilling Temporal Anti-Patterns

You know what’s truly spooky? Web crawling! :spider: :spider_web:
You know what’s even spookier? Web crawling, at massive scale. :scream:

Here’s a great article from Java Developer Advocate Steve Poole about how he introduced Temporal to his scraper that crawls more than 14M open source components at Maven Central to check for inconsistencies in Java API versions.

SPOILER ALERT: The Temporal-ized code ends up more reliable, recoverable, and less complex. :sunglasses:

Today’s terrifying tale involves the ABJECT HORRORS of sweep network operations, inspired by a blog post from @selljamhere, Co-founder and Principal Consultant of Apartment 304 (and interpreted in a spooky manner by my partner in crime, Mango :slight_smile: )


Money Transfers on the Edge of Town

The bank on the edge of town was haunted - or at least, that’s what folks often said.

Most days, the bank’s software worked just fine. But on nights when the moon was full, and the wind rustled in the leaves, and everything felt not-quite-right, things sometimes went very, very wrong.

On one chilly fall evening, the team was wrapping up after a busy day. Susan, one of the developers on the team, had tickets to see a movie at the theater in town. Her final task before leaving was to prepare the account balances file, which had to be uploaded so that transactions could be processed overnight.

When the files were ready, Susan connected to the SFTP server and started the upload.

“Need any help?” one of her colleagues asked.

“I’m all set,” said Susan. “Just waiting on this file to upload, it shouldn’t take more than a few minutes.”

As the last of her colleagues filed out the door, a warning popped up on Susan’s monitor.

UPLOAD FAILED

“Weird,” Susan muttered to herself. She double-checked the selected file, and tried again.

UPLOAD FAILED

Her computer was connected to the network, the server was online, and she’d never had an issue with this process before. Susan checked the time and realized with frustration that she’d miss the previews, but might still be able to make it in time for the movie’s opening credits.

UPLOAD FAILED

A few miles away, previews had ended and the movie was just starting. Although the theater was relatively crowded, Susan’s seat sat empty.

UPLOAD FAILED

A light breeze had picked up outside, rustling dry autumn leaves. Moonlight spilled through the office window.

UPLOAD FAILED

The Real Story

While Susan is fictional, her haunting tale is drawn largely from truth. The story is based on a financial institution that migrated its daily sweep process to Temporal.

The financial institution–let’s call them Money Co. to keep it simple–was using SFTP file uploads and downloads several times each day to communicate account balances and transfers between accounts with other financial institutions in their network. While this approach met regulatory needs and generally got the job done, a number of issues regularly caused headaches for Money Co. employees, including:

  • Uploads and downloads failed frequently, often requiring human intervention to fix.
  • The SFTP server would get overwhelmed with requests regularly, meaning Money Co. needed to pause operations until the server recovered.
  • Queries related to these operations could fail if the database was under heavy load.

To make matters worse, Money Co.’s own internal systems were not resilient to delays introduced by the above issues, and the engineering team often found themselves scrambling to fix issues in time to meet strict deadlines.
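The failure modes listed above (failed transfers, an overloaded server, queries failing under load) are often transient, and the usual remedy is to retry with an increasing delay instead of waiting for a human to step in. Below is a minimal sketch of that idea in plain Python. It is not Money Co.'s code and not Temporal's API; `TransientError`, `retry_with_backoff`, and the default values are assumptions chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a busy server)."""


def retry_with_backoff(operation, max_attempts=5, initial_delay=1.0,
                       backoff=2.0, max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait between attempts grows by `backoff` each time, capped at
    `max_delay`, so an overwhelmed server gets room to recover.
    Errors other than TransientError are not retried.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay)
            delay = min(delay * backoff, max_delay)
```

A caller would wrap the upload step, for example `retry_with_backoff(upload_balances_file)`, where `upload_balances_file` is a hypothetical function that raises `TransientError` when the server is busy. The `sleep` parameter exists only so the waiting can be swapped out in tests.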

For a more detailed exploration of this story, please see: Increasing Resiliency in a Banking Sweep System: Incremental migration to a Temporal Workflow

Migrating these operations to Temporal allowed the Money Co. team to take advantage of Temporal’s event history to help investigate the cause of any errors going forward. The migration itself was rolled out incrementally, allowing for intentional changes to roll out over time. Ultimately, migrating to Temporal delivered a positive impact immediately and ensured a more stable and resilient foundation for the future.


Next up, we have a Spooky Story from Zaq Wiedmann, Sr. Infrastructure Engineer II @ Khan Academy! Another spooky work of fiction that draws from real-life experience. :slight_smile:


The Ghosts of Systems Past

In the dim glow of flickering monitors, the development team huddled around a table littered with empty energy drink cans and discarded cables. The air was heavy with the weight of yet another failed deployment, and the hum of servers struggling under the weight of broken code filled the room.

“Another service just crashed,” Alex muttered, scrolling through an endless stream of logs. “Fourth one this week. I swear this system is cursed.”

“Check the Kafka topics?” Sam asked, massaging their temple as the headache from staring at the screen for hours grew unbearable. “There’s probably a message jammed in there again.”

Alex groaned. “Sifting through those logs is like playing hide-and-seek with ghosts. Every lead goes cold.”

The team had inherited a system composed of dozens of microservices, each teetering on the edge of collapse. When one service fell, it would pull others down in a catastrophic cascade. Logs were scattered across different environments, tracing was non-existent, and debugging felt like trying to decipher the whispers of a haunted house.

“I heard stories about systems like this,” Maya, the newest member, whispered. “They say some developers never escape… that they get trapped, replaying the same errors over and over.”

“That’s just a myth,” Sam scoffed, but there was a tremor in their voice.

Everyone knew they were fighting something bigger than just bad code—a kind of malevolent force that seemed to conspire against them at every turn.

Late one night, when the office was nearly empty and the only sound was the hum of overworked machines, a cold draft swept through the room. The monitors flickered. On the far side of the table, something strange glimmered under the desk light: a dusty, old piece of hardware—a Ouija board, but instead of letters, the markings were arcane symbols of workflows and retries.

“What’s that?” asked Alex, squinting at it.

Maya hesitated. “I think it’s… a Ouija board.”

Sam snorted. “Come on, really? We’re not that desperate.”

But Maya leaned in, running her fingers over the strange runes. “It’s said that this can guide us through the chaos. It’s supposed to reveal the hidden flows of our system, give us the answers we can’t see.”

The team exchanged nervous glances. Yet in their exhaustion and frustration, they were willing to try anything.

They gathered around the board, their hands hovering over the planchette. “How do we fix the system?” Alex whispered.

The planchette moved on its own, gliding over the surface in slow, deliberate motions. Diagrams of their architecture began appearing on the screen, as if drawn by an unseen hand. They watched in awe as their chaotic microservices were reorganized into coherent workflows, managed by something called Temporal.

With each movement, the Ouija board taught them more. It revealed secrets of retries, failure handling, and how to maintain long-running processes with ease. The tangled mess of their system began to unravel, and for the first time in months, hope flickered in their hearts.

“This… this is incredible,” Sam whispered, watching as the board demonstrated how they could track failures, manage state, and regain control over their system.

It seemed the nightmare was finally over. The team dove into implementing the changes, guided by the spectral presence of the Temporal Ouija board. Services that once crashed silently were now reporting clear errors. Dependencies were no longer traps waiting to be sprung.

But just as they began to relax, strange things started happening. Workflows that had once been reliable began to misbehave. Data appeared to change on its own, and errors that made no sense emerged from nowhere.

“What’s happening?” Maya asked, her voice trembling.

On the screen, a single message blinked ominously: Non-Determinism Error.

The room grew colder.

Alex’s face turned pale. “This… this is what the legends warned about.”

Suddenly, the planchette on the Ouija board moved on its own again, sliding across the strange symbols with an urgent, frantic energy. The monitors flickered as if trying to speak, and then, a chilling message scrawled itself across the screens:

“You have disturbed the equilibrium. Workflows must be deterministic, or the system will descend into chaos.”

“But we didn’t know!” Sam cried out.

The Ouija board remained silent for a moment before a final word appeared, barely legible on the screen:

“Beware… the boogeyman of non-determinism. He has awoken.”

The lights flickered violently before plunging the room into darkness. The screens went black, and the hum of servers fell silent.

Somewhere in the darkness, the ghost of their system stirred, plotting.


If you would like to avoid the tragic fates of Sam, Alex, and Maya… make sure to check out Workflow Determinism for helpful determinism tips and tricks!
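For the curious, here is a toy model of why the boogeyman shows up. It is a simplified sketch, not Temporal's implementation or SDK API: `Recorder`, `NonDeterminismError`, and both workflow functions are hypothetical stand-ins. The idea it illustrates is that workflow code is re-run against a recorded history, and if the re-run issues different commands than the history contains, the replay fails.

```python
class NonDeterminismError(Exception):
    """Raised when replayed code diverges from the recorded history."""


class Recorder:
    """Toy event history: records commands on a first run, checks them on replay."""

    def __init__(self, history=None):
        self.replaying = history is not None
        self.history = list(history) if history is not None else []
        self.position = 0

    def command(self, name):
        if self.replaying:
            expected = (self.history[self.position]
                        if self.position < len(self.history) else None)
            if expected != name:
                raise NonDeterminismError(
                    f"history has {expected!r} at step {self.position}, "
                    f"but the code issued {name!r}")
        else:
            self.history.append(name)
        self.position += 1


def haunted_workflow(ctx, rng):
    # Branching on a fresh random value: a replay may take the other branch.
    if rng.random() < 0.5:
        ctx.command("charge_card")
    else:
        ctx.command("send_reminder")
    ctx.command("send_receipt")


def deterministic_workflow(ctx, order_total):
    # Branching only on the workflow's input: every replay takes the same path.
    if order_total > 0:
        ctx.command("charge_card")
    else:
        ctx.command("send_reminder")
    ctx.command("send_receipt")
```

Running `haunted_workflow` once with `Recorder()` and then again with `Recorder(first.history)` and a differently seeded `random.Random` can raise `NonDeterminismError`, while `deterministic_workflow` replays cleanly for the same input.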