Chilling Temporal Anti-Patterns 👻

Josh_Smith · October 16, 2024, 5:59pm

(This is a write-up from our recent Spooky Stories webinar: Chilling Temporal Anti-Patterns)

Intro

I was excited to do this talk, because my favorite part of my job is helping people learn how to use Temporal, solve their pain points, avoid perilous pitfalls, and just have fun using Temporal to make being a developer easier.

In this talk, I will use some of my favorite spooky movies to explain Temporal best practices and their corresponding anti-patterns in a fun, top 10 list format. Without further ado…

#10: Wrapping SDKs in Scary Ways

One of the things we observe at Temporal is developers love to wrap libraries (for example, our SDKs) in other libraries, and then wrap those libraries in other libraries. (Hence, the chance for a mummy pun!)

Sometimes, wrapping the Temporal SDKs can be okay. For example, a thin “shim” layer can simplify security practices at a company, make connections to Temporal Cloud easier, or make hooking up to metrics simple for developers. Another useful pattern is when people build a layer on top of Temporal so new developers can write simple workflows really easily (as long as they can also opt to use the SDKs directly). There’s a fantastic Replay talk on this very topic from our friends at Cash App: “Great tech is not enough: building trust to get the most out of Temporal.” Essentially, adding a little bit to the Temporal SDKs to make development easier is awesome!

The anti-pattern is wrapping the Temporal SDK so deeply that you end up hiding important features, or making it really difficult to update the SDK as improvements are made upstream. We’ve worked hard to ensure our SDKs are idiomatic, and they’re open source, so if you have suggestions for improvements we’d love to hear them!

In short: Don’t wrap the SDKs too deeply, to the point that they are hard to upgrade or maintain, or to the point where you end up breaking or hiding useful SDK features. We encourage you to work with us if you want improvements!

#9: Jump Scares: Not Done Yet

Idempotency and Local Activities

One of my favourite series of movies is Scream, and if you watch the Scream movies you might know that there’s always a killer (sometimes two!) and without fail, once the killer’s finally defeated, they come back one more time and jump scare everybody in the audience.

How does this relate to Temporal? Sometimes, people think things are done when they’re not done yet. This is important when it comes to things like idempotency and Local Activities.

When you’re writing a function or a method in any programming language, “idempotency” means when you execute it multiple times it always has the same result, so you can execute it multiple times safely.

Local Activities are a variation of Activities in Temporal which run as part of the Workflow Execution process, and in order to reduce latency, they don’t write their completion to Workflow History until they all have completed.

If you put these two together—a Workflow with a series of Local Activities, and whose Local Activities are not idempotent—this can cause surprise consequences.

Take, for example, a money transfer use case. You have a series of Local Activities that move money, and you are not using idempotency keys to detect when an Activity is called more than once. Suppose your application hits a failure: for example, an API that one of the later Local Activities relies on goes down after an earlier Local Activity has already moved the money. Your Local Activity sequence will keep firing and end up moving far more money than it’s supposed to. The Workflow History will never tell it to stop, because the Local Activity series hasn’t completed in full yet.

For this reason, in general I recommend using regular Activities, not Local Activities. But if you do use Local Activities, know how they work, and always make your Temporal Activities idempotent.
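To make the idempotency key idea concrete, here is a minimal plain-Python sketch. It is not Temporal SDK code: the `Bank` class, the `transfer` method, and the key format are all hypothetical stand-ins for whatever payment API you actually call. The point is only that a repeated call with the same key leaves the balances unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class Bank:
    """Toy stand-in for a payment API (hypothetical, not a real service)."""
    balances: dict[str, int]
    seen_keys: set[str] = field(default_factory=set)

    def transfer(self, src: str, dst: str, amount: int, idempotency_key: str) -> bool:
        """Move money at most once per key; return False for a duplicate call."""
        if idempotency_key in self.seen_keys:
            return False  # this request was already applied: do nothing
        self.balances[src] -= amount
        self.balances[dst] += amount
        self.seen_keys.add(idempotency_key)
        return True


bank = Bank(balances={"A": 100, "B": 0})

# Simulate the same step running three times, as a re-executed sequence of
# Local Activities might. The key is built from stable identifiers
# (an assumed workflow ID plus a step name), so every attempt reuses it.
key = "transfer-workflow-42/move-money"
results = [bank.transfer("A", "B", 30, key) for _ in range(3)]

print(results)        # [True, False, False]
print(bank.balances)  # {'A': 70, 'B': 30}
```

Without the key check, the same three calls would have moved 90 instead of 30.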

See Idempotency and Durable Execution for best practices around this topic, as well as Max’s community post that lays out the exact execution sequences between Local and regular Activities.

In short: Don’t let your Workflow processing jump out at you and not be done yet when you think it’s done.

#8: Not Using the Time Machine

Use Time Travel to Save Your Mom… or Your Workflows

Like the movie Totally Killer, Temporal also gives you time travel superpowers.

Let’s say you have 100,000 Workflows and they all hit the same bug in one of your Activities. Maybe that bug did some math, and now your calculations are wrong, and all of your customers have 10,000 extra widgets.

But fear not… Temporal has time travel! Every Workflow records every step within it to the Workflow Event History, and you can rewind that history using a feature called Temporal Reset. The same history is what gets replayed to restore a Workflow’s state if the Worker running it crashes.

So in our example, you would deploy a fix for your math Activity to production, and then reset your Workflows in batch back to an earlier step. Temporal will rewind all of your Workflows back before the problematic Activity was called, and then re-execute them through the same fixed code path, and now the math is correct.
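If it helps to picture what “rewind and re-execute” means, here is a toy model in plain Python. It does not use the Temporal API, and the real Event History is far richer than a list of tuples. The names `run`, `reset_before`, and the three steps are made up for illustration. Steps already in the history are replayed from their recorded results, and everything after the reset point runs against the current (fixed) code.

```python
from typing import Callable

History = list[tuple[str, int]]
Steps = list[tuple[str, Callable[[int], int]]]


def run(steps: Steps, start: int, history: History) -> History:
    """Replay recorded results from history, then execute the remaining steps."""
    value = start
    result: History = []
    for index, (name, func) in enumerate(steps):
        if index < len(history):
            value = history[index][1]  # replay: reuse the recorded result
        else:
            value = func(value)        # execute: call the currently deployed code
        result.append((name, value))
    return result


def reset_before(history: History, step_name: str) -> History:
    """Rewind: drop the named step and everything recorded after it."""
    names = [name for name, _ in history]
    return history[: names.index(step_name)]


def load_order(quantity: int) -> int:
    return quantity


def buggy_compute(quantity: int) -> int:
    return quantity + 10_000  # the bug: 10,000 extra widgets


def fixed_compute(quantity: int) -> int:
    return quantity


def ship(quantity: int) -> int:
    return quantity


buggy_steps = [("load_order", load_order), ("compute_widgets", buggy_compute), ("ship", ship)]
fixed_steps = [("load_order", load_order), ("compute_widgets", fixed_compute), ("ship", ship)]

buggy_history = run(buggy_steps, start=5, history=[])
rewound = reset_before(buggy_history, "compute_widgets")
fixed_history = run(fixed_steps, start=5, history=rewound)

print(buggy_history)  # [('load_order', 5), ('compute_widgets', 10005), ('ship', 10005)]
print(fixed_history)  # [('load_order', 5), ('compute_widgets', 5), ('ship', 5)]
```

Note that `load_order` is not called again after the reset; only the steps from the reset point onward are re-executed.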

I’ve worked with several people who’ve done this in production with their workflows, and it’s basically magic! You have a bug in production, you can rewind and replay with the bug fixed, and it fixes all of the workflows automatically.

For more information on this feature, see Temporal Time Traveling: Replay.

In short: Use Temporal time travel to save your mom… or your workflows, in this case.

#7: An Overwhelming Amount of Tribbles

Manage Workflow History Size

Temporal’s Workflow History is amazing: it lets you go back in time, it lets you keep track of every workflow you’ve ever done, and it’s very performant. But one of the things you can’t do is have unlimited Workflow History size, or you end up with Star Trek Tribble Trouble.

Workflow History has limits (albeit pretty high) in order to keep Workflow Replay and Reset performant. New Temporal users can incorrectly assume that the Workflow History can have unlimited size, because Temporal is kind of magic. Unfortunately, if the Event History exceeds 50Ki (51,200) Events, the Workflow Execution is terminated.

Here are a couple of ways to keep the size of Workflow History down:

Don’t keep too much data in your Temporal Workflows. If you need to work with large data, access it externally to the Workflow History.
Use Continue-As-New as needed; this passes the latest relevant state to a new Workflow Execution, with a fresh Event History.
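Here is a rough plain-Python model of the Continue-As-New idea, just to show the arithmetic. It is not SDK code. The threshold of 10,000 events and the cost of 3 events per item are assumptions chosen for the example, and `run_once` / `run_with_continue_as_new` are hypothetical names. Each “run” processes items until its own history gets long, then hands only a cursor to a fresh run.

```python
EVENT_LIMIT = 51_200         # hard limit quoted above
CONTINUE_THRESHOLD = 10_000  # assumed, deliberately conservative cut-off
EVENTS_PER_ITEM = 3          # assumed history cost of one item in this toy model


def run_once(next_item: int, total_items: int) -> tuple[int, int]:
    """Process items until done or until this run's history gets long.

    Returns (next item to process, events recorded in this run).
    """
    events = 0
    while next_item < total_items and events + EVENTS_PER_ITEM <= CONTINUE_THRESHOLD:
        events += EVENTS_PER_ITEM
        next_item += 1
    return next_item, events


def run_with_continue_as_new(total_items: int) -> list[int]:
    """Chain runs, carrying only the cursor forward; return events per run."""
    events_per_run = []
    cursor = 0
    while cursor < total_items:
        cursor, events = run_once(cursor, total_items)
        events_per_run.append(events)  # each new run starts with a fresh history
    return events_per_run


runs = run_with_continue_as_new(total_items=25_000)

print(sum(runs))             # 75000 events in total, more than EVENT_LIMIT
print(len(runs), max(runs))  # 8 9999
```

In this model, a single run would have needed 75,000 events, but split across 8 runs no individual history goes above 9,999.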

In short: Be aware that Workflow History size has a limit, and keep this in mind when designing your workflows.

#6: Crossing the Streams

Test for Determinism

In Ghostbusters movies, there’s a rule: don’t cross the streams of the proton packs. In Temporal, we have a rule as well. Workflows can be rewound, replayed, and reset, and to support that, Workflows must be deterministic (meaning that, given a particular input, a Workflow will always produce the same output).

What you don’t want is to cross one time stream with another time stream and cause weird things to start happening to your timeline. If you need to make a change to Workflow code, even for a Workflow that’s already running somewhere, you can do that as long as you don’t break determinism. To support this, we have the Workflow Versioning and Workflow Patching features, which are both ways to make changes to your Workflow time stream without breaking Replay and Reset for existing Workflows.

Note: Activities do not have to be deterministic and you can make changes to them without consequences. Same with a Workflow that’s still under active development. But if a Workflow is running in production, this is where Versioning and Patching come in.
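As a toy illustration of why a code change can break Replay, and how a patch-style branch avoids it, consider the plain-Python sketch below. It is not the Temporal patching API: the workflow functions, the command names, and the `history_has_patch_marker` flag are all hypothetical. A replay “passes” when the code produces the same commands that were recorded the first time.

```python
def replay_matches(workflow, workflow_input, recorded_commands):
    """Re-run the workflow code and compare its commands with the recorded ones."""
    return workflow(workflow_input) == recorded_commands


def workflow_v1(order_total):
    # Original code that the running Workflow was started with.
    return ["charge_card", "send_receipt"]


def workflow_v2_unpatched(order_total):
    # A new step inserted in the middle: old histories no longer match.
    return ["charge_card", "check_fraud", "send_receipt"]


def workflow_v2_patched(order_total, history_has_patch_marker):
    # The new step only runs for executions that recorded the (assumed) marker,
    # so executions started on v1 still replay along the old path.
    commands = ["charge_card"]
    if history_has_patch_marker:
        commands.append("check_fraud")
    commands.append("send_receipt")
    return commands


recorded = workflow_v1(50)  # history of an execution started on v1

same_code = replay_matches(workflow_v1, 50, recorded)
unpatched = replay_matches(workflow_v2_unpatched, 50, recorded)
patched = replay_matches(lambda total: workflow_v2_patched(total, False), 50, recorded)

print(same_code, unpatched, patched)  # True False True
print(workflow_v2_patched(50, True))  # ['charge_card', 'check_fraud', 'send_receipt']
```

Running this kind of comparison against recorded histories before you deploy is the essence of testing for determinism.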

There’s an excellent course on Versioning Workflows which covers this topic more in-depth, and is highly recommended.

A couple of great resources on testing for determinism are:

In short: Don’t cross your streams. Test for determinism and use Versioning to make sure you don’t have any errors when you go to production.

#5: Leaving Handy Tools Sitting There

Use Features like Signals, Updates, Polling…

Sometimes when you watch a horror movie, the heroes run right by something that would be super helpful—a fire extinguisher, a first aid kit, a crowbar, a flashlight—and it’s SO frustrating! “NO! Why?? Why don’t you pick up the thing that would be super helpful to you?!”

I sometimes have the same feelings when I’m helping people out with Temporal. There are so many great Temporal features in our documentation and training you can rely on, such as:

Signals, Queries, and Updates
Best practices for Activity polling (code sample)
Interactive Workflows hands-on course
Examples of how to use Temporal

In short: Don’t forget to “loot the bodies” and take from / learn about all the helpful Temporal features and resources that are available.

#4: The MEGA Workflow

What does Steve McQueen have to do with Temporal? Sometimes people build workflows that try to do everything: model 50 different processes, automate everything that a whole team or organization might want to do in a single workflow. And that can be very hard to manage, and might also eat a car or a train.

There is a better way! Instead of a single “blob” process, model processes that are simple and straight-forward. If a process kicks off another process which kicks off yet another process, you’re probably looking at three workflows rather than one big workflow that does it all.

How can you tell when you might be facing a “blob” workflow? Here are some questions to ask yourself:

Can you keep this whole process in your head at one time?
Does it adhere to the principles of Domain-Driven Design?
Can it be supported with a two-pizza team?
If something goes wrong, is it very easy to track down by workflow name which part of the application was the culprit?

If you answered “no” to one or more of these, this starts to create some nervousness around operating in production and maintaining the code long-term. It might be better to take certain parts of the process and push them to a “sub” workflow or a workflow that you call separately. Another approach is to abstract concepts away, such as putting a particularly complex piece into its own function and not in the main Workflow code.

There’s also a new feature in Temporal called Nexus which was talked about during the Replay 2024 keynote. Nexus makes breaking out complex processes into simpler steps and management of processes between teams a lot easier.

In short: Workflows should model one process. Break sub-processes into separate workflows, use standard distributed systems patterns for managing them, and check out Nexus.

#3: Arguing With Yourself

Let Workflows Manage Process Status

Sometimes people really love state machines, and they want to have a state machine in their Temporal Workflow. There are valid use cases for this, but I’ve noticed that sometimes the state machine can have a list of the next possible states and valid states to transition to, and the workflow can have the same thing, and sometimes they can be in conflict. At minimum, it’s duplicate maintenance. But worst-case, the state machine may block valid workflow progress that could otherwise continue if there was no state machine. This can happen in the case of a bug or unanticipated state in the state machine, for example.

A great resource about the general area of state management and Temporal is our State Machines Simplified whitepaper.

If you’re going to use Temporal to model your processes and model process state, you probably don’t need an extra state machine. Instead, use the code-first development of Temporal to make State Management really easy.

In short: Don’t have split-brain; use Temporal to manage process State.

#2: Hiding Behind the Chainsaws

Erroring Workflows / Activity Error Handling

This one comes from a classic Geico ad about horror movie characters making decisions that get them killed too soon. With Temporal, don’t make decisions that kill your Workflows too soon!

By default with Temporal, when you call an Activity, it will retry forever. And this is fantastic in the situation where you’re calling an external system such as an API or a database, because if it fails you don’t need to write any retry code. You just write the code to call the external system, and Temporal will automatically retry activities until they work. This is one of Temporal’s core strengths, and it can really simplify the code you write.

There are instances where this approach is suboptimal, however. For example, a workflow needs to finish really quickly or respond super fast so as not to hold up other work. Examples might be a workflow that blocks a UI response, or workflows that must finish within a certain time, such as a daily report. For this type of use case, Temporal allows you to customize the Retry Policies—for example, to only make 3 retry attempts, or stop retrying after X seconds.
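To see what a capped Retry Policy trades away, here is a small plain-Python model of a retry schedule. It does not call the SDK. The parameter values (1 second initial interval, backoff coefficient of 2, 100 second maximum interval, unlimited attempts) reflect my understanding of Temporal’s defaults, so treat them as assumptions and check the documentation for your SDK. The names `retry_delays` and `recovers` are made up, the model ignores how long each attempt itself takes, and `max_attempts` is assumed to count the first try.

```python
from itertools import count, islice


def retry_delays(initial=1.0, backoff=2.0, max_interval=100.0, max_attempts=None):
    """Yield the wait in seconds before each retry.

    max_attempts counts the first try, so 3 means 2 retries; None means unlimited.
    """
    retries = count() if max_attempts is None else range(max_attempts - 1)
    for n in retries:
        yield min(initial * backoff ** n, max_interval)


def recovers(outage_seconds, **policy):
    """True if some retry is scheduled at or after the end of the outage."""
    elapsed = 0.0
    for delay in retry_delays(**policy):
        elapsed += delay
        if elapsed >= outage_seconds:
            return True
    return False


capped = list(retry_delays(max_attempts=3))
unlimited_first_ten = list(islice(retry_delays(), 10))

print(capped)               # [1.0, 2.0]
print(unlimited_first_ten)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 100.0, 100.0, 100.0]
print(recovers(60, max_attempts=3))  # False
print(recovers(60))                  # True
```

In this model, a policy capped at 3 attempts gives up after about 3 seconds of waiting, so a one-minute outage fails the Activity, while the unlimited policy simply rides it out.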

If your business process is fine with the Workflow responding in seconds in the best case but taking 10 minutes or more in the worst case, don’t customize your retry policies! If you do customize them and the Activity runs out of retries, even on a failure it could have recovered from a few retries later, the Activity fails for good, and unless your Workflow code handles that error, your entire Workflow will fail.

So ask yourself: Do you have to try that Action only 3 times? Would it be better if the Workflow eventually succeeded? If so, don’t worry about customization, and let Temporal smart defaults handle it for you!

In short: If it fits your business processes, use the default settings to let Temporal Activities infinitely retry. Your workflows will then automatically succeed and won’t die too soon, unlike the young people in this commercial.

#1: Hiding In a Room With One Exit

Use Compensation to Give Yourself a Successful Outcome

I’m sure you’ve all seen a movie where someone’s running away from a bad guy, and they run into a room and they hide, and then the bad guy walks through the one door and then they’re trapped. Workflows can be like this too! You can write a Workflow that optimistically assumes everything will work out, but what if something doesn’t work?

When first starting out, it can be tempting to say, “Well, if something goes wrong, I’ll just fail the Workflow, and that’ll be the end of that.” But an interesting thing about Temporal is we make it really easy to elevate your thinking: If something goes wrong, can we still set things right?

As a concrete example, let’s say you’re writing a process that does booking for a rental car, a hotel, and a flight. Let’s further suppose that the rental car booking went fine, the hotel booking went fine, but something goes wrong with the flight and it can’t be booked. If you don’t give yourself a way out, a way to succeed, you might stop there and say the process failed. But now, the user has an error, and also two reservations they can’t use.

Fortunately, there is a technique in programming called compensation, sometimes talked about in the context of doing a Saga Pattern. This means that you effectively “undo” the actions that preceded the failed action. In our booking example, we would fail on the airline reservation, and then un-reserve the hotel and rental car.
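Here is a minimal sketch of that compensation flow in plain Python, using the booking example. It is not Temporal SDK code; in a real Workflow each action and compensation would be an Activity. `BookingError`, `book_trip`, and the `reserve` / `cancel` helpers are hypothetical names. Each completed step registers its “undo”, and on a business failure the undos run in reverse order.

```python
class BookingError(Exception):
    """A business failure, for example the flight being full."""


def book_trip(steps):
    """Run (name, action, compensation) steps in order.

    If a step raises BookingError, undo the completed steps in reverse order.
    Returns a log of what happened.
    """
    log = []
    compensations = []
    try:
        for name, action, compensate in steps:
            action()
            log.append(f"booked {name}")
            compensations.append((name, compensate))
    except BookingError as error:
        log.append(f"failed: {error}")
        for name, compensate in reversed(compensations):
            compensate()
            log.append(f"cancelled {name}")
    return log


reservations = set()


def reserve(item):
    return lambda: reservations.add(item)


def cancel(item):
    return lambda: reservations.discard(item)


def full_flight():
    raise BookingError("flight is full")


steps = [
    ("car", reserve("car"), cancel("car")),
    ("hotel", reserve("hotel"), cancel("hotel")),
    ("flight", full_flight, cancel("flight")),
]

log = book_trip(steps)
print(log)
# ['booked car', 'booked hotel', 'failed: flight is full', 'cancelled hotel', 'cancelled car']
print(reservations)  # set()
```

The user still sees that the trip could not be booked, but they are not left holding a car and a hotel they can’t use.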

As mentioned in the previous point, Activities handle technical failures very well, for example when an API is down or a database is temporarily slow. However, Temporal also allows you to think at a higher level about how to manage business failures, such as not being able to make a booking because the plane is full, or a money transfer from Account A to Account B failing because Account B doesn’t exist or is invalid.

Temporal allows you to ask a business stakeholder, “If I can’t do step 2, what should I do about step 1?” The correct thing to do in these situations is to cancel the reservation and put money back into Account A—these are “success” from a business model point of view. And Temporal lets you program the answer into your Workflows so that they can always succeed.

In short: Give yourself an out. Use compensation to have a backup plan in case your main plan fails. (And watch the Scream movies. :-))

But that’s not all! There’s also…

Bonus Spookiness!

These are not necessarily Temporal anti-patterns, but more horror-themed suggestions.

Bonus #11: Wandering into a Dark Alley

Use Metrics & Visibility Built Into Temporal

Pro-tip: Don’t be the horror actress who walks down a dark alley. Temporal offers lots of ways to help you get out of the dark and understand how your Workflows are working:

Temporal SDK Metrics to monitor individual workers and your code’s behavior
Temporal Cloud Metrics to measure the health and performance of Temporal infrastructure
Temporal Web UI with Workflow Execution state and metadata for debugging purposes
Temporal Visibility which allows you to set Search Attributes on your workflows and view, filter, and search for Workflow Executions that have a certain status or important attribute

Bonus #12: Vampires Sucking Up All Your Resources

You don’t want vampires to suck up all of your blood, nor do you want Temporal to run out of capacity. For folks who are self-hosting, you can have a workload on your Temporal server that ends up using a lot more resources than you might expect, and this can cause major performance problems.

If you find yourself impacted by this, use rate limits to prevent “Noisy Neighbour” problems. The community posts Rate limit configuration and best practices and Rate limiting by Namespace allude to strategies you can use here.

(Alternately, move to Temporal Cloud and our SaaS configuration will automatically make sure vampires aren’t sucking up your resources.)

Bonus #13: Splitting The Party!

If you’re a fan of the X-Files, you know that Mulder (the guy who believes in aliens) and Scully (the skeptical scientist) frequently don’t stick together, which leaves Mulder seeing aliens and Scully not, and it drives me nuts.

So don’t go it alone, don’t split the party. Join the community, work with us, let’s build some awesome applications together!
