Why use Temporal over a combination of AWS Step Functions and AWS Lambda?

isobel · August 2, 2020, 6:45pm

I’ve played around with Temporal for a bit and like the functionality it provides. However, for many business workflow use cases, AWS Step Functions seems to provide the same benefits as Temporal without the overhead of installing and maintaining a Temporal service. If there is no need for external signals or child workflows, it looks like AWS Step Functions is the better option in almost every regard.

Is this understanding correct? Apart from the two points mentioned above, what are the areas where Temporal is better than AWS Step Functions?

=== Edits/Additional Information ===
Addition 1: While Step Functions themselves are not meant for business logic definition via code, the typical approach involves defining Step Functions workflows that call out to AWS Lambda functions (which can define business logic code).

ryland · August 2, 2020, 9:03pm

I think @maxim will have a lot to say here as well, so I’m tagging him. While StepFunctions and Temporal may look very similar in the problems they aim to solve, they are completely different classes of solution.

You’re right in noticing that StepFunctions provides some guarantees around reliability and consistency that are comparable to Temporal, but that’s where the similarities end. Temporal enables you to implement business logic with pure code. This is one of the biggest (if not the biggest) differentiators between Temporal and any other solution on the market. When I talk to users who have switched from another solution to Temporal, this is the number one reason I hear. Keep in mind that this isn’t just a matter of preference. Many users migrated from StepFunctions because it was not suited for applications of even moderate complexity. This is because the JSON-based system that StepFunctions offers cannot efficiently express complex logic.

Outside of this fundamental limitation, there are countless features and capabilities (such as the Visibility API) that Temporal offers but StepFunctions does not. It’s also important to note that Temporal is open source and StepFunctions is not, so testing locally will never be as reliable with StepFunctions.

It seems like your biggest hesitation with Temporal is having to host it yourself. Short term, we are more than happy to guide you through the process of setting up Temporal for your use case. Longer term, we are actively working on a hosted offering, which should be available in the relatively near future.

isobel · August 2, 2020, 9:28pm

Thanks for the response @ryland.

It’s true that Step Functions by themselves can’t be used to implement business logic in pure code. However, the typical approach is to tie together Step Functions with AWS Lambda (which can define business logic in pure code) to create a workflow.

ryland · August 2, 2020, 10:59pm

My pushback would be that all of that “tying together” is also your business logic, and it too will grow in complexity quickly. The moment you need to express any dynamic branching or decision logic, StepFunctions will not be a fun or practical solution. There are some other things I thought of:

Not really possible to fan out Lambdas
Limited length of history
Hard scaling limit (want to say it’s 100k executions)
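
For contrast, fan-out and fan-in in a general-purpose language takes only a few lines. The following is a minimal sketch in plain Java using `CompletableFuture`, not Temporal code; `FanOutSketch`, `process`, and `fanOutAndSum` are hypothetical names chosen for illustration. The Temporal Java SDK offers an analogous shape through `Async.function` and `Promise.allOf`.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Illustrative only: plain Java, no Temporal dependency.
final class FanOutSketch {

  // Stand-in for one unit of work (for example, one Lambda or activity call).
  static int process(int item) {
    return item * item;
  }

  // Fan out: start one asynchronous task per item.
  // Fan in: wait for all of them and combine the results.
  static int fanOutAndSum(List<Integer> items) {
    List<CompletableFuture<Integer>> futures = items.stream()
        .map(i -> CompletableFuture.supplyAsync(() -> process(i)))
        .collect(Collectors.toList());
    return futures.stream().mapToInt(CompletableFuture::join).sum();
  }
}
```

The number of parallel branches here is decided at run time by the size of `items`, which is the dynamic behavior the list above refers to.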

maxim · August 3, 2020, 3:42pm

Disclaimer: I have never used or even spent much time learning Step Functions, so correct me if I misrepresent them.

There are multiple dimensions that can help you decide to use one solution over another. Some of them are:

Developer experience
Ability to support current and future use cases.
Unit and integration testing
Updating workflows
Deployment and operations
Performance and scalability

Developer Experience

The overwhelming feedback from Temporal users is that workflow as code is a much friendlier way to write business logic than JSON/YAML/XML. My view is that JSON is good for declarative definitions, but Step Functions are procedural. Any procedural code becomes a mess as soon as it is converted to a configuration language. For example, it has to embed some expression language or, as in the case of Step Functions, require developers to specify expressions as a bunch of JSON nodes.

Here is an example I took from the Step Functions documentation:

"ChoiceStateX": {
  "Type": "Choice",
  "Choices": [
    {
        "Not": {
          "Variable": "$.type",
          "StringEquals": "Private"
        },
        "Next": "Public"
    },
    {
      "Variable": "$.value",
      "NumericEquals": 0,
      "Next": "ValueIsZero"
    },
    {
      "And": [
        {
          "Variable": "$.value",
          "NumericGreaterThanEquals": 20
        },
        {
          "Variable": "$.value",
          "NumericLessThan": 30
        }
      ],
      "Next": "ValueInTwenties"
    }
  ],
  "Default": "DefaultState"
},

"Public": {
  "Type" : "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Foo",
  "Next": "NextState"
},

"ValueIsZero": {
  "Type" : "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Zero",
  "Next": "NextState"
},

"ValueInTwenties": {
  "Type" : "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Bar",
  "Next": "NextState"
},

"DefaultState": {
  "Type": "Fail",
  "Cause": "No Matches!"
}

Note the issues with the above code:

Extreme verbosity
All variables are just references into JSON.
State is implicitly passed from one step to another

Here is the above sample rewritten using Temporal Java SDK:

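      if (!type.equals("Private")) {
        lambda.foo(value);
      } else if (value == 0) {
        lambda.zero(value);
      } else if (value >= 20 && value < 30) {
        lambda.bar(value);
      } else {
        // Equivalent of the Fail state; the "NoMatch" type name is illustrative.
        throw ApplicationFailure.newFailure("No Matches!", "NoMatch");
      }

I know that the number of lines is not the best indicator of simplicity, but roughly 50 lines of JSON are reduced to about 10 lines of Java. Now compare the time it takes to write and read the JSON version versus the Java one.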

Note that in the Java case all the variables are typed and validated at compile time. Note also that a lot of the complexity of business workflows is in data manipulation, data passing, and expressions, and the Step Functions language doesn’t help here at all.

I’ve seen production workflows with thousands of lines of pretty complex stateful code. I cannot imagine a developer who would want to convert them to 10x as many lines of JSON.

Besides the code itself, the developer experience includes tooling. I know that you can draw StepFunction graphs, but because such graphs hide most of the complexity of state and parameter manipulation, they are not really that useful for understanding and authoring. Compare that to Java or Go IDEs, which are really robust tools these days.

What about refactoring? Imagine the time it would take to change the type of value from an integer to a structure with two fields. In Java, using an IDE, such refactoring can be done in minutes across tens of thousands of lines of code. As most of the variable access in Step Functions is done from Lambdas using JSON path expressions, I cannot even imagine the effort to do so.

And there are a lot of other language-specific tools that developers use every day to write and troubleshoot code which are lost with JSON. Is it possible to step through Step Functions code in a debugger, for example?

Ability to support current and future use cases

As Step Functions are very limited in their functionality, it is very easy to run into a situation where some new requirement is not directly supported or requires some unnatural workaround.

As in most real-life use cases, it is impossible to guess future requirements. Thus, with Step Functions, it is possible to run into a situation where your entire application has to be rewritten using some other technology as new requirements appear.

Do you really want to end up in such a situation?

The ability of Temporal to accommodate almost an infinite number of requirements and use cases is already proven. And it is not a surprise as general-purpose programming language code is infinitely flexible.

It is even possible to implement the Step Functions DSL on top of Temporal. Obviously, the reverse is not possible.

Unit and integration testing

AFAIK Step Functions support unit and integration testing through a docker container.

Temporal does support both using standard programming language frameworks. For example JUnit in Java. In my opinion, using a standard framework provides superior experience due to IDE integration and general flexibility.
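
As a minimal sketch of why this matters: once the decision logic is ordinary code, it can be exercised in isolation like any other method. The example below is plain Java with no Temporal or JUnit dependency, and `ChoiceRouting`, `Route`, and `route` are hypothetical names. It mirrors the branches of the `ChoiceStateX` JSON shown earlier, returning the name of the next state instead of invoking a Lambda.

```java
// Illustrative only: the Choice state from the JSON example as a pure function.
final class ChoiceRouting {

  enum Route { PUBLIC, VALUE_IS_ZERO, VALUE_IN_TWENTIES }

  // Branches are evaluated in the same order as the "Choices" array.
  static Route route(String type, int value) {
    if (!type.equals("Private")) {
      return Route.PUBLIC;
    }
    if (value == 0) {
      return Route.VALUE_IS_ZERO;
    }
    if (value >= 20 && value < 30) {
      return Route.VALUE_IN_TWENTIES;
    }
    // Corresponds to the "DefaultState" Fail state.
    throw new IllegalStateException("No Matches!");
  }
}
```

In a JUnit test, each branch then becomes a one-line assertion, for example checking that `ChoiceRouting.route("Private", 25)` returns `Route.VALUE_IN_TWENTIES`.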

Updating workflows

Step Functions support versioning of workflow definitions. But they don’t support any updates to already running workflows. So good luck with fixing a bug in a long-running workflow that was started yesterday. This makes Step Functions not very practical for long-running business processes.

Temporal allows updating already running workflows. There are many use cases that use very long-running workflows.

Deployment and operations

Step Functions are hosted and only support Lambda invocations. This is great for some use cases and can be a huge cost and performance burden for others.

Temporal Technologies is going to provide a hosted version of the service. But until then, if you want to use a fully serverless solution, Step Functions have an edge.

If you want more control over your application deployment due to security and other reasons the open source nature of Temporal is very valuable.

Temporal provides features that are essential for large-scale production use in critical applications. For example, it supports workflow state rollback in case of bugs or other issues. It supports indexing workflows by custom attributes that are updatable during the workflow lifetime. And it supports updates to multiple workflows selected using these indexes. Is it possible to signal or terminate all workflows that belong to a certain business category using the Step Functions API?

Performance and scalability

I’m no expert here, but I heard from Temporal users that, due to the need to rely on Lambdas for everything, Step Functions step-to-step latency is not great.

There are also other limits that are much stricter in Step Functions. For example, the size of the input and output of an activity invocation is limited to 32 KB versus 2 MB in Temporal, and the history size is limited to 25k events versus 200k events in Temporal.

maxim · August 3, 2020, 4:18pm

SAGA

The SAGA pattern is very widely used for microservice orchestration. Step Functions make it really hard to implement. The main problem is that in any nontrivial workflow, the list of executed activities is very dynamic, so the compensation logic has to know which activities were executed and compensate them accordingly.

Look at this trivial workflow example:

    Saga saga = new Saga(sagaOptions);
    try {
      if (reserveCar) {
        String carReservationID = activities.reserveCar(name);
        saga.addCompensation(activities::cancelCar, carReservationID, name);
      }
      if (bookHotel) {
        String hotelReservationID = activities.bookHotel(name);
        saga.addCompensation(activities::cancelHotel, hotelReservationID, name);
      }
      if (bookFlight) {
        String flightReservationID = activities.bookFlight(name);
        saga.addCompensation(activities::cancelFlight, flightReservationID, name);
      }
    } catch (ActivityFailure e) {
      saga.compensate();
      throw e;
    }

Now think how to represent this using Amazon States Language.

Then extrapolate it to a real application with potentially thousands of possible code paths, and the list of compensation activities to run is different for every path.
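
To make the bookkeeping explicit, here is a minimal plain-Java sketch of the idea behind a compensation list. It is not Temporal's `Saga` class: `MiniSaga` and `bookTrip` are hypothetical, the activities are replaced by log entries, and a `flightFails` flag is assumed as a way to simulate a failure. Unlike the workflow above, it does not rethrow, so the resulting log can be inspected.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative only: records undo actions as steps succeed, runs them in reverse.
final class MiniSaga {
  private final Deque<Runnable> compensations = new ArrayDeque<>();

  void addCompensation(Runnable undo) {
    compensations.push(undo); // most recent step is compensated first
  }

  void compensate() {
    while (!compensations.isEmpty()) {
      compensations.pop().run();
    }
  }

  // Only the steps that actually ran get compensated.
  static List<String> bookTrip(boolean reserveCar, boolean bookHotel,
                               boolean bookFlight, boolean flightFails) {
    List<String> log = new ArrayList<>();
    MiniSaga saga = new MiniSaga();
    try {
      if (reserveCar) {
        log.add("reserveCar");
        saga.addCompensation(() -> log.add("cancelCar"));
      }
      if (bookHotel) {
        log.add("bookHotel");
        saga.addCompensation(() -> log.add("cancelHotel"));
      }
      if (bookFlight) {
        if (flightFails) {
          throw new IllegalStateException("flight booking failed");
        }
        log.add("bookFlight");
        saga.addCompensation(() -> log.add("cancelFlight"));
      }
    } catch (IllegalStateException e) {
      saga.compensate();
    }
    return log;
  }
}
```

Calling `MiniSaga.bookTrip(false, true, true, true)` compensates only the hotel booking, because the car was never reserved. The set of compensations is built at run time from the path actually taken, which is the part that is awkward to encode as a static state graph.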

Dylan_Albertazzi · August 5, 2022, 11:08pm

I have nothing against Temporal, it’s a great product. But this article paints step functions in an untrue light. I’ve spent a lot of time with them and built high load step function workflows for Fortune 500s.

I noticed this thread was last updated a couple of years ago. Step Functions can now accomplish almost everything that is mentioned above.

There is dynamic branching for complex workflows.

Clean workflow definitions (not 50 lines) using a tool like serverless framework.

A hard limit quota was mentioned… you can ask for increases.

Support for future use cases.

Performance… I’d argue that if you’re passing more than 256 KB between steps, you could solve the problem more elegantly.

Being in the AWS ecosystem can be a huge win if that’s where the rest of your cloud infra is.

Eduard_Hasa · November 6, 2023, 6:49pm

Really interested in getting an updated version of this blog post. I know a lot has changed in both platforms, Temporal and AWS Step Functions, since 2020. Can we get an updated post comparing the latest versions?

Faheem · December 2, 2024, 4:24pm

Temporal offers more flexibility, reliability, and scalability:

State Management: Temporal provides durable, fault-tolerant state tracking for long-running workflows without requiring external storage or complex integration.
Retry and Error Handling: Advanced retry policies, compensation logic (Saga), and granular failure handling are built-in.
Complex Workflow Logic: Temporal supports dynamic workflow structures, parallelism, and human-in-the-loop workflows, surpassing the rigid structure of Step Functions.
Cost Efficiency: Temporal avoids idle state costs as workflows run event-driven on worker nodes, unlike Step Functions’ charge per state transition.
Developer-Friendly: Temporal uses familiar programming languages and tools (e.g., Go, Java) for workflows, whereas Step Functions require JSON-based definitions.

Conclusion:

Use Step Functions + Lambda for simpler, event-driven workflows tightly integrated with AWS services.
Use Temporal for complex, long-running, multi-cloud, or hybrid workflows.

maxim · December 2, 2024, 7:22pm

I don’t agree with “Use Step Functions + Lambda for simpler, event-driven workflows tightly integrated with AWS services”. Temporal works well for these scenarios as well.

Faheem · December 3, 2024, 4:18pm

Thank you, Maxim!
You’re absolutely right—Temporal works well for simpler, event-driven workflows and can replace Step Functions + Lambda in many scenarios.

That said, Step Functions + Lambda can still be a practical choice if full visibility, stack traces, or Temporal’s advanced features aren’t required. The only downside is added complexity, but if we need the benefits that Temporal provides, we would have to deal with that complexity in our code as well!

sdonovan · December 17, 2024, 5:03pm

Step-Functions are simpler to get started with, absolutely. The challenge is that they can become really complex, really fast. With Temporal, you can get a single deployment, say, into a Java/K8s container. With Step-Functions, you’re always going to be dealing with multiple AWS assets (Step-Function definition(s), Lambda, permissions, etc.). Though, I suppose you don’t HAVE to use Lambda in Step-Functions; you can still pull/poll from a K8s service.

As soon as you need non-trivial error-handling, cross-team service/team work, signaling, etc., it becomes incredibly complex. We went through exactly that. We managed to migrate 10,000s of clients from a colo into AWS using Temporal, which had a lot of fan out/in logic, lots of failure points, and bazillions of workflow/activity actions. Admittedly, we had a large use-case (we did look at Step-Functions, but it wasn’t up to task and would have been massively expensive). I also have a sibling team that started a project with SWF and now regrets it, and is switching to Temporal.

Or, if the concern is the AWS integrations into Step-Functions, all of that is easy to build into Temporal, we do it with EC2, SSM, Route53, S3, etc.

Another advantage: unit and integration testing with Temporal is awesome. We build such tests to prove error-handling, cancellation logic, etc. As the logic (workflows) evolves, this becomes incredibly important.
