Why use Temporal over a combination of AWS Step Functions and AWS Lambda?

I’ve played around with Temporal for a bit and like the functionality it provides. However, for many business workflow use cases, AWS Step Functions seems to provide the same benefits as Temporal without the overhead of installing and maintaining the Temporal service. If there is no need for external signals or child workflows, it looks like AWS Step Functions is the better option in almost every regard.

Is this understanding correct? Apart from the two points mentioned above, what are the areas where Temporal is better than AWS Step Functions?

=== Edits/Additional Information ===
Addition 1: While Step Functions themselves are not meant for business logic definition via code, the typical approach involves defining Step Functions workflows that call out to AWS Lambda functions (which can define business logic code).


I think @maxim will have a lot to say here as well, so I’m tagging him. While StepFunctions and Temporal may look very similar in the problems they aim to solve, they are completely different classes of solution.

You’re right in noticing that StepFunctions provides some guarantees around reliability and consistency that are comparable to Temporal, but that’s where the similarities end. Temporal enables you to implement business logic with pure code. This is one of the biggest (if not the biggest) differentiators between Temporal and any other solution on the market. When I talk to users who have switched from another solution to Temporal, this is the number one reason I hear. Keep in mind that this isn’t just a matter of preference. Many users migrated from StepFunctions because it was not suited to applications of even moderate complexity. This is because the JSON-based system that StepFunctions offers cannot efficiently express complex logic.

Outside of this fundamental limitation, there are countless features and capabilities (such as the Visibility API) that Temporal offers but StepFunctions does not. It’s also important to note that Temporal is open source and StepFunctions is not, so testing locally will never be as reliable with StepFunctions.

It seems like your biggest hesitation with Temporal is having to host it yourself. In the short term, we are more than happy to guide you through the process of setting up Temporal for your use case. Longer term, we are actively working on a hosted offering, which should be available in the relatively near future.


Thanks for the response @ryland.

It’s true that Step Functions by themselves can’t be used to implement business logic in pure code. However, the typical approach is to tie together Step Functions with AWS Lambda (which can define business logic in pure code) to create a workflow.


My pushback would be that all of that “tying together” is also your business logic and it too will grow in complexity quickly. The moment you need to express any dynamic branching or decision logic StepFunctions will not be a fun or practical solution. There are some other things I thought of:

  • Not really possible to fan out Lambdas
  • Limited length of history
  • Hard scaling limit (want to say it’s 100k executions)

Disclaimer: I have never used or even spent much time learning Step Functions, so correct me if I misrepresent them.

There are multiple dimensions that can help you decide to use one solution over another. Some of them are:

  • Developer experience
  • Ability to support current and future use cases.
  • Unit and integration testing
  • Updating workflows
  • Deployment and operations
  • Performance and scalability

Developer Experience

The overwhelming feedback from Temporal users is that workflow as code is a much friendlier way to write business logic than JSON/YAML/XML. My view is that JSON is good for declarative definitions, but Step Functions are procedural. Any procedural code becomes a mess as soon as it is converted to a configuration language. For example, it has to embed some expression language or, as in the case of Step Functions, require developers to specify expressions as a bunch of JSON nodes.

Here is an example I took from the Step Functions documentation:

"ChoiceStateX": {
  "Type": "Choice",
  "Choices": [
    {
        "Not": {
          "Variable": "$.type",
          "StringEquals": "Private"
        },
        "Next": "Public"
    },
    {
      "Variable": "$.value",
      "NumericEquals": 0,
      "Next": "ValueIsZero"
    },
    {
      "And": [
        {
          "Variable": "$.value",
          "NumericGreaterThanEquals": 20
        },
        {
          "Variable": "$.value",
          "NumericLessThan": 30
        }
      ],
      "Next": "ValueInTwenties"
    }
  ],
  "Default": "DefaultState"
},

"Public": {
  "Type" : "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Foo",
  "Next": "NextState"
},

"ValueIsZero": {
  "Type" : "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Zero",
  "Next": "NextState"
},

"ValueInTwenties": {
  "Type" : "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Bar",
  "Next": "NextState"
},

"DefaultState": {
  "Type": "Fail",
  "Cause": "No Matches!"
}

Note the issues with the above code:

  • Extreme verbosity
  • All variables are just references into JSON.
  • State is implicitly passed from one step to another

Here is the above sample rewritten using Temporal Java SDK:

      if (!type.equals("Private")) {
        lambda.foo(value);
      } else if (value == 0) {
        lambda.zero(value);
      } else if (value >= 20 && value < 30) {
        lambda.bar(value);
      } else {
        throw ApplicationFailure.newFailure("No Matches!", "NoMatch");
      }

I know that number of lines is not the best indicator of simplicity, but about 50 lines of JSON are reduced to fewer than 10 lines of Java. Now compare the time it takes to write and read the JSON version versus the Java one.

Note that in the Java case all the variables are typed and validated at compile time. A lot of the complexity of business workflows is in data manipulation, data passing, and expressions, and the Step Functions language doesn’t help here at all.
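To make the typing point concrete, here is a minimal, self-contained sketch of the same routing in plain Java. It mirrors the conditions of ChoiceStateX in the JSON sample. The names Lambdas, ChoiceRouting, and routedTo are made up for illustration and are not part of the Temporal SDK; in a real workflow the calls would go through activity stubs rather than a hand-written interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical typed facade over the three Lambda functions from the JSON sample.
interface Lambdas {
  void foo(int value);
  void zero(int value);
  void bar(int value);
}

final class ChoiceRouting {
  // Mirrors ChoiceStateX: same conditions, same order, same default.
  static void route(String type, int value, Lambdas lambda) {
    if (!type.equals("Private")) {
      lambda.foo(value);                              // "Public" state
    } else if (value == 0) {
      lambda.zero(value);                             // "ValueIsZero" state
    } else if (value >= 20 && value < 30) {
      lambda.bar(value);                              // "ValueInTwenties" state
    } else {
      throw new IllegalStateException("No Matches!"); // "DefaultState"
    }
  }

  // Helper for trying the routing out: reports which function was called.
  static String routedTo(String type, int value) {
    List<String> calls = new ArrayList<>();
    route(type, value, new Lambdas() {
      @Override public void foo(int v) { calls.add("foo"); }
      @Override public void zero(int v) { calls.add("zero"); }
      @Override public void bar(int v) { calls.add("bar"); }
    });
    return calls.get(0);
  }
}
```

Passing a String where route expects an int, or calling a method that Lambdas does not declare, is rejected by the compiler before anything runs. A mistyped JSON path only shows up at execution time.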

I’ve seen production workflows made up of thousands of lines of pretty complex stateful code. I cannot imagine a developer who would want to convert them to 10x as many lines of JSON.

Besides the code itself, the developer experience includes tooling. I know that you can draw StepFunction graphs, but as such graphs hide most of the complexity of state and parameter manipulation, they are not really that useful for understanding and authoring. Compare that to Java or Go IDEs, which are really robust tools these days.

What about refactoring? Imagine the time it would take to change the type of value from an integer to a structure with two fields. In Java using an IDE, such refactoring can be done in minutes across tens of thousands of lines of code. As most of the variable access in Step Functions is done from lambdas using JSON path expressions, I cannot even imagine the effort it would take there.

And there are a lot of other language-specific tools that developers use every day to write and troubleshoot code which are lost with JSON. Is it possible to step through Step Functions code in a debugger, for example?

Ability to support current and future use cases

As Step Functions are very limited in their functionality, it is very easy to run into a situation where some new requirement is not directly supported or requires some unnatural workaround.

As in most real-life use cases, it is impossible to guess future requirements. Thus, with Step Functions, it is possible to end up having to rewrite your entire application using some other technology as new requirements appear.

Do you really want to end up in such a situation?

The ability of Temporal to accommodate almost an infinite number of requirements and use cases is already proven. And it is not a surprise as general-purpose programming language code is infinitely flexible.

It is even possible to implement Step Functions DSL on top of Temporal. Obviously the reverse is not.
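As a rough illustration of that direction, here is a toy interpreter for a state-machine definition, written in plain Java. It is a hypothetical mini-DSL, not Amazon States Language, and it does not use the Temporal SDK. The assumption is that in a real workflow the task step would invoke an activity instead of appending to a list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// A deliberately tiny, hypothetical state-machine DSL; NOT Amazon States Language.
final class MiniStateMachine {
  // A state inspects the input, may record an executed action,
  // and returns the name of the next state (null ends the run).
  interface State {
    String step(Map<String, Object> input, List<String> executed);
  }

  // Task: run one named action, then go to `next`.
  static State task(String action, String next) {
    return (input, executed) -> {
      executed.add(action); // in a workflow this would be an activity call
      return next;
    };
  }

  // Choice: go to `next` if the predicate matches, otherwise to `otherwise`.
  static State choice(Predicate<Map<String, Object>> when, String next, String otherwise) {
    return (input, executed) -> when.test(input) ? next : otherwise;
  }

  private final Map<String, State> states = new HashMap<>();

  MiniStateMachine add(String name, State state) {
    states.put(name, state);
    return this;
  }

  // Interprets the definition from `start` and returns the actions run, in order.
  List<String> run(String start, Map<String, Object> input) {
    List<String> executed = new ArrayList<>();
    String current = start;
    while (current != null) {
      State state = states.get(current);
      if (state == null) {
        throw new IllegalStateException("Unknown state: " + current);
      }
      current = state.step(input, executed);
    }
    return executed;
  }

  // Example definition: route on `type`, then run a shared final task.
  static MiniStateMachine sample() {
    return new MiniStateMachine()
        .add("Check", choice(in -> !"Private".equals(in.get("type")), "Public", "Private"))
        .add("Public", task("foo", "Done"))
        .add("Private", task("bar", "Done"))
        .add("Done", task("notify", null));
  }
}
```

The interpreter loop is a dozen lines of ordinary code, which is why a DSL can sit on top of a general-purpose workflow engine. Going the other way would mean expressing arbitrary code in the DSL.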

Unit and integration testing

AFAIK Step Functions support unit and integration testing through a Docker container.

Temporal supports both using standard programming language frameworks, for example JUnit in Java. In my opinion, using a standard framework provides a superior experience due to IDE integration and general flexibility.

Updating workflows

Step Functions support versioning of workflow definitions. But they don’t support any updates to already running workflows. So good luck with fixing a bug in a long-running workflow that was started yesterday. This makes Step Functions not very practical for long-running business processes.

Temporal allows updating already running workflows. There are many use cases that use very long-running workflows.

Deployment and operations

Step Functions are hosted and only support Lambda invocations. This is great for some use cases and can be a huge cost and performance burden for others.

Temporal Technologies is going to provide a hosted version of the service. But until then, if you want to use a fully serverless solution, Step Functions have an edge.

If you want more control over your application deployment due to security and other reasons the open source nature of Temporal is very valuable.

Temporal provides features that are essential for large-scale production use in critical applications. For example, it supports workflow state rollback in case of bugs or other issues. It supports indexing workflows by custom attributes that are updatable during the workflow lifetime. And it supports updates to multiple workflows selected using these indexes. Is it possible to signal or terminate all workflows that belong to a certain business category using the Step Functions API?

Performance and scalability

I’m no expert here, but I have heard from Temporal users that, because Step Functions rely on Lambdas for everything, step-to-step latency is not great.

There are also other limits that are much stricter in Step Functions. For example, the size of the input and output of an activity invocation is limited to 32 KB versus 2 MB in Temporal, and history size is limited to 25k events versus 200k events in Temporal.


SAGA

The SAGA pattern is very widely used for microservice orchestration, and Step Functions make it really hard to implement. The main problem is that in any nontrivial workflow, the list of executed activities is very dynamic, so the compensation logic has to know which activities were executed and compensate them accordingly.

Look at this trivial workflow example:

    Saga saga = new Saga(sagaOptions);
    try {
      if (reserveCar) {
        String carReservationID = activities.reserveCar(name);
        saga.addCompensation(activities::cancelCar, carReservationID, name);
      }
      if (bookHotel) {
        String hotelReservationID = activities.bookHotel(name);
        saga.addCompensation(activities::cancelHotel, hotelReservationID, name);
      }
      if (bookFlight) {
        String flightReservationID = activities.bookFlight(name);
        saga.addCompensation(activities::cancelFlight, flightReservationID, name);
      }
    } catch (ActivityFailure e) {
      saga.compensate();
      throw e;
    }

Now think how to represent this using Amazon States Language.

Then extrapolate it to a real application with potentially thousands of possible code paths, and the list of compensation activities to run is different for every path.
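For readers who want to see the bookkeeping behind this, here is a plain-Java sketch of the idea. MiniSaga and TripBooking are hypothetical names and this is not the Temporal SDK's Saga class. The forward steps are simulated with log entries, the failOnFlight flag stands in for a failing activity, and running compensations most recent first is a choice made for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Plain-Java illustration of the idea behind a saga helper; this is NOT
// the Temporal SDK's Saga class, just a sketch of the bookkeeping.
final class MiniSaga {
  private final Deque<Runnable> compensations = new ArrayDeque<>();

  // Register an undo step right after the forward step succeeded.
  void addCompensation(Runnable undo) {
    compensations.push(undo); // push puts the most recent undo first
  }

  // Run the registered undo steps, most recent first.
  void compensate() {
    while (!compensations.isEmpty()) {
      compensations.pop().run();
    }
  }
}

final class TripBooking {
  // Books the requested parts of a trip; `failOnFlight` simulates a failing
  // flight activity. Returns a log of everything that happened, in order.
  static List<String> book(boolean reserveCar, boolean bookHotel,
                           boolean bookFlight, boolean failOnFlight) {
    List<String> log = new ArrayList<>();
    MiniSaga saga = new MiniSaga();
    try {
      if (reserveCar) {
        log.add("reserveCar");
        saga.addCompensation(() -> log.add("cancelCar"));
      }
      if (bookHotel) {
        log.add("bookHotel");
        saga.addCompensation(() -> log.add("cancelHotel"));
      }
      if (bookFlight) {
        if (failOnFlight) {
          throw new IllegalStateException("flight booking failed");
        }
        log.add("bookFlight");
        saga.addCompensation(() -> log.add("cancelFlight"));
      }
    } catch (IllegalStateException e) {
      saga.compensate(); // only undoes the steps that actually ran
    }
    return log;
  }
}
```

The list of compensations is built at run time from whichever branches executed, so no code path needs its own hand-written undo sequence. A static state machine would have to spell out each of those sequences separately.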


I have nothing against Temporal, it’s a great product. But this article paints step functions in an untrue light. I’ve spent a lot of time with them and built high load step function workflows for Fortune 500s.

I noticed this thread was updated a couple of years ago. Step Functions can now accomplish almost everything that is mentioned above.

There is dynamic branching for complex workflows.

Clean workflow definitions (not 50 lines) using a tool like serverless framework.

A hard limit quota was mentioned… you can ask for increases.

Support for future use cases.

Performance… I’d argue that if you’re passing more than 256 KB between steps, you could solve the problem more elegantly.

Being in the AWS ecosystem can be a huge win if that’s where the rest of your cloud infra is.


I’m really interested in getting an updated version of this blog post. I know a lot has changed on both platforms, Temporal and AWS Step Functions, since 2020. Can we get an updated comparison of the latest versions?