Disclaimer: I have never used Step Functions or even spent much time learning them, so correct me if I misrepresent them.
There are multiple dimensions that can help you decide to use one solution over another. Some of them are:
- Developer experience
- Ability to support current and future use cases
- Unit and integration testing
- Updating workflows
- Deployment and operations
- Performance and scalability
Developer Experience
The overwhelming feedback from Temporal users is that workflow as code is a much friendlier way to write business logic than JSON/YAML/XML. My view is that JSON is good for declarative definitions, but Step Functions are procedural, and any procedural code becomes a mess as soon as it is converted to a configuration language. Such a language has to either embed some expression language or, as in the case of Step Functions, require developers to specify expressions as a bunch of JSON nodes.
Here is an example I took from the Step Functions documentation:
"ChoiceStateX": {
  "Type": "Choice",
  "Choices": [
    {
      "Not": {
        "Variable": "$.type",
        "StringEquals": "Private"
      },
      "Next": "Public"
    },
    {
      "Variable": "$.value",
      "NumericEquals": 0,
      "Next": "ValueIsZero"
    },
    {
      "And": [
        {
          "Variable": "$.value",
          "NumericGreaterThanEquals": 20
        },
        {
          "Variable": "$.value",
          "NumericLessThan": 30
        }
      ],
      "Next": "ValueInTwenties"
    }
  ],
  "Default": "DefaultState"
},
"Public": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Foo",
  "Next": "NextState"
},
"ValueIsZero": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Zero",
  "Next": "NextState"
},
"ValueInTwenties": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Bar",
  "Next": "NextState"
},
"DefaultState": {
  "Type": "Fail",
  "Cause": "No Matches!"
}
Note the issues with the above code:
- Extreme verbosity
- All variables are just references into JSON
- State is implicitly passed from one step to another
Here is the above sample rewritten using the Temporal Java SDK:
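if (!type.equals("Private")) {
  lambda.foo(value);   // "Public" state
} else if (value == 0) {
  lambda.zero(value);  // "ValueIsZero" state
} else if (value >= 20 && value < 30) {
  lambda.bar(value);   // "ValueInTwenties" state
} else {
  throw ApplicationFailure.newFailure("No Matches!", "NoMatch");
}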
I know that the number of lines is not the best indicator of simplicity, but roughly 50 lines of JSON are reduced to fewer than 10 lines of Java. Now compare the time it takes someone to write and read the JSON version versus the Java one.
Note that in the Java case all the variables are typed and validated at compile time. A lot of the complexity of business workflows is in data manipulation, data passing, and expressions, and the Step Functions language doesn’t help here at all.
I’ve seen production workflows consisting of thousands of lines of pretty complex stateful code. I cannot imagine a developer who would want to convert them to ten times as many lines of JSON.
Besides the code itself, the developer experience includes tooling. I know that you can draw Step Functions graphs, but as such graphs hide most of the complexity of state and parameter manipulation, they are not really that useful for understanding and authoring. Compare that to Java or Go IDEs, which are really robust tools these days.
What about refactoring? Imagine the time it would take to change the type of value from an integer to a structure with two fields. In Java using an IDE, such a refactoring can be done in minutes across tens of thousands of lines of code. As most of the variable access in Step Functions is done from lambdas using JSON path expressions, I cannot even imagine the effort to do so.
And there are a lot of other language-specific tools that developers use every day to write and troubleshoot code, all of which are lost with JSON. Is it possible to step through Step Functions code in a debugger, for example?
Ability to support current and future use cases
As Step Functions are very limited in their functionality, it is very easy to run into a situation where some new requirement is not directly supported or requires some unnatural workarounds.
As in most real-life use cases, it is impossible to guess future requirements. Thus with Step Functions it is possible to run into a situation where your entire application has to be rewritten using some other technology as new requirements appear.
Do you really want to end up in such a situation?
The ability of Temporal to accommodate almost an infinite number of requirements and use cases is already proven. And it is not a surprise as general-purpose programming language code is infinitely flexible.
It is even possible to implement the Step Functions DSL on top of Temporal. The reverse is obviously not possible.
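To give a feel for why a DSL can be layered on top of general-purpose code, here is a minimal sketch of an interpreter for a Choice state, written in plain Java. It is not the Step Functions implementation and it does not use the Temporal SDK; ChoiceRule, ChoiceState and ChoiceDemo are hypothetical names, and the rules are built in code rather than parsed from JSON. Inside a real workflow, the same loop would presumably run in the workflow method, with Task states mapped to activity calls.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, heavily simplified model of one rule in a Choice state.
final class ChoiceRule {
    final Predicate<Map<String, Object>> condition;
    final String next;

    ChoiceRule(Predicate<Map<String, Object>> condition, String next) {
        this.condition = condition;
        this.next = next;
    }
}

// Hypothetical model of a Choice state: ordered rules plus a default target.
final class ChoiceState {
    private final List<ChoiceRule> choices;
    private final String defaultState;

    ChoiceState(List<ChoiceRule> choices, String defaultState) {
        this.choices = choices;
        this.defaultState = defaultState;
    }

    // Returns the target of the first matching rule, or the default state.
    String evaluate(Map<String, Object> input) {
        for (ChoiceRule rule : choices) {
            if (rule.condition.test(input)) {
                return rule.next;
            }
        }
        return defaultState;
    }
}

final class ChoiceDemo {
    // Rebuilds the ChoiceStateX rules from the JSON example by hand.
    static ChoiceState choiceStateX() {
        return new ChoiceState(List.of(
                new ChoiceRule(in -> !"Private".equals(in.get("type")), "Public"),
                new ChoiceRule(in -> number(in) == 0, "ValueIsZero"),
                new ChoiceRule(in -> number(in) >= 20 && number(in) < 30, "ValueInTwenties")),
                "DefaultState");
    }

    // Reads the "value" field as a number, mirroring "$.value" in the JSON.
    static double number(Map<String, Object> in) {
        return ((Number) in.get("value")).doubleValue();
    }

    public static void main(String[] args) {
        // prints "ValueInTwenties"
        System.out.println(choiceStateX().evaluate(Map.of("type", "Private", "value", 25)));
    }
}
```

The interpreter itself is a dozen lines; everything the DSL would add (parsing, more comparison operators, other state types) is ordinary code on top of it.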
Unit and integration testing
AFAIK Step Functions support unit and integration testing through a Docker container.
Temporal supports both using standard programming language frameworks, for example JUnit in Java. In my opinion, using a standard framework provides a superior experience due to IDE integration and general flexibility.
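As a rough illustration of what testing with ordinary language tools looks like, the sketch below checks the branching from the Choice state example against a hand-written fake. It deliberately avoids both JUnit and the Temporal SDK to stay self-contained; LambdaActivities, Routing, RecordingLambda and RoutingCheck are hypothetical names, and in a real project the same assertions would live in a JUnit test that drives the workflow through the SDK's test support.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical activity interface standing in for the three Lambda functions.
interface LambdaActivities {
    void foo(int value);
    void zero(int value);
    void bar(int value);
}

// The branching logic under test, written as ordinary Java.
final class Routing {
    static void route(String type, int value, LambdaActivities lambda) {
        if (!type.equals("Private")) {
            lambda.foo(value);
        } else if (value == 0) {
            lambda.zero(value);
        } else if (value >= 20 && value < 30) {
            lambda.bar(value);
        } else {
            // Plays the role of the Fail state in this plain-Java sketch.
            throw new IllegalStateException("No Matches!");
        }
    }
}

// Hand-written fake that records which activity was called and with what.
final class RecordingLambda implements LambdaActivities {
    final List<String> calls = new ArrayList<>();

    @Override public void foo(int value) { calls.add("foo:" + value); }
    @Override public void zero(int value) { calls.add("zero:" + value); }
    @Override public void bar(int value) { calls.add("bar:" + value); }
}

final class RoutingCheck {
    public static void main(String[] args) {
        RecordingLambda fake = new RecordingLambda();
        Routing.route("Private", 25, fake);
        System.out.println(fake.calls); // prints [bar:25]
    }
}
```

Because the logic only depends on an interface, no container or deployed function is needed to exercise every branch, and the test runs inside the IDE like any other.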
Updating workflows
Step Functions support versioning of workflow definitions. But they don’t support any updates to already running workflows. So good luck with fixing a bug in a long-running workflow that was started yesterday. This makes Step Functions not very practical for long-running business processes.
Temporal allows updating already running workflows, which matters because many use cases rely on very long-running workflows.
Deployment and operations
Step Functions are hosted and only support Lambda invocations. This is great for some use cases and can be a huge cost and performance burden for others.
Temporal Technologies is going to provide a hosted version of the service. But until then, if you want a fully serverless solution, Step Functions have an edge.
If you want more control over your application deployment for security or other reasons, the open source nature of Temporal is very valuable.
Temporal provides features that are essential for large scale production use in critical applications. For example, it supports workflow state rollback in case of bugs or other issues. It supports indexing workflows by custom attributes that are updatable during the workflow lifetime. And it supports updates to multiple workflows selected using these indexes. Is it possible to signal or terminate all workflows that belong to a certain business category using the Step Functions API?
Performance and scalability
I’m no expert here, but I have heard from Temporal users that, due to the need to rely on lambdas for everything, Step Functions step-to-step latency is not great.
There are also other limits that are much stricter in Step Functions. For example, the size of the input and output of an activity invocation is limited to 32 KB versus 2 MB in Temporal, and the history size is limited to 25k events versus 200k events in Temporal.