Workflow Maintainability: Abstract into a state machine vs. standard workflow function

Hi all! Wanted to discuss the topic of state machines and adding an abstraction on Temporal.

I’ve recently joined a sub-team within the product I work on, and we’re in the process of planning a rewrite of our workflows using Temporal, and then introducing it to the broader organization.

Some on the team believe that workflow-as-function would become too complex over time and thus requires a further abstraction. One concern is that if the workflows are written as standard functions, they might eventually grow to hundreds or thousands of lines and become too complex to unit test.

Their abstraction idea for this is a sort of state machine framework around Temporal, to 1) allow developers to write the workflow without much knowledge of Temporal and 2) attempt to make complex workflows easier to maintain. In this framework, developers will have to define the business logic across separate “states”, where each state is a struct implementing an interface:

  • Execute
  • GetNextState
  • GetStatus - returns workflow status while in this state (separate from Temporal’s workflow status)
  • IsFinalState

Then, there is a generic Temporal workflow which loops and calls an activity which calls the current state’s Execute function, then calls GetNextState to determine the next state.
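For anyone trying to picture the proposal, here is a minimal sketch of that interface and the generic loop. It is plain Go with no Temporal SDK, so the activity call is just a direct method call. Everything in it is an assumption for illustration: the method signatures, the `map[string]string` data bag, the three example states, and the choice to execute the final state before stopping. None of it comes from the actual framework.

```go
package main

import (
	"errors"
	"fmt"
)

// State mirrors the four methods listed above. The signatures are
// hypothetical; the post only names the methods.
type State interface {
	Execute(data map[string]string) error
	GetNextState(data map[string]string) State
	GetStatus() string
	IsFinalState() bool
}

// reserveStock, chargeCard and done are made-up example states.
type reserveStock struct{}

func (reserveStock) Execute(d map[string]string) error      { d["reserved"] = "true"; return nil }
func (reserveStock) GetNextState(d map[string]string) State { return chargeCard{} }
func (reserveStock) GetStatus() string                      { return "RESERVING" }
func (reserveStock) IsFinalState() bool                     { return false }

type chargeCard struct{}

func (chargeCard) Execute(d map[string]string) error {
	// This state depends on a key written by an earlier state.
	if d["reserved"] != "true" {
		return errors.New("nothing reserved")
	}
	d["charged"] = "true"
	return nil
}
func (chargeCard) GetNextState(d map[string]string) State { return done{} }
func (chargeCard) GetStatus() string                      { return "CHARGING" }
func (chargeCard) IsFinalState() bool                     { return false }

type done struct{}

func (done) Execute(d map[string]string) error      { return nil }
func (done) GetNextState(d map[string]string) State { return nil }
func (done) GetStatus() string                      { return "DONE" }
func (done) IsFinalState() bool                     { return true }

// RunStateMachine stands in for the generic workflow: execute the current
// state, stop if it is final, otherwise ask it for the next state.
func RunStateMachine(start State, data map[string]string) ([]string, error) {
	var statuses []string
	for s := start; ; s = s.GetNextState(data) {
		statuses = append(statuses, s.GetStatus())
		if err := s.Execute(data); err != nil {
			return statuses, fmt.Errorf("state %s: %w", s.GetStatus(), err)
		}
		if s.IsFinalState() {
			return statuses, nil
		}
	}
}

func main() {
	statuses, err := RunStateMachine(reserveStock{}, map[string]string{})
	fmt.Println(statuses, err)
}
```

Note that the states can only talk to each other through the shared `data` map, which is the part of the design the replies below focus on.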

Personally I’d rather write workflows as regular functions and use standard programming organization techniques (like breaking it up into smaller functions) to manage complexity, but I will need to convince others of this :)

I’d like to get any thoughts from others in the Temporal community/company on this sort of idea. Has anyone tried something like this yet? Would a state machine framework/abstraction like this make workflows easier to maintain in practice?

And lastly, are there any examples of some more complex Temporal workflows out in the open that we could reference?

If state machines were a better way to organize complex code, then all software, including the Linux kernel, would be written as them. I’m not claiming that there are no situations when they are useful. They are useful when an event can apply to many states and requires different handling in each state. BTW the Temporal SDKs rely heavily on state machines internally. Activity cancellation is handled very differently depending on whether the ScheduleActivityTask command has not yet been sent to the service or the activity has already started. Here is the activity state transition diagram: [diagram not shown]

I have yet to encounter a situation in which a business application would benefit from state machine notation as opposed to logic specified in a programming language directly. Most of the complexity of nontrivial workflows is not in the sequencing of function calls. It is in executing multiple branches and callbacks in parallel and in state representation. State machine based solutions use a global bag of properties as shared data. This creates very tight coupling between different parts of the application. At some point, this complexity makes the application practically impossible to maintain. IMHO the only advantage of state machines is the ability to generate a diagram from them. But in my experience, this benefit only holds in trivial cases. Any nontrivial code has too many possible states, and the diagram becomes too complex to make sense of.
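To make the "global bag of properties" point concrete, here is a small sketch in plain Go. The step names, the `Bag` type and the keys are all invented for illustration. It contrasts two steps that communicate through a shared map with the same two steps passing a typed value.

```go
package main

import "fmt"

// Bag style: every step reads and writes one shared map (hypothetical).
type Bag map[string]any

func quoteStepBag(b Bag) { b["priceCents"] = 1250 }

func chargeStepBag(b Bag) error {
	// The reader spells the key differently from the writer. This compiles
	// and only fails at runtime, far from the step that wrote the value.
	cents, ok := b["price_cents"].(int)
	if !ok {
		return fmt.Errorf("price_cents missing from bag")
	}
	b["charged"] = cents
	return nil
}

// Typed style: the dependency between steps is visible in the signatures,
// and a misspelled field name would be a compile error.
type Quote struct{ PriceCents int }

func quoteStep() Quote       { return Quote{PriceCents: 1250} }
func chargeStep(q Quote) int { return q.PriceCents }

func main() {
	b := Bag{}
	quoteStepBag(b)
	fmt.Println("bag style:", chargeStepBag(b))
	fmt.Println("typed style:", chargeStep(quoteStep()))
}
```

With two steps the mismatch is easy to spot. The coupling becomes a maintenance problem when dozens of states read and write the same bag and nothing in the code says which state depends on which key.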

Use standard language techniques to deal with complexity. Temporal allows unit testing individual classes and methods.

There is a reason Temporal is popular: it uses a durable execution abstraction that is very generic and scales to any complexity. State machines as a way to specify workflows have existed for 50+ years. How many of them are really popular and used to write complex business flows?


Thanks for the reply, Maxim. Good point about most other complex software not using state machines.

Most of the complexity of nontrivial workflows is not in the sequencing of function calls. It is in executing multiple branches and callbacks in parallel and in state representation.

I think this hits the nail on the head. And Temporal pretty much takes care of all these complexities.

@maxim I hate to be a devil’s advocate here, but if state machines weren’t the best solution for authoring workflows, then why do most workflow tools (most of which are more popular than Temporal) essentially use state machines to define workflows?

Zapier, n8n, Pipedream, Make, Tray.io, and numerous other tools - these are all basic, sequential state machines. It’s a bit disingenuous to claim that plain code is better for defining workflows, especially if you need to share and explain that logic to the rest of the team, including non-developers.

State machines may not be explicitly mentioned everywhere, but they’re implicitly used in a lot of code, especially in many workflow tools. Code is not sufficient for understanding the logic clearly, especially as the complexity increases.

By this logic, any code is a state machine. Thus any Temporal workflow is indeed a state machine.

I’m talking about explicit state machines used to define application logic. None of the tools mentioned uses a state machine definition language. They are more like abstract syntax trees with a visual representation.

I’m also discussing tools for developers. None of these tools targets developers, and they can be used only for applications of very limited complexity.

I’d consider all those tools as essentially flow chart builders. They don’t utilize or expose the concept of finite state machines.

On “only”:

I agree about the non-benefit of real-world diagrams. However, for me state machines allow a program to reason about the code and, e.g., easily discover and inspect the available state transitions at runtime for a WF instance. So if you can live with limitations such as sequential execution, and for some reason the workflows need to be customizable by poorly trained staff, a simple state machine can be a proper solution.

I created a DSL for that, including an interpreter that uses Temporal to execute such state machines, and IMO the two play well together, even if the DSL hides a lot of Temporal features. But that’s intentional. It makes it easier to use, assuming the limitations don’t hinder planned use cases.
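The runtime inspection mentioned above could look roughly like the following. This is not the DSL described in this post, just a minimal plain Go sketch with a made-up approval flow. Because the transitions are data rather than control flow, a program can list what is allowed from the current state.

```go
package main

import (
	"fmt"
	"sort"
)

type State string

// transitions is plain data, so it can be inspected at runtime.
// The states are hypothetical.
var transitions = map[State][]State{
	"draft":     {"submitted"},
	"submitted": {"approved", "rejected"},
	"approved":  {},
	"rejected":  {"draft"},
}

// Available returns a sorted copy of the states reachable in one step.
func Available(from State) []State {
	next := append([]State(nil), transitions[from]...)
	sort.Slice(next, func(i, j int) bool { return next[i] < next[j] })
	return next
}

// CanMove reports whether a direct transition from -> to is defined.
func CanMove(from, to State) bool {
	for _, s := range transitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(Available("submitted"))
	fmt.Println(CanMove("draft", "approved"))
}
```

A UI for non-developers could use something like `Available` to show only the valid next steps, which is hard to do when the flow lives in arbitrary code.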


I believe it works well in your case because of the domain specificity. I love DSLs when they are domain specific which allows for hiding most of the workflow definition complexity. I have 0 objections to a DSL that uses state machine definitions if this is a good fit for the domain.

What I advise against is using state machine definitions as a general-purpose workflow definition language.