Temporal for real-time event-based DAG workflows

Hey Temporal community,

I am currently evaluating Temporal for an event-driven workflow platform, but I don’t know whether it can handle my requirements, so I would kindly ask for assistance.

The planned scenario is the following: I want to build an event-driven DAG platform where users can create their own workflows. A workflow consists of connected nodes and edges: nodes for data manipulation such as aggregation, merge, and split, but also things like OpenAPI or database connectors for fetching external data, or handing data off to Spark for large batch processing. For a general idea, think of something like n8n. Each node would be a Temporal activity in this case.
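For illustration, here is a minimal sketch (Go, Temporal Go SDK) of the “one activity per node” idea. The `Node` type, the `ExecuteNode` activity name, and the assumption that nodes arrive in topological order are all placeholders of mine; error handling and parallel branches are omitted:

```go
package dag

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// Node is a hypothetical representation of one user-defined DAG node.
type Node struct {
	ID   string
	Kind string // e.g. "aggregate", "merge", "http", "spark"
}

// RunDAG executes the user-defined nodes in topological order,
// one Temporal activity per node.
func RunDAG(ctx workflow.Context, nodes []Node) error {
	ao := workflow.ActivityOptions{StartToCloseTimeout: 5 * time.Minute}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// prev holds a small reference (e.g. a storage key) to the previous
	// node's output, never the data itself.
	var prev string
	for _, n := range nodes {
		var out string
		// ExecuteNode is a hypothetical activity that dispatches on n.Kind.
		if err := workflow.ExecuteActivity(ctx, "ExecuteNode", n, prev).Get(ctx, &out); err != nil {
			return err
		}
		prev = out
	}
	return nil
}
```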

The workflows should be triggered by an event coming from either Kafka or NATS.

  1. I don’t know whether the platform can handle the load. Since users create their own workflows, there will be lots of workflows running simultaneously. To put a number on it: it should still be performant with millions of workflows running or being started at once, assuming we start a new workflow per incoming event instead of using a signal to let an already-started workflow wait (the sketch after these questions illustrates both styles). Does this work with Temporal?

  2. Ideally I want latency as low as possible. However, since Temporal persists every state transition to the database, there will be some overhead; I think I would have to live with this. Temporal Cloud has a write-ahead log (WAL). Setting up something like this would be impossible on the self-hosted version, right?

  3. These workflows can handle large amounts of data and serve as an ETL pipeline. Temporal activities have limits on the payload size of their parameters and return values. Because of this, I might need to offload results to external storage and to an external processor like Spark. But with that approach, I lose Temporal’s main selling point of replaying on failure. When the data is stored externally, does it even make sense to use Temporal?
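To make question 1 concrete, here is a minimal sketch of what I mean, using the Go SDK’s SignalWithStartWorkflow, which covers both styles in one call: it signals the workflow if it is already running, otherwise it starts it first and then delivers the signal. The workflow name, signal name, and task queue are placeholders:

```go
package trigger

import (
	"context"

	"go.temporal.io/sdk/client"
)

// Deliver signals an already-running workflow for this user flow, or starts
// it first if none is running ("EventWorkflow" is a placeholder name
// registered on a worker elsewhere).
func Deliver(c client.Client, workflowID string, payload []byte) error {
	opts := client.StartWorkflowOptions{
		ID:        workflowID,
		TaskQueue: "dag-events", // assumed task queue
	}
	_, err := c.SignalWithStartWorkflow(
		context.Background(), workflowID, "event", payload, opts, "EventWorkflow")
	return err
}
```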

Thanks in advance

Greetings

My advice:

  • Don’t treat Temporal like a “pipe” for your data; use it as a mechanism for fault-tolerant operations guided by business logic.
  • The state that you ask Temporal to keep for you (what you pass in as input and return as output) should be limited to small pieces of data, e.g. database IDs and the like.
  • Offloading results to external storage is exactly what you should do; Temporal is not a replacement for a persistent store (see the first sketch after this list).
  • Whether Temporal will work for you depends entirely on whether your code can be idempotent, e.g. given the same inputs, can you fetch and store the exact same data you tried to fetch and store before? If the answer is yes, then Temporal will allow you to offload the rather complex problem of ensuring your business logic operates under adverse conditions (the same sketch shows a deterministic storage key keeping a retried write idempotent).
  • On the incoming “event”: as long as you can start a workflow (or activity) before the event is acknowledged, and the event is redelivered when it is not acknowledged, then this can be made to work (see the second sketch after this list).
  • On scaling: though I don’t have direct experience with millions of workflows, it seems to me your problem will largely come down to providing enough processing power and network bandwidth for your load, and that is definitely out of scope of what’s possible to answer here.
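First sketch, illustrating the offloading and idempotency points above: an activity that writes its result to a hypothetical blob store and returns only the key. The `BlobStore` interface and the transform itself are stand-ins; the point is that a deterministic key means a retried activity overwrites the same object, and only a small reference travels through workflow history:

```go
package etl

import (
	"context"
	"fmt"

	"go.temporal.io/sdk/activity"
)

// BlobStore is a stand-in for S3/GCS/etc.
type BlobStore interface {
	Put(ctx context.Context, key string, data []byte) error
}

type Activities struct {
	Store BlobStore
}

// Transform does its work and offloads the result, returning only a
// storage key as the activity result.
func (a *Activities) Transform(ctx context.Context, nodeID, inputKey string) (string, error) {
	info := activity.GetInfo(ctx)
	// Deterministic key: the same run and node always write the same object,
	// so a retry of this activity is an idempotent overwrite.
	key := fmt.Sprintf("%s/%s/%s", info.WorkflowExecution.ID, info.WorkflowExecution.RunID, nodeID)

	data := []byte("...") // real work would read inputKey and transform it
	if err := a.Store.Put(ctx, key, data); err != nil {
		return "", err
	}
	return key, nil // small payload: just the reference
}
```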
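Second sketch, for the acknowledgment point, here with Kafka via segmentio/kafka-go (NATS JetStream would look similar): the offset is committed only after Temporal has accepted the workflow start, and a workflow ID derived from the message deduplicates redeliveries. Workflow and queue names are again placeholders:

```go
package ingest

import (
	"context"
	"errors"
	"fmt"
	"log"

	"github.com/segmentio/kafka-go"
	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/api/serviceerror"
	"go.temporal.io/sdk/client"
)

func Consume(ctx context.Context, c client.Client, r *kafka.Reader) error {
	for {
		msg, err := r.FetchMessage(ctx) // does not commit the offset yet
		if err != nil {
			return err
		}

		opts := client.StartWorkflowOptions{
			// A message-derived ID dedupes redelivered events.
			ID:                    fmt.Sprintf("event-%s-%d-%d", msg.Topic, msg.Partition, msg.Offset),
			TaskQueue:             "dag-events", // assumed task queue
			WorkflowIDReusePolicy: enumspb.WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE,
		}
		_, err = c.ExecuteWorkflow(ctx, opts, "EventWorkflow", msg.Value)
		var started *serviceerror.WorkflowExecutionAlreadyStarted
		if err != nil && !errors.As(err, &started) {
			// Skip the commit so the message is redelivered after a
			// rebalance/restart; an "already started" error means a
			// redelivery we can safely acknowledge.
			log.Printf("start failed, leaving message uncommitted: %v", err)
			continue
		}

		// Only acknowledge once Temporal has durably accepted the start.
		if err := r.CommitMessages(ctx, msg); err != nil {
			return err
		}
	}
}
```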