DSL Workflow scalability

Hello everyone,

We are currently evaluating Temporal both for microservices orchestration and to back a Business Process Engine (BPE).

For some extra context, our first priority is the BPE. It all starts with a UI where external parties can define workflows, which we then translate into a DSL and store. The DSL is then interpreted by the workflow engine, which manages the process execution. Although Camunda or Flowable may be fit for the job, we are looking past the immediate need: we would like the same engine to be able to run internal workflows for our developers as well (solving problems such as sagas, scheduling, and recurrence).
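To make the idea concrete, here is a minimal sketch of what a stored DSL definition and a basic validation pass might look like. All field names (`steps`, `activity`, `after`, etc.) are hypothetical, invented for this example; they are not part of any Temporal API.

```python
import json

# Hypothetical DSL document produced from the UI: a named workflow whose
# steps reference activities and declare their dependencies.
invoice_approval = {
    "name": "invoice-approval",
    "steps": [
        {"id": "validate", "activity": "ValidateInvoice"},
        {"id": "approve", "activity": "RequestApproval", "after": ["validate"]},
        {"id": "pay", "activity": "ExecutePayment", "after": ["approve"]},
    ],
}

def validate_definition(definition: dict) -> list:
    """Return a list of problems; an empty list means the definition is usable."""
    problems = []
    ids = {step["id"] for step in definition.get("steps", [])}
    for step in definition.get("steps", []):
        for dep in step.get("after", []):
            if dep not in ids:
                problems.append(
                    "step %r depends on unknown step %r" % (step["id"], dep)
                )
    return problems

# Serialize for storage; the engine would later load and interpret this document.
stored = json.dumps(invoice_approval)
```

Validating at store time keeps obviously broken definitions (dangling dependencies, later perhaps cycles) out of the engine entirely.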

Diving into the BPE example, I am thinking about a model where we have a DSL interpreter workflow which is triggered each time with a different DSL definition (possible related discussion - UI for creating workflows). Each time, we would start a workflow with a different workflow id, but all of them would use the same task list. Some of these workflows could be recurring or long running, so we may have to keep many of them active, each receiving a relatively small number of signals or queries.
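The interpreter idea above can be sketched as a single generic routine that walks a DSL document and dispatches steps in dependency order. In the real system each dispatch would be a Temporal activity invocation inside a workflow; here the steps are plain functions so only the control flow is shown, and all names are made up for the example.

```python
def interpret(definition, activities):
    """Execute each step once its dependencies are done; return the order."""
    done, pending = [], list(definition["steps"])
    while pending:
        # Steps whose declared dependencies have all completed.
        ready = [s for s in pending if all(d in done for d in s.get("after", []))]
        if not ready:
            raise ValueError("cycle or unknown dependency in DSL definition")
        for step in ready:
            # Real version: execute a Temporal activity instead of a function.
            activities[step["activity"]](step)
            done.append(step["id"])
            pending.remove(step)
    return done

definition = {
    "steps": [
        {"id": "validate", "activity": "ValidateInvoice"},
        {"id": "approve", "activity": "RequestApproval", "after": ["validate"]},
        {"id": "pay", "activity": "ExecutePayment", "after": ["approve"]},
    ],
}
log = []
activities = {
    name: (lambda step, n=name: log.append(n))
    for name in ("ValidateInvoice", "RequestApproval", "ExecutePayment")
}
order = interpret(definition, activities)
```

Because the interpreter itself is generic, every customer-defined process is just a new DSL document passed to the same workflow type, started under its own workflow id.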

Taking a step back and looking at scalability, we found a couple of statements in the documentation: Temporal can easily handle 100 million running workflows, each taking in 20 events per second, but it cannot run a single workflow at 100 million events per second. It is still unclear to us where the bottleneck is. Would running the same workflow multiple times have an impact on performance, perhaps because all of its runs share the same task list?
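As a back-of-envelope illustration of that distinction (the figures are the ones quoted above, not measurements of ours): the aggregate rate scales across executions, while each individual execution appends to a single, serialized event history.

```python
# Quoted figures: many executions, modest per-execution event rate.
workflows = 100_000_000
events_per_workflow_per_second = 20

# Aggregate load, which the cluster spreads across shards and hosts.
aggregate_events_per_second = workflows * events_per_workflow_per_second

# Funneling that same total through ONE execution would mean a single
# event history absorbing 2 billion events per second serially -- that
# per-execution history is where the scaling limit sits.
```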

Do you have any specific concerns with this approach, in terms of feasibility or scalability? I am especially concerned about the task list, which would be shared across all these executions. I have seen some work around this in this issue: https://github.com/uber/cadence/issues/2473, but I have not investigated further how it works.

We are currently moving forward with a PoC and would like to avoid as many pitfalls as possible in case we decide to run it in production. To make it more visual, the diagram below shows the high-level architecture we currently have in mind. Your feedback would be highly appreciated.


I don’t see any concerns about scalability assuming that each workflow instance is of limited size.

Task queues are not going to be a bottleneck. They support partitioning and given enough partitions you can saturate the whole cluster using a single task queue for all workflows.
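A rough illustration of the partitioning point (the partition count and hash routing below are invented for this sketch, not Temporal's actual scheme): splitting one logical task queue into partitions spreads its tasks across many matching hosts, so pollers on any partition find work in parallel.

```python
import hashlib

# Illustrative only: pretend one logical task queue has 4 server-side partitions.
NUM_PARTITIONS = 4

def partition_for(task_id: str) -> int:
    """Route a task to a partition by hashing its id (example scheme)."""
    digest = hashlib.sha256(task_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Simulate 10,000 workflow tasks landing on the shared queue.
counts = [0] * NUM_PARTITIONS
for i in range(10_000):
    counts[partition_for("workflow-task-%d" % i)] += 1

# Hashing spreads the load roughly evenly, so no single partition --
# and hence no single host -- has to absorb the whole queue's traffic.
```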