The company I work with is exploring switching some of our processes to be more workflow-driven.
Currently we use choreography and event-driven design to analyze our data (provided by clients).
The analysis code runs on Lambda and is triggered by SNS and SQS messages when said data is uploaded to our storage (S3). Some of our Lambdas are in Python, others in TypeScript.
A couple of challenges we run into:
Fan-in: with this choreography, we need the ability to wait for N pieces of data to arrive before running the processing.
Observability: as the system grows, it becomes hard to see how many downstream processes a single event will trigger.
Replayability: we want to be able to retry/replay workflows if they fail for some reason.
Versioning: currently, updating a workflow affects all in-flight messages. We want the option to leave currently running workflows on the old definition.
Products we explored:
Camunda
AWS Step Functions
Apache Airflow (we like this less since it’s Python)
We were thinking of doing some exploratory work, so I wanted to get some opinions on whether Temporal is a good choice for our problem. From the docs I’ve read, Temporal seems to lean towards a more synchronous style of workflow execution, which would be quite the lift for us to move all of our analytics to.
From the docs I’ve read, Temporal seems to lean towards a more synchronous style of workflow execution, which would be quite the lift for us to move all of our analytics to.
I think this is a wrong impression. Temporal is inherently asynchronous and well suited for event-driven use cases.
Temporal scales out with the number of workflow executions; a single workflow execution has limited throughput. Does your use case imply a large number of relatively low-traffic entities?
What is your maximum rate of events per second, per entity and aggregated across all entities?
Yeah, I read it wrong; that’s my bad. It’s good to hear that we can run async workflows on Temporal.
In what ways do you mean that a single workflow has limited throughput?
To give an example from our case: we have entities X, Y, and Z.
A single X entity relates to many Y entities, and each Y entity relates to a small number (<5) of Z entities. Z entities hold the data we receive from clients.
We then process the data bottom-up through this tree: first the Z entities, then Y (which aggregates the Z data), then X (which aggregates the Y data).
We have to wait for the Z analytics to complete before the Y analysis starts, and similarly before the X analysis starts (hence the fan-in requirement stated above).
Currently, we expect less than 1 event/second per entity (this will probably change as we expand the workflow). We also only process a small number (20-50) of Y entities per day, so our workload isn’t big by any means.
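To make the fan-in concrete, the shape we have in mind is roughly the following, sketched with Temporal’s TypeScript SDK (the workflow, activity, and signal names are placeholders I made up):

```ts
// workflows.ts — a sketch only; entity, activity, and signal names are made up
import { proxyActivities, defineSignal, setHandler, condition } from '@temporalio/workflow';
import type * as activities from './activities';

const { analyzeZ, aggregateY } = proxyActivities<typeof activities>({
  startToCloseTimeout: '10 minutes',
});

// Sent (e.g. by the Lambda that handles the S3 upload notification) when a Z entity's data lands
export const zDataUploaded = defineSignal<[string]>('zDataUploaded');

// One workflow execution per Y entity
export async function yEntityWorkflow(yId: string, expectedZIds: string[]): Promise<void> {
  const received = new Set<string>();
  setHandler(zDataUploaded, (zId: string) => {
    received.add(zId);
  });

  // Fan-in: wait until every expected Z entity has reported its data
  await condition(() => expectedZIds.every((id) => received.has(id)));

  // Analyze each Z, then aggregate the results into Y
  const zResults = await Promise.all(expectedZIds.map((id) => analyzeZ(id)));
  await aggregateY(yId, zResults);
}
```

An X-level workflow could apply the same pattern over its Y entities.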
Since we run async Lambdas as our analytics engines, is it recommended for a workflow activity to send a request to the async Lambda and then have the workflow wait for a signal that the async process has finished? I’m trying to figure out how to integrate Temporal, and I’m not sure if this is already a popular pattern with Temporal.
Yes, an activity that submits the request plus a signal for completion is a good pattern in this case. It lets you retry the submit request quickly in case of failures, while the workflow still waits as long as needed for the completion.
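A minimal sketch of that pattern with the TypeScript SDK (the activity, signal, and field names are only illustrative):

```ts
// workflows.ts — a sketch only; activity, signal, and field names are made up
import { proxyActivities, defineSignal, setHandler, condition } from '@temporalio/workflow';
import type * as activities from './activities';

// The activity only *submits* the job (e.g. invokes the Lambda asynchronously),
// so it is short and can be retried aggressively, independent of how long the analysis runs.
const { submitAnalysisJob } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
  retry: { maximumAttempts: 5 },
});

// Sent by the Lambda (or whatever it triggers on completion) when the analysis is done
export const analysisFinished = defineSignal<[{ resultUri: string }]>('analysisFinished');

export async function analysisWorkflow(entityId: string): Promise<string> {
  let result: { resultUri: string } | undefined;
  setHandler(analysisFinished, (r) => {
    result = r;
  });

  // Fast, retriable submit...
  await submitAnalysisJob(entityId);

  // ...then wait as long as needed (here capped at 7 days) for the async process to report back
  const completed = await condition(() => result !== undefined, '7 days');
  if (!completed) {
    throw new Error(`analysis for ${entityId} did not complete in time`);
  }
  return result!.resultUri;
}
```

On the completion side, the finished Lambda (or a handler it triggers) signals the workflow by its workflow ID with the Temporal client, e.g. `await client.workflow.getHandle(workflowId).signal(analysisFinished, { resultUri })`.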