I am planning to transition our company’s architecture to Temporal for bullet-proof resiliency. For starters, we will self-host the Temporal server on our side.
Use Case/Architecture:
We have a micro-service (A) that is responsible for ingesting giant CSV files (sometimes up to 10M rows) from several different sources. Currently, this job runs on a cron, checks if new files are available, and then ingests and persists the 10M rows in the DB. We do some validations and checks in between to ensure data integrity, sometimes even by calling external services. Each row parsed from the CSV is pushed to a queue (so 10M+ messages) for another micro-service to consume.
We have 2 types of files:
- Order details
- Shipping details
Micro-service (B) is responsible for consuming these messages and acting on them. The main job of this service is to wait for the data from both sources, which can unfortunately arrive in any order. It then does some external lookups, validates that the data in both sources matches, and updates the status in the DB accordingly.
Here is what I am thinking of my temporal architecture:
Micro-service (A): Every time the cron finds a new file, start a new FileProcessorWorkflow.
The FileProcessorWorkflow does the following (rough sketch after the list):
- Chunk the file into manageably sized chunks inside the workflow
- Call an activity that takes a chunk and runs the validations
- Call an activity that pushes each row in the chunk to an external queue, with all relevant metadata
- Call an activity that updates the status of the file in the DB to processed
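
Here is roughly what I have in mind (Go SDK; the `ChunkRef` type and all activity names are placeholders I made up, not final code):

```go
package processor

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// ChunkRef is a placeholder: a pointer to a row range of the file in blob
// storage, so only small references (not 10M rows of data) flow through
// workflow history.
type ChunkRef struct {
	FileID   string
	StartRow int
	EndRow   int
}

// FileProcessorWorkflow is a rough sketch; activities are stubbed below.
func FileProcessorWorkflow(ctx workflow.Context, fileID string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
		RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 5},
	})

	// Compute chunk boundaries in an activity; the workflow itself only
	// handles ChunkRef values to stay within payload/history limits.
	var chunks []ChunkRef
	if err := workflow.ExecuteActivity(ctx, ChunkFile, fileID).Get(ctx, &chunks); err != nil {
		return err
	}

	for _, chunk := range chunks {
		// Validate the chunk (may call external services).
		if err := workflow.ExecuteActivity(ctx, ValidateChunk, chunk).Get(ctx, nil); err != nil {
			return err
		}
		// Push each row in the chunk to the external queue with metadata.
		if err := workflow.ExecuteActivity(ctx, PublishChunk, chunk).Get(ctx, nil); err != nil {
			return err
		}
	}

	// Mark the file as processed in the DB.
	return workflow.ExecuteActivity(ctx, MarkFileProcessed, fileID).Get(ctx, nil)
}

// Stubs for the hypothetical activities.
func ChunkFile(fileID string) ([]ChunkRef, error) { return nil, nil }
func ValidateChunk(c ChunkRef) error              { return nil }
func PublishChunk(c ChunkRef) error               { return nil }
func MarkFileProcessed(fileID string) error       { return nil }
```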
Micro-service (B): For each unique key (the matching key between the two sources), create a Temporal workflow called DataComparisonWorkflow (rough sketch after the list):
- Create a new workflow the first time a message with that key comes in
- Call an activity to perform any validations as needed, by calling external services
- Wait for the data from the other source(s) to come through
- When the next message comes through with the same key, progress the state of the existing workflow, do the data comparison between the two sources, and update the status in the DB accordingly. In the future we might extend this to 10+ sources and match all of them (not sure which paradigm is best to achieve this)
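
Roughly what I picture for the per-key workflow (again Go SDK; `RowMessage`, the signal name, and the activities are placeholders):

```go
package comparison

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// RowMessage is a placeholder payload: one parsed row plus its source
// ("orders" or "shipping") and the matching key.
type RowMessage struct {
	Key    string
	Source string
	Fields map[string]string
}

// DataComparisonWorkflow handles one matching key. It collects one message
// per source from a signal channel and proceeds once both are present.
func DataComparisonWorkflow(ctx workflow.Context, key string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	bySource := map[string]RowMessage{}
	ch := workflow.GetSignalChannel(ctx, "row-received")

	// Block until both sources have arrived; a real version would add a
	// timeout/escalation path. Raising the target count from 2 to N looks
	// like the obvious extension to 10+ sources.
	for len(bySource) < 2 {
		var msg RowMessage
		ch.Receive(ctx, &msg)
		bySource[msg.Source] = msg

		// Validate each message as it arrives (external lookups etc.).
		if err := workflow.ExecuteActivity(ctx, ValidateRow, msg).Get(ctx, nil); err != nil {
			return err
		}
	}

	// Both sources present: compare them and persist the outcome.
	var matched bool
	err := workflow.ExecuteActivity(ctx, CompareSources,
		bySource["orders"], bySource["shipping"]).Get(ctx, &matched)
	if err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, UpdateMatchStatus, key, matched).Get(ctx, nil)
}

// Stubs for the hypothetical activities.
func ValidateRow(m RowMessage) error                   { return nil }
func CompareSources(a, b RowMessage) (bool, error)     { return true, nil }
func UpdateMatchStatus(key string, matched bool) error { return nil }
```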
Here are some concerns/questions I have:
- Is it good practice to adopt this queue-based approach, which will rapidly create 10M+ workflows per file?
- Is there a recommended way to “wait for the data from the other source to come through”? I read about signals, but I was not sure how they fit into the above architecture (first sketch after this list is my current guess)
- Any suggestions to make the architecture more parallel, so it scales to multiple files coming in at once? Imagine 1000s of files, each with 10M+ rows, getting parsed, transformed, validated, and then pushed to a queue for matching (second sketch below is roughly what I have in mind)
- Is it better to adopt this queue-based approach for Micro-service (B) versus creating a child workflow for each row in the file inside Micro-service (A)? The one concern I have with the queue-based approach is that I could lose the reference to the parent file workflow. If matching fails, I might want to run some sort of ‘compensating functions’ (Micro-service (B) calling Micro-service (A)) to update the state of the file so it gets re-pulled the next day (third sketch below)
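
On the signals question, from my reading the message consumer in Micro-service (B) could use Signal-with-Start, so the first message for a key creates the workflow and every later message just signals it. Something like this (consumer side; names match the sketches above and are all hypothetical):

```go
package consumer

import (
	"context"

	"go.temporal.io/sdk/client"
)

// RowMessage has the same shape as in the workflow sketch above.
type RowMessage struct {
	Key    string
	Source string
	Fields map[string]string
}

// handleMessage is called once per queued row. Signal-with-Start atomically
// starts DataComparisonWorkflow if no workflow with this ID is running, and
// delivers the signal either way; using the matching key as the workflow ID
// gives per-key dedup for free.
func handleMessage(ctx context.Context, c client.Client, msg RowMessage) error {
	_, err := c.SignalWithStartWorkflow(
		ctx,
		"data-comparison-"+msg.Key, // workflow ID
		"row-received",             // signal name (as in the workflow sketch)
		msg,                        // signal payload
		client.StartWorkflowOptions{TaskQueue: "comparison"},
		"DataComparisonWorkflow", // workflow type, referenced by name
		msg.Key,
	)
	return err
}
```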
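
On parallelism, each file is already its own workflow, so files should scale out across workers; within FileProcessorWorkflow I assume the chunk activities could be fanned out as futures and gathered, something like this (same package as the first sketch, reusing `ChunkRef`; bounding of in-flight work left out):

```go
package processor

import (
	"go.temporal.io/sdk/workflow"
)

// processChunksInParallel starts all chunk activities without blocking, then
// gathers the futures. For 10M-row files the fan-out would need a bound
// (a fixed pool of in-flight activities, or child workflows per batch),
// since a single workflow's event history is capped.
func processChunksInParallel(ctx workflow.Context, chunks []ChunkRef) error {
	futures := make([]workflow.Future, 0, len(chunks))
	for _, c := range chunks {
		// "ValidateAndPublishChunk" is a placeholder activity combining the
		// validation and queue-push steps from the list above.
		futures = append(futures, workflow.ExecuteActivity(ctx, "ValidateAndPublishChunk", c))
	}
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```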
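
On keeping the parent reference, I think the queue-based approach doesn’t have to lose it: each queued message could carry the file workflow’s ID in its metadata, and on a match failure Micro-service (B) could signal back, roughly:

```go
package consumer

import (
	"context"

	"go.temporal.io/sdk/client"
)

// reportMatchFailure signals the originating FileProcessorWorkflow (whose ID
// travelled in the message metadata) so it can run compensating activities,
// e.g. flag the file for re-pull the next day. This assumes the file workflow
// is still open and listening for "match-failed"; if it has already completed,
// a separate compensation workflow would have to be started instead.
func reportMatchFailure(ctx context.Context, c client.Client, parentWorkflowID, key string) error {
	// Empty run ID targets the latest run of that workflow ID.
	return c.SignalWorkflow(ctx, parentWorkflowID, "", "match-failed", key)
}
```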