I’m designing a production-ready backend system in Python that will handle millions of tasks per day. Users submit tasks via a FastAPI endpoint. Each task should be persisted in durable storage (I’m thinking Postgres) so that its status can be queried (accepted, processing, done, failed). Tasks should then be executed asynchronously, with the workflow updating the task status in storage once completed.
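For context, the endpoint shape I have in mind looks roughly like this (a minimal sketch; the in-memory dict is a stand-in for the Postgres table, and the asynchronous execution part is elided):

```python
import uuid

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Stand-in for the Postgres task table; replace with real persistence.
TASKS: dict[str, dict] = {}

@app.post("/tasks", status_code=202)
async def submit_task(payload: dict) -> dict:
    task_id = str(uuid.uuid4())
    # Persist as "accepted"; a worker later moves the task through
    # processing -> done/failed asynchronously.
    TASKS[task_id] = {"status": "accepted", "payload": payload}
    return {"task_id": task_id, "status": "accepted"}

@app.get("/tasks/{task_id}")
async def task_status(task_id: str) -> dict:
    task = TASKS.get(task_id)
    if task is None:
        raise HTTPException(status_code=404, detail="unknown task")
    return {"task_id": task_id, "status": task["status"]}
```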
Task workflow examples:
Perform data enrichment via a third-party API
Update a search engine index
Send processed data to an analytics service
Requirements:
Tasks should execute concurrently across different clients, but sequentially for a single client (strict order per client).
Each task should have a maximum number of attempts, after which it either fails permanently or is rescheduled.
Tasks should support timeouts and cancellation.
Some tasks are recurring, running according to a schedule.
The system must be production-ready, scalable, and able to handle millions of tasks per day with observability.
Questions:
What is the best pattern in Temporal for ensuring per-client sequential execution while allowing full concurrency between different clients?
How can I persist task metadata/status in Postgres while letting the workflow update it asynchronously?
How should I handle max attempts, retries, and backoff for high-throughput asynchronous tasks?
What is the recommended approach for timeouts and cancellations of long-running activities in such workflows?
For recurring tasks, is using Temporal Schedules the most robust approach at this scale?
Any architectural guidance for ingestion (Kafka, queues, etc.), scaling workers, and ensuring reliability?
Thanks in advance for any guidance, best practices, or example patterns.
What is the best pattern in Temporal for ensuring per-client sequential execution while allowing full concurrency between different clients?
What is the maximum task enqueue and execution rate per client in tasks per second? What is the maximum possible number of outstanding tasks per client?
How can I persist task metadata/status in Postgres while letting the workflow update it asynchronously?
With Temporal, there is no need for an external DB, as all workflow information is already durable.
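For example, a task can be modeled as a workflow that tracks its own status and exposes it through a query handler; every state transition is persisted by Temporal itself. A minimal sketch with the Python SDK (the workflow and activity names are illustrative):

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class TaskWorkflow:
    def __init__(self) -> None:
        self._status = "accepted"

    @workflow.run
    async def run(self, payload: dict) -> str:
        self._status = "processing"
        # Temporal durably records this state and retries the activity
        # per the policy; no external DB writes are needed for status.
        await workflow.execute_activity(
            "enrich_data",  # activity registered on the worker (assumed name)
            payload,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        self._status = "done"
        return self._status

    @workflow.query
    def status(self) -> str:
        # Readable by external clients at any point in the run.
        return self._status
```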
For recurring tasks, is using Temporal Schedules the most robust approach at this scale?
Millions of tasks per day is a pretty low scale.
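For reference, creating a schedule with the Python SDK looks roughly like this (the ids, payload, and interval are illustrative):

```python
from datetime import timedelta

from temporalio.client import (
    Client,
    Schedule,
    ScheduleActionStartWorkflow,
    ScheduleIntervalSpec,
    ScheduleSpec,
)

async def create_recurring_task(client: Client) -> None:
    # Temporal stores the schedule durably and starts a workflow every hour.
    await client.create_schedule(
        "reindex-hourly",  # illustrative schedule id
        Schedule(
            action=ScheduleActionStartWorkflow(
                "TaskWorkflow",        # assumed workflow name
                {"kind": "reindex"},   # illustrative payload
                id="reindex-task",
                task_queue="tasks",
            ),
            spec=ScheduleSpec(
                intervals=[ScheduleIntervalSpec(every=timedelta(hours=1))]
            ),
        ),
    )
```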
Any architectural guidance for ingestion (Kafka, queues, etc.), scaling workers, and ensuring reliability?
What is the maximum task enqueue and execution rate per client in tasks per second? What is the maximum possible number of outstanding tasks per client?
Our requirement is to support hundreds of clients, with up to tens to hundreds of tasks per second.
Sorry, I’m not sure I got the point of the second question. If you meant whether it’s okay for our clients to wait for a task to be executed: yes, some delay is okay; there is no need for zero latency. To be more specific, we need a service for updating search engine (OpenSearch/Elasticsearch) indices, which is why it’s important for us to apply those changes in order.
With Temporal, there is no need for an external DB, as all workflow information is already durable.
We need some sort of persistence so that our clients can submit a task and then query its status (whether it’s completed or not; that’s how they can make synchronous changes, by waiting for a task to complete and only then submitting the next one).
Our requirement is to support hundreds of clients, with up to tens to hundreds of tasks per second.
Are these hundreds of tasks per second for each client?
We need some sort of persistence so that our clients can submit a task and then query its status (whether it’s completed or not; that’s how they can make synchronous changes, by waiting for a task to complete and only then submitting the next one).
If you model tasks as Temporal workflows, then these requirements don’t need a database. Temporal workflows can be queried and waited for by external clients.
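For example, given a workflow id, an external client can do roughly this with the Python SDK (the host, ids, and query name are illustrative), and the same calls can live behind your own API:

```python
import asyncio

from temporalio.client import Client

async def main() -> None:
    client = await Client.connect("localhost:7233")  # assumed server address
    handle = client.get_workflow_handle("task-123")  # workflow id == task id

    # Point-in-time status without waiting for completion:
    print("current status:", await handle.query("status"))

    # Or block until the task finishes ("wait for a task"):
    print("final result:", await handle.result())

asyncio.run(main())
```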
Are these hundreds of tasks per second for each client?
Yes, it is. This is not our current load, but we’d like to come up with a new solution to replace the current one so that we can handle this load.
If you model tasks as Temporal workflows, then these requirements don’t need a database. Temporal workflows can be queried and waited for by external clients.
Will our clients still be able to do this through our FastAPI (Python web framework) API? They should not know that they are querying the status of a Temporal workflow.
Temporal doesn’t support the following requirement out of the box:
Tasks should execute concurrently across different clients, but sequentially for a single client (strict order per client).
You need something like Redis Streams to queue up tasks per client. Then you can have an activity that listens on that queue and starts Temporal workflows.
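A minimal sketch of that shape, assuming redis-py’s asyncio client and one stream per client (names are illustrative, error handling is omitted, and the consumer is shown as a plain process rather than an activity):

```python
import asyncio

import redis.asyncio as redis
from temporalio.client import Client

async def enqueue(r: redis.Redis, client_id: str, payload: str) -> None:
    # One stream per client keeps that client's tasks in strict order.
    await r.xadd(f"tasks:{client_id}", {"payload": payload})

async def consume(r: redis.Redis, temporal: Client, client_id: str) -> None:
    last_id = "0"
    while True:
        # Block until the next task for this client arrives.
        for _stream, messages in await r.xread(
            {f"tasks:{client_id}": last_id}, count=1, block=5000
        ):
            for msg_id, fields in messages:
                handle = await temporal.start_workflow(
                    "TaskWorkflow",  # assumed workflow name
                    fields["payload"],
                    id=f"{client_id}-{msg_id}",
                    task_queue="tasks",
                )
                # Waiting for the result before reading the next entry
                # gives strict sequential execution per client.
                await handle.result()
                last_id = msg_id

async def main() -> None:
    r = redis.Redis(decode_responses=True)
    temporal = await Client.connect("localhost:7233")  # assumed server address
    # One consumer per client stream; different clients run concurrently.
    await asyncio.gather(
        *(consume(r, temporal, cid) for cid in ("client-a", "client-b"))
    )

asyncio.run(main())
```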