We have several services that expose enqueue and dequeue APIs. Clients interact with a service as follows:
- Client A enqueues a task
- Client B dequeues the task; other clients cannot dequeue it until Client B acks or times out
- Client B acknowledges that the task is complete; the task is removed from the queue
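This happy-path lifecycle can be modeled with a toy in-memory queue (purely illustrative; the class and method names are mine, and a real implementation also needs durability and concurrency control):

```python
import time
import uuid

class InMemoryTaskQueue:
    """Toy model of the enqueue/dequeue/ack lifecycle (single process, no durability)."""

    def __init__(self, ack_timeout_s: float = 30.0):
        self.ack_timeout_s = ack_timeout_s
        self.pending = []   # task IDs waiting to be dequeued, in order
        self.tasks = {}     # task_id -> payload
        self.leases = {}    # task_id -> lease expiry, for tasks being worked on

    def enqueue(self, payload, task_id=None) -> str:
        task_id = task_id or str(uuid.uuid4())
        self.tasks[task_id] = payload
        self.pending.append(task_id)
        return task_id

    def dequeue(self):
        """Hand the oldest pending task to exactly one caller; other callers
        cannot see it while the lease is held."""
        if not self.pending:
            return None
        task_id = self.pending.pop(0)
        self.leases[task_id] = time.monotonic() + self.ack_timeout_s
        return task_id, self.tasks[task_id]

    def ack(self, task_id) -> bool:
        """Client B acknowledges completion; the task is removed for good."""
        if task_id in self.leases:
            del self.leases[task_id]
            del self.tasks[task_id]
            return True
        return False
```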
Implementation requirements:
- Acknowledge timeout: re-enqueue the task if Client B times out
- If a task isn’t polled within some timeout, move it to the head of the queue
- Ensure atomic dequeue (no 2 clients can dequeue the same task)
- Ensure task uniqueness (no 2 enqueues can succeed with the same task ID)
- Get task status (i.e., random access)
- Long (~weeks) durability for tasks
- Regional replication for failover
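To pin down the timeout, uniqueness, and status requirements, here is a single-process reference model (names and the exact timeout semantics are my own interpretation; in particular, I treat an expired lease as going back to the *head* of the queue):

```python
class LeasedQueue:
    """Reference model for the uniqueness / ack-timeout / status requirements."""

    def __init__(self, ack_timeout_s, clock):
        self.ack_timeout_s = ack_timeout_s
        self.clock = clock            # injected clock for deterministic testing
        self.pending = []             # task IDs in dequeue order
        self.payloads = {}            # task_id -> payload
        self.lease_expiry = {}        # task_id -> absolute ack deadline

    def enqueue(self, task_id, payload) -> bool:
        # Task uniqueness: a second enqueue with the same ID fails.
        if task_id in self.payloads:
            return False
        self.payloads[task_id] = payload
        self.pending.append(task_id)
        return True

    def dequeue(self):
        self._reap_expired()
        if not self.pending:
            return None
        task_id = self.pending.pop(0)   # atomic in this single-threaded model
        self.lease_expiry[task_id] = self.clock() + self.ack_timeout_s
        return task_id, self.payloads[task_id]

    def ack(self, task_id) -> bool:
        if task_id not in self.lease_expiry:
            return False
        del self.lease_expiry[task_id]
        del self.payloads[task_id]
        return True

    def status(self, task_id) -> str:
        # Random access to task state by ID.
        if task_id not in self.payloads:
            return "done_or_unknown"
        return "leased" if task_id in self.lease_expiry else "pending"

    def _reap_expired(self):
        # Ack timeout: tasks whose lease expired go back to the head of the queue.
        now = self.clock()
        expired = [t for t, deadline in self.lease_expiry.items() if deadline <= now]
        for t in expired:
            del self.lease_expiry[t]
            self.pending.insert(0, t)
```

Durability and regional replication are deliberately out of scope here; whatever backend we pick has to supply those.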
Performance & scale requirements:
- The service needs to accept dequeue requests from multiple regions (enqueue can be single-region)
- Dequeue latency ~200ms is acceptable
- We expect at most 5000 tasks in the queue, and <1k enqueue and <1k dequeue requests/s
Due to software deprecations, we need to refactor these services’ implementation. These requirements rule out the obvious queue solutions (SQS, Kafka, etc.), and I’m evaluating Temporal as a potential replacement. Note: we’d rather not migrate clients because some are third parties we do not control; in other words, changes must not break the existing APIs.
Proposed design
The enqueue implementation starts a workflow asynchronously per task, returning the workflow ID. This ensures task uniqueness and enables random access via the get-workflow API. The workflow consists of a single activity.
The dequeue implementation polls the activity task list and returns the task data if a task is found.
The acknowledge implementation completes the activity task. Because it is the workflow’s only activity, completing it also completes the workflow.
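To make the mapping concrete, here is a rough sketch of the three endpoints over a stubbed client. The stub only mimics Temporal’s workflow-ID deduplication and the “completing the only activity completes the workflow” behavior; the real SDK calls and signatures differ, and timeouts/retries are omitted:

```python
class FakeTemporalClient:
    """Stand-in for a real Temporal client, for illustration only.
    Rejecting a duplicate workflow ID mimics Temporal's
    'workflow execution already started' error."""

    def __init__(self):
        self.workflows = {}  # workflow_id -> {"payload": ..., "state": ...}

    def start_workflow_async(self, workflow_id, payload):
        if workflow_id in self.workflows:
            raise RuntimeError("workflow already started")  # task uniqueness
        self.workflows[workflow_id] = {"payload": payload, "state": "RUNNING"}
        return workflow_id

    def poll_activity_task(self, task_list):
        # Return the first unstarted activity's payload plus a token, or None.
        for wf_id, wf in self.workflows.items():
            if wf["state"] == "RUNNING":
                wf["state"] = "STARTED"
                return {"token": wf_id, "payload": wf["payload"]}
        return None

    def complete_activity(self, token):
        # The only activity: completing it completes the workflow.
        self.workflows[token]["state"] = "COMPLETED"

# The service endpoints keep the existing queue API; Temporal hides behind them.
client = FakeTemporalClient()

def enqueue(task_id, payload):
    return client.start_workflow_async(task_id, payload)

def dequeue():
    return client.poll_activity_task("tasks")

def acknowledge(token):
    client.complete_activity(token)

def get_status(task_id):
    return client.workflows[task_id]["state"]
```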
Open questions
- Is Temporal a good fit? Is there a better design, considering that clients will not adopt Temporal?
- This design partly re-implements the worker SDK; are there existing SDK features that expose this kind of “synchronous worker”?
- We need to dequeue the first task found across many (~100) potentially sparse task lists. Polling them one-by-one would be very slow. Is there a “multi task list poll” API that polls multiple task lists concurrently but returns at most one task?
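Absent such an API, the workaround I can think of is client-side fan-out: poll every list concurrently and take the first hit. A rough sketch, where `poll_one` is a hypothetical single-list poll returning a task or None:

```python
import concurrent.futures as cf

def poll_many(task_lists, poll_one, timeout_s=0.2):
    """Poll every task list concurrently; return (task_list, task) for the
    first non-empty result, or None if all polls come back empty.

    With ~100 sparse lists, sequential polling would blow the ~200 ms
    dequeue budget; fan-out keeps wall-clock cost close to one poll.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(task_lists))
    futures = {pool.submit(poll_one, tl): tl for tl in task_lists}
    result = None
    try:
        for fut in cf.as_completed(futures, timeout=timeout_s):
            task = fut.result()
            if task is not None:
                result = (futures[fut], task)
                break
    except cf.TimeoutError:
        pass  # polls still pending at the deadline are treated as misses
    # Don't block on stragglers; cancel polls that never started (Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)
    return result
```

The obvious downside is one thread (and one outstanding long-poll) per task list, which is why a server-side multi-list poll would be preferable.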
Thank you