I assume that your question applies to any queuing solution (Kafka, RabbitMQ, etc.), not only SQS.
What are the features of a distributed queue?
- Producers can enqueue tasks
- The queue accumulates (backlogs) tasks if consumers are down or slow
- Deliver each task to a single consumer
- The consumer can report task completion (ack) or failure (nack) back to the queue
- If a task is not acked/nacked within a configured timeout, it is considered nacked.
- Some queues support extending the running task timeout (aka heartbeat)
- Nacked tasks are redelivered, possibly after some backoff interval
- Some queues support Dead Letter Queues (DLQs). Tasks that are nacked too many times are moved to a DLQ.
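To make the semantics above concrete, here is a toy in-memory sketch of them. It is not how SQS, RabbitMQ, or Kafka is implemented; the names `ToyQueue`, `ack_timeout`, and `max_attempts` are made up for illustration, and a real broker also persists and replicates this state and usually applies a backoff before redelivery.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    payload: str
    attempts: int = 0  # how many times the task has been delivered


class ToyQueue:
    """Illustrative only: single process, no persistence, no backoff."""

    def __init__(self, ack_timeout=30.0, max_attempts=3, clock=time.monotonic):
        self.ack_timeout = ack_timeout
        self.max_attempts = max_attempts
        self.clock = clock
        self.ready = deque()     # backlog waiting for a consumer
        self.in_flight = {}      # task_id -> (task, ack deadline)
        self.dead_letters = []   # the DLQ
        self._next_id = 0

    def enqueue(self, payload):
        self._next_id += 1
        self.ready.append(Task(self._next_id, payload))
        return self._next_id

    def receive(self):
        """Hand the next task to exactly one consumer, or return None."""
        self._expire()
        if not self.ready:
            return None
        task = self.ready.popleft()
        task.attempts += 1
        self.in_flight[task.task_id] = (task, self.clock() + self.ack_timeout)
        return task

    def heartbeat(self, task_id):
        """Extend the deadline of a task that is still being worked on."""
        task, _ = self.in_flight[task_id]
        self.in_flight[task_id] = (task, self.clock() + self.ack_timeout)

    def ack(self, task_id):
        # Raises KeyError if the task already timed out and was redelivered.
        self.in_flight.pop(task_id)

    def nack(self, task_id):
        task, _ = self.in_flight.pop(task_id)
        if task.attempts >= self.max_attempts:
            self.dead_letters.append(task)  # nacked too many times
        else:
            self.ready.append(task)         # redeliver later

    def _expire(self):
        # A task that is neither acked nor nacked in time counts as nacked.
        now = self.clock()
        for task_id, (_, deadline) in list(self.in_flight.items()):
            if deadline <= now:
                self.nack(task_id)
```

Note that everything the consumer can tell the queue is "done", "failed", or "still working". That is the whole contract, which is where the limitations below come from.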
What are the common limitations?
- No transactions between the queue and other data stores
- Maximum task execution time is limited even when heartbeating is supported.
- The duration of retries is limited, so it is not possible to retry a task for a few hours, for example.
- Task cancellation is not supported
- Error handling in case of task failures is very primitive. DLQ is the only mechanism.
- Getting the status of a specific task is not supported.
- Tasks are executed with at-least-once semantics, so the same task may run more than once
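Because delivery is at least once and there is no transaction spanning the queue and the database, the consumer has to tolerate duplicates itself. One common way is to record the task ID in the same database transaction as the task's side effect. A minimal sketch using SQLite follows; the table names, `make_db`, and `handle_once` are hypothetical, and the ack to the queue still happens outside this transaction.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a dedup table and a sample account."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_tasks (task_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 0)")
    conn.commit()
    return conn


def handle_once(conn, task_id, amount):
    """Apply the task's effect only the first time task_id is seen.

    Returns True if the effect was applied, False for a duplicate delivery.
    """
    with conn:  # one DB transaction: commit on success, rollback on error
        try:
            conn.execute(
                "INSERT INTO processed_tasks (task_id) VALUES (?)", (task_id,)
            )
        except sqlite3.IntegrityError:
            return False  # already processed; safe to ack again
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = 1", (amount,)
        )
    return True


conn = make_db()
handle_once(conn, "task-42", 100)  # first delivery: balance is credited
handle_once(conn, "task-42", 100)  # redelivery: detected and skipped
```

If the consumer crashes after the commit but before the ack, the queue redelivers the task and the primary-key conflict turns the second run into a no-op. This only helps when the side effect lives in the same database as the dedup record; calls to external services need their own idempotency keys.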
So when are task queues a good fit as an application-level primitive?
- The task is stateless. So its creation doesn’t require an update to some other DB as there are no transactions between the DB and the queue.
- The task is idempotent
- The task is short
- Only a short duration/number of retries is needed
- No error handling besides retrying later from DLQ is needed
- Human intervention is OK to deal with the messages in DLQ
- No need to get a task status
- No need to cancel the task
- The task is fully independent and doesn’t depend on other tasks or cause the execution of other tasks.
My experience tells me that a very narrow set of scenarios fit these limitations. In most cases, tasks are not independent, can execute for a long time, require long retries, require actual error handling, and benefit from transactionality between the database and the queue.