Thank you very much for providing such a fantastic product! I'm just beginning to explore Temporal, and the learning curve feels a bit steep.
I have a question regarding the optimal architecture for solving the following task:
Our service sends requests to N (let's assume 30) different LLM models, and we need to manage these requests so that we never exceed the rate limits, which vary per model (e.g., max 50, 100, or 200 requests per minute).
I see two potential approaches:
- Create a separate task queue and worker for each LLM model. However, this seems expensive, since we would have to maintain a large number of independent workers.
- Use a single queue for all model types and rely on a retry policy. However, this may leave a large number of failed requests waiting for the throttling window to reset.
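For context, the per-model throttling I need could be sketched as a simple in-process token bucket (the model names and limits below are made up for illustration). This is exactly what I don't know how to express idiomatically in Temporal, especially once requests are spread across multiple workers:

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Refills `rate` tokens per second up to `capacity`; one token = one request."""
    capacity: float
    rate: float  # tokens per second
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical per-model limits in requests per minute.
LIMITS_RPM = {"model-a": 50, "model-b": 100, "model-c": 200}
buckets = {m: TokenBucket(capacity=rpm, rate=rpm / 60.0) for m, rpm in LIMITS_RPM.items()}


def can_send(model: str) -> bool:
    """True if a request to `model` may be sent right now."""
    return buckets[model].try_acquire()
```

An in-memory limiter like this obviously breaks down as soon as more than one worker process sends requests to the same model, which is why I'm looking for the right Temporal-native way to model it.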
Could you please provide guidance on the best approach to take in this scenario?