Best Practices for Managing Throttling Limits

Thank you very much for providing such a fantastic product! I’m just beginning to explore Temporal, and I can already feel that the learning curve is a bit steep.

I have a question regarding the optimal architecture for solving the following task:

Our service sends requests to N (let’s assume 30) different LLM models, and we need to manage these requests so that we don’t exceed each model’s rate limit, which varies per model (e.g., max 50, 100, or 200 requests per minute).

I see two potential approaches:

  1. Create a separate task queue and worker for each LLM model. However, this seems expensive, since we’d have to maintain a large number of independent workers.
  2. Use a single queue for all model types and rely on a retry mechanism for throttled requests. However, this could leave a large number of failed requests waiting for the throttling window to reset.

Could you please provide guidance on the best approach to take in this scenario?

You can run more than one worker per process, so all 30 workers (one per model's task queue) can live in a single process. For that reason I recommend (1). As a bonus, a per-model task queue lets you apply each model's rate limit at the worker level — for example, the Python SDK's `Worker` accepts a `max_task_queue_activities_per_second` option, so each worker can be capped at its model's limit and requests simply wait in the queue instead of failing and retrying.
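Temporal specifics aside, the shape of approach (1) can be sketched in plain asyncio: one process hosts a worker loop per model, each draining its own queue under its own rate limit. Everything here (`MODEL_LIMITS`, `TokenBucket`, `model_worker`) is an illustrative stand-in, not a Temporal API:

```python
import asyncio
import time

# Hypothetical per-model rate limits (requests per minute).
MODEL_LIMITS = {"model-a": 50, "model-b": 100, "model-c": 200}

class TokenBucket:
    """Simple token bucket refilling at `rate_per_min` tokens per minute."""
    def __init__(self, rate_per_min: int):
        self.capacity = rate_per_min
        self.tokens = float(rate_per_min)      # start full
        self.rate = rate_per_min / 60.0        # tokens per second
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Not enough tokens: sleep until roughly one token has refilled.
            await asyncio.sleep((1 - self.tokens) / self.rate)

async def model_worker(name: str, queue: asyncio.Queue,
                       bucket: TokenBucket, results: list) -> None:
    # Each worker drains its own queue, throttled by its own bucket,
    # so a slow/limited model never blocks the others.
    while True:
        request = await queue.get()
        await bucket.acquire()
        results.append((name, request))  # stand-in for the real LLM call
        queue.task_done()

async def main() -> list:
    results: list = []
    queues = {m: asyncio.Queue() for m in MODEL_LIMITS}
    workers = [
        asyncio.create_task(model_worker(m, q, TokenBucket(MODEL_LIMITS[m]), results))
        for m, q in queues.items()
    ]
    # Enqueue a few requests per model.
    for q in queues.values():
        for i in range(3):
            q.put_nowait(f"req-{i}")
    for q in queues.values():
        await q.join()
    for w in workers:
        w.cancel()
    return results
```

In Temporal terms, each `model_worker` corresponds to a `Worker` registered on its own task queue, and the token bucket corresponds to the worker's per-task-queue rate limit; the single process plays the role of hosting all 30 workers.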