Implementing a Rate Limiter Service for Workflow Activities

Hey Temporal community,

I’m currently working on a workflow that involves performing HTTP requests to multiple websites, each with its own rate limit. I’m facing a challenge in implementing a rate limiter service to ensure that the requests are made within the allowed limits, even in the presence of retries. Here’s a simplified representation of my workflow code:

workflow.defn:
  activity1()
  activity2()
  send_http_request_to_site_activity('example.com')
  activity_foo()
  activity_bar()

During the execution of the send_http_request_to_site_activity activity, I need to check with the rate limiter service if it’s okay to proceed or if I should wait. However, I want to avoid blocking the worker since there could be a large number of requests in the execution queue. Additionally, I need to ensure that the rate limiter is informed when the request is completed, so it can allow the next HTTP request to be executed in turn.

I have more than 20,000 different websites to send HTTP requests to, each with its own rate limit. (Also, I'm using the Python SDK.)

I’ve attempted three approaches, but encountered some issues:

  1. Using Asynchronous Activity Completion: In this approach, I modified the workflow code as follows:

    workflow.defn:
      activity1()
      activity2()
      wait_for_rate_limiter_access('example.com')  # This activity uses Asynchronous Activity Completion and waits until the rate limiter service allows access to continue.
      send_http_request_to_site_activity('example.com')
      notify_rate_limiter_request_completion('example.com')  # Notifies the rate limiter that the request is completed.
      activity_foo()
      activity_bar()
    

    The problem with this approach is that during retries, if send_http_request_to_site_activity fails, the wait_for_rate_limiter_access activity will not run again before the subsequent attempts. Additionally, if send_http_request_to_site_activity fails, the notify_rate_limiter_request_completion activity will never be executed.

  2. Using Signals: I explored using signals for the rate limiter interaction, but I ran into the same issues as in the first approach.

  3. Using an HTTP Proxy: As an alternative, I also tried routing requests through an HTTP proxy that blocks them according to each domain's rate limit. However, this approach introduced new challenges. The proxy effectively blocked the worker, making it difficult for the rate limiter service to efficiently handle a large number of pending workflows (e.g., more than 80,000). For example, I can't send more than 1 request per second to any single domain, but I can send more than 1,000 requests per second across different domains. By blocking the worker, the proxy server (in other words, the rate limiter service) only ever sees as many enqueued requests as there are worker processes: with 100 processes, only 100 requests reach the rate limiter, and if all of them target a single domain we face a long blocking time, since the rate limiter has no visibility into the waiting requests for other domains. Additionally, since the proxy centralizes all outgoing HTTP requests on a single server, bandwidth limits and distribution across multiple worker servers became problematic.

I would greatly appreciate your insights and suggestions on how to overcome these challenges and effectively implement a rate limiter service for my workflow activities.

Of the options given, I think option 1 makes the most sense.

But you've already reserved your slot for this call via wait_for_rate_limiter_access. If you don't want to keep that slot reserved during retry backoff, you need custom retry logic, which means you can't really use Temporal's built-in retry: set maximum_attempts to 1 in the call's retry_policy and handle retries manually, doing whatever you need between attempts. Note that manual retry via a loop means each activity call adds more history (not the case with the less flexible built-in Temporal retry).
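A minimal sketch of that manual-retry shape, using plain asyncio rather than the SDK so the control flow is visible. Here acquire, send, and release are hypothetical stand-ins for the wait_for_rate_limiter_access, send_http_request_to_site_activity, and notify_rate_limiter_request_completion activities; in a real workflow each would be an execute_activity call with the built-in retry disabled:

```python
import asyncio

async def send_with_manual_retry(acquire, send, release, site, max_attempts=3):
    # Hypothetical sketch: re-acquire the rate-limiter slot before EVERY
    # attempt, and free it after each attempt, success or failure. In a real
    # workflow, Temporal's retry would be disabled (maximum_attempts=1) and
    # this loop would drive the retries instead.
    last_error = None
    for _ in range(max_attempts):
        await acquire(site)          # re-acquire the slot for this attempt
        try:
            return await send(site)
        except Exception as err:     # each failed attempt adds workflow history
            last_error = err
        finally:
            await release(site)      # free the slot whether or not send worked
    raise last_error
```

Because the acquire happens inside the loop, the slot is not held across backoff, which is exactly the behavior the built-in retry can't give you.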

You can put notify_rate_limiter_request_completion in a finally block. Just make sure you don't swallow the outer exception by raising from inside the finally. Also be aware that if a workflow is terminated, the workflow code stops running, so notify_rate_limiter_request_completion may never be called even after wait_for_rate_limiter_access was. Termination only occurs via a manual call or a workflow timeout (neither is set by default), so it's avoidable.

How long do you expect wait_for_rate_limiter_access to reasonably take? Putting the rate limiting near where the actual execution happens (i.e., at the top of the activity) has value. What do you mean by "blocking the worker"? The worker only blocks once max_concurrent_activities is reached, and having thousands of yielded async def activities is only a memory concern, so just set that number as high as the backlog you're willing to hold. Granted, you will want to heartbeat while waiting, which can have a cloud cost. You could even have a task queue per thing you're rate limiting on (domains, in your case) and set max concurrent only as high as the likely rate limit. Granted, max-concurrent is per worker and you may run many workers for high availability. But still, keeping max concurrent (the number of in-flight activities) near or not too far above your rate limit should be fine.
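The task-queue-per-domain idea could be sketched with the Python SDK roughly like this. The client, the activity function, and the queue naming scheme are assumptions, and with 20,000+ domains you would realistically group domains or create workers on demand rather than one per domain:

    from temporalio.worker import Worker

    # Hypothetical sketch: one task queue per rate-limited domain, with the
    # worker's concurrency cap kept near that domain's rate limit so blocked
    # activities never pile up far beyond what the limiter could admit.
    def make_domain_worker(client, domain: str, requests_per_second: int) -> Worker:
        return Worker(
            client,
            task_queue=f"http-{domain}",  # assumed naming scheme
            activities=[send_http_request_to_site_activity],
            max_concurrent_activities=max(1, requests_per_second),
        )

The workflow would then route each send_http_request_to_site_activity call to the matching per-domain queue via the task_queue argument of execute_activity.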

But option 1 also makes sense for slower rate limits.
