Working with queues

Hi all.
We are designing an asynchronous system in which incoming requests are processed by a workflow that receives them as signals. Each client is served at a certain rate, which depends on cluster capacity and the queue-consumption settings. There is at least one dedicated queue per client.
But there can be A LOT of requests, more than a client is allowed to process. Our first thought was to put a broker (Kafka) in front of the cluster as pressure compensation. But if you remember what a signal is, essentially a message to a Temporal queue just like a message to a Kafka topic, you can conclude that an intermediate broker may be redundant. However, Kafka is battle-tested and can withstand heavy loads with relatively modest resources. What will happen to Temporal if it has a lot of queues? Is there any way to manage unprocessed messages in queues (observe them, delete them, etc.)?
For example, can we get a metric on the number of pending messages before sending a signal?
My respect for your work!

Each client is served at a certain rate, which depends on cluster capacity and the queue-consumption settings.

The Temporal service also has rate limits you can set for the frontend / history / matching services, across all namespaces, via dynamic config:
frontend.rps
history.rps
matching.rps

And per namespace per frontend role:
frontend.namespaceRPS
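
For illustration, a dynamic config file setting these limits might look like this (all the values and the namespace name below are assumptions, not recommendations):

```yaml
frontend.rps:
  - value: 2400          # total frontend RPS across all namespaces
history.rps:
  - value: 3000
matching.rps:
  - value: 1200
frontend.namespaceRPS:
  - value: 300           # per-namespace limit on each frontend
    constraints:
      namespace: "client-a"
```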

What will happen to Temporal if it has a lot of queues?

There is no limit on the number of task queues, by design.

can we get a metric on the number of pending messages before sending a signal?

I think you might be asking for a task queue backlog count, which is not currently supported.
For task queue monitoring, the best approach is to watch SDK metrics, especially workflow_task_schedule_to_start_latency and activity_schedule_to_start_latency.
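
For reference, here is a minimal sketch of wiring those SDK metrics to Prometheus via the Go SDK's tally integration (the listen address is an assumption; a growing schedule-to-start latency then indicates a growing backlog):

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

func main() {
	// Serve a /metrics endpoint for Prometheus to scrape.
	reporter, err := prometheus.Configuration{ListenAddress: "0.0.0.0:9090"}.NewReporter(
		prometheus.ConfigurationOptions{Registry: prom.NewRegistry()},
	)
	if err != nil {
		log.Fatalln("unable to create Prometheus reporter", err)
	}
	scope, closer := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	defer closer.Close()

	// Workers built from this client report the schedule-to-start latency
	// timers named above, tagged with their task queue.
	c, err := client.Dial(client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(sdktally.NewPrometheusNamingScope(scope)),
	})
	if err != nil {
		log.Fatalln("unable to connect to Temporal", err)
	}
	defer c.Close()
}
```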

incoming requests are processed by a workflow that receives them as signals

Can you provide more info on your particular use case, please? It would help us give a better answer about best practices with Temporal.

What is the max rate of these signals per workflow instance for your use case?

Sorry for my English. I will try to describe the case more precisely.
Let’s imagine that we have 1k clients. Each client has from 1 to 1k processes. Each client has its own queue so as not to create races between clients, but races between a single client’s processes are acceptable.
As far as I understand, in such a scenario we would need a dedicated worker for each client?

If we expand the description further, our task is to take all the Signal-With-Start requests of each client and process them gradually, as fast as the cluster allows. This requires 2 queues per client, which means at least 2 workers. See picture.
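
For concreteness, the Signal-With-Start call per client could look roughly like this in the Go SDK (the workflow, signal, and task-queue names are illustrative assumptions):

```go
package example

import (
	"context"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
)

// Request and ProcessRequestsWorkflow are illustrative stubs; the real
// workflow would receive and process "request" signals for one client.
type Request struct{ Payload string }

func ProcessRequestsWorkflow(ctx workflow.Context) error { return nil }

func submitRequest(ctx context.Context, c client.Client, clientID string, req Request) error {
	_, err := c.SignalWithStartWorkflow(
		ctx,
		"client-"+clientID, // workflow ID: one processing workflow per client
		"request",          // signal name
		req,                // signal payload
		client.StartWorkflowOptions{
			TaskQueue: "client-" + clientID, // dedicated task queue per client
		},
		ProcessRequestsWorkflow, // started only if not already running
	)
	return err
}
```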

What is the maximum rate of task creation per client?

It is difficult to say for sure; the current estimate is up to 300-500 RPS at peak.
And there may be more than 1k clients over time.

Maybe I got carried away :). The total RPS is expected to be around 15k.
Here is the question: a client can produce 100-300 RPS for a short time; we must accept those requests and put them in the queue for processing. We can then process them at a lower rate.

You can have a worker with a corresponding task queue per client. The complexity will be in managing the resources across all these workers.
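
A minimal sketch of such a per-client worker in the Go SDK, with a task-queue-wide throttle so bursts are accepted but drained at a sustainable rate (the rate values and registered functions are illustrative assumptions):

```go
package example

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// Stubs standing in for the real per-client workflow and activity.
func ProcessRequestsWorkflow(ctx workflow.Context) error       { return nil }
func HandleRequest(ctx context.Context, req interface{}) error { return nil }

func startClientWorker(c client.Client, clientID string) worker.Worker {
	w := worker.New(c, "client-"+clientID, worker.Options{
		// Throttle activities across ALL workers polling this task queue,
		// so bursts are accepted into the queue but drained at a steady rate.
		TaskQueueActivitiesPerSecond: 50,
		// Bound parallelism within this worker process.
		MaxConcurrentActivityExecutionSize: 10,
	})
	w.RegisterWorkflow(ProcessRequestsWorkflow)
	w.RegisterActivity(HandleRequest)
	if err := w.Start(); err != nil {
		log.Fatalln("unable to start worker", err)
	}
	return w
}
```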

The need to provide fair load distribution across many clients in the same namespace is a pretty common use case, so we will certainly try to come up with a general solution in the future.

Just to put the other responses in perspective, however: we would not consider the request rates you mention (if I understood correctly, 15k RPS total and 500 RPS per client) “high”. You can easily achieve this kind of throughput (even at steady state) without any intermediate queuing (such as Kafka).

We have a similar use case in our system where we’re receiving a large number of requests (millions) that need to be processed asynchronously by a Temporal workflow. Our initial thought was to abandon Redis streams and rely on Temporal task queues alone, but we were worried about the limits on Event History size and Event count for long-running workflows in Temporal.

After some research, we’ve come up with a few solutions that we’re considering:

  • Using a dedicated worker that consumes messages from the Redis stream and starts a new workflow instance for each message (a sketch follows this list). This worker would need to be configured with the appropriate Redis connection information and be able to handle the rate of incoming messages. We could also implement a rate limiter in the worker to prevent it from becoming overwhelmed.
  • Using a separate process that consumes messages from the Redis stream and enqueues them into a Temporal task queue. This process could be rate-limited to ensure it doesn’t enqueue messages faster than the Temporal worker can consume them.
  • Using the Continue-As-New feature to close the current Workflow Execution and create a new Workflow Execution in a single atomic operation.
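
A hedged sketch of the first option, assuming go-redis and a client-side token-bucket limiter from golang.org/x/time/rate (the stream, group, and task-queue names, and the 100/sec cap, are all illustrative):

```go
package example

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
	"golang.org/x/time/rate"
)

// ProcessRequestWorkflow is a stub standing in for the real per-message workflow.
func ProcessRequestWorkflow(ctx workflow.Context, fields map[string]interface{}) error {
	return nil
}

func consume(ctx context.Context, rdb *redis.Client, tc client.Client) error {
	limiter := rate.NewLimiter(rate.Limit(100), 100) // cap workflow starts at ~100/sec
	for {
		// Read a batch of new entries for this consumer group.
		streams, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
			Group:    "temporal-bridge",
			Consumer: "consumer-1",
			Streams:  []string{"requests", ">"},
			Count:    10,
		}).Result()
		if err != nil {
			return err
		}
		for _, s := range streams {
			for _, msg := range s.Messages {
				if err := limiter.Wait(ctx); err != nil {
					return err
				}
				// Reusing the stream entry ID makes the start idempotent on redelivery.
				_, err := tc.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
					ID:        "request-" + msg.ID,
					TaskQueue: "requests",
				}, ProcessRequestWorkflow, msg.Values)
				if err != nil {
					log.Println("start failed, will retry on redelivery:", err)
					continue
				}
				rdb.XAck(ctx, "requests", "temporal-bridge", msg.ID)
			}
		}
	}
}
```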

We’re still in the process of evaluating these solutions and testing their performance, and we’re open to any other suggestions or insights on how to handle this use case. We’re particularly interested in hearing about other people’s experience with using Redis streams in conjunction with Temporal.

Edit: it is a slow queue. The queue fills up, and it can then take months before we clear it. We can’t really afford to process it faster (we have a few hundred API clients running concurrently at most) and don’t need to go faster than this.

I have spent most of the weekend reading all the documentation on temporal.io and watching a lot of the videos on the Temporal YouTube channel. Thank you for this.

I think I was a bit confused with my understanding of limits.

Please correct me if I am wrong, but there aren’t any limits on the number of workflows you can have, right? So even if I have millions of requests, I can just treat them as separate workflows.

You are correct. There is no limit on the number of open workflows. The limits are on how many actions per second you can run and on how large a single workflow can get.
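
Since the per-workflow history size is the binding limit, the usual remedy is the Continue-As-New pattern mentioned earlier. A minimal sketch in the Go SDK (the batch size, signal name, and activity are illustrative assumptions):

```go
package example

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

const batchSize = 1000 // illustrative cap on history growth per run

// HandleRequest is a stub standing in for the real activity.
func HandleRequest(ctx context.Context, req map[string]interface{}) error { return nil }

func ProcessSignalsWorkflow(ctx workflow.Context, processed int) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})
	ch := workflow.GetSignalChannel(ctx, "request")
	for i := 0; i < batchSize; i++ {
		var req map[string]interface{}
		ch.Receive(ctx, &req)
		if err := workflow.ExecuteActivity(ctx, HandleRequest, req).Get(ctx, nil); err != nil {
			workflow.GetLogger(ctx).Error("request failed", "error", err)
		}
		processed++
	}
	// A production version would also drain any buffered signals here
	// (ReceiveAsync) before continuing as new, so none are lost.
	return workflow.NewContinueAsNewError(ctx, ProcessSignalsWorkflow, processed)
}
```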


I am curious what you all decided here? We are in a similar boat: we want to keep using SQS queues as the backing queue, but also want to be able to leverage the performance knobs of Temporal workers.

SQS queues don’t provide enough features to be used as Temporal task queues.