Frontend Visibility RPS Limits

I’m curious why the default limits for the Frontend service Visibility APIs are so low.

Based on the code it seems that both:

  • MaxNamespaceVisibilityRPSPerInstance
  • MaxNamespaceVisibilityBurstPerInstance

default to 10.

We are quickly hitting these limits because our code queries the Visibility APIs every 10 seconds for each user of our product. I’ve tested increasing the limits to 1,000 RPS and 20 burst, and that gets us to a good state: we no longer get errors, and I don’t see any impact on server or DB performance. But I’m worried about raising them so drastically from their defaults. Are there unforeseen consequences I may encounter if I continue down this path? Thanks.
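
For what it’s worth, my mental model of an RPS/burst pair is a token bucket that refills at the RPS and never holds more than the burst. A rough Go sketch of that model (my own illustration, not the server’s actual limiter code, and the 100 near-simultaneous refreshes are just an example number) shows why many user sessions refreshing at about the same moment blow past the 10/10 defaults:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// Token-bucket model of an RPS + burst pair: tokens refill at `rps` per
	// second, and the bucket never holds more than `burst` tokens.
	// Illustration only -- not the Temporal server's actual limiter implementation.
	limiter := rate.NewLimiter(rate.Limit(10), 10) // the defaults: 10 RPS, burst 10

	// Simulate many user sessions refreshing at roughly the same instant.
	allowed, rejected := 0, 0
	for i := 0; i < 100; i++ {
		if limiter.Allow() {
			allowed++
		} else {
			rejected++
		}
	}
	// Only about the first `burst` requests get through; the rest are rejected
	// until tokens refill.
	fmt.Printf("allowed=%d rejected=%d\n", allowed, rejected)
}
```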

This API was never designed for such high-rate usage. Please refactor your application. If you describe your use case, we can help develop an alternative.

Understood - what should I be worried about with high-rate usage, though? I didn’t see any issues during my testing.

Our use case is as follows - we have a frontend UI that displays the currently running workflows of a given type on a given task queue. We need the UI to auto-refresh on a short interval so users can monitor the workflows in real time. Is there another way to achieve this without using the Visibility API?
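
For reference, here is roughly what our current polling looks like, sketched with the Go SDK. The workflow type and task queue names are placeholders, and the query assumes ExecutionStatus, WorkflowType, and TaskQueue are all available as search attributes on our server version:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Default options: localhost:7233, "default" namespace.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Placeholder filter; the type and task queue names are illustrative.
	query := `ExecutionStatus = 'Running' AND WorkflowType = 'OrderWorkflow' AND TaskQueue = 'orders'`

	// One of these loops runs per user session, which is where the aggregate
	// request rate against the Visibility API comes from.
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		resp, err := c.ListWorkflow(context.Background(), &workflowservice.ListWorkflowExecutionsRequest{
			Namespace: "default",
			Query:     query,
			PageSize:  100,
		})
		if err != nil {
			// Once the per-instance limit is exceeded, these calls start
			// failing with rate-limit (resource exhausted) errors.
			log.Println("list failed:", err)
			continue
		}
		log.Printf("currently running: %d workflows", len(resp.Executions))
	}
}
```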

Test this with a large number of workflows, and the queries will start consuming enough resources to justify limiting the query rate.

You can query individual workflows more frequently, since those requests go through the core DB. If you need to list workflows frequently, I recommend keeping an external DB and updating it from activities.
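
As a sketch of the external-DB approach (the store interface, activity, and workflow names below are all made up; use whatever database and naming fit your stack): have the workflow report its state through an activity when it starts and when it finishes, and have the UI list “running” rows from that store instead of calling the Visibility API.

```go
package status

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// StatusStore is a stand-in for whatever external DB you choose (Postgres, Redis, ...).
type StatusStore interface {
	Upsert(ctx context.Context, workflowID, state string) error
}

// StatusActivities holds the real DB handle. Register it on a worker with
// worker.RegisterActivity(&StatusActivities{Store: myStore}).
type StatusActivities struct {
	Store StatusStore
}

// ReportStatus upserts the workflow's current state into the external store.
func (a *StatusActivities) ReportStatus(ctx context.Context, workflowID, state string) error {
	return a.Store.Upsert(ctx, workflowID, state)
}

// MonitoredWorkflow mirrors its lifecycle into the external store: one activity
// call on start, one on completion. The UI reads the store, not Visibility.
func MonitoredWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
	})
	id := workflow.GetInfo(ctx).WorkflowExecution.ID

	// "ReportStatus" matches the activity name registered from StatusActivities.
	if err := workflow.ExecuteActivity(ctx, "ReportStatus", id, "running").Get(ctx, nil); err != nil {
		return err
	}

	// ... your actual business logic goes here ...

	return workflow.ExecuteActivity(ctx, "ReportStatus", id, "completed").Get(ctx, nil)
}
```

This keeps the UI read path entirely off the Visibility store, so the refresh interval no longer matters to the cluster.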

When you say “Test this with a large number of workflows”, do you mean a large number of running workflows in the namespace, or a large number of workflows that match the query filter we are searching with? The number of workflows that match our query is fairly small (fewer than 100), but we do have 20,000+ workflows running in the namespace. Also, I did not notice a jump in compute resources when I increased the visibility limits and performance-tested with 100 “users” requesting from the API.

The large number refers to running plus closed workflows that have not yet been deleted through retention.

Can you quantify “large”? We have 14-day retention in place and don’t expect to exceed ~40,000 workflows in that timeframe.

Test it. 40k is not large, but I personally try to avoid solutions that will break if the load on my system increases.

Thanks for the info. I will run some tests to see where the limits are. Would you expect any resource bottlenecks to be solvable with larger CPU/memory requests or additional machines? Or is there something else I would be exhausting?

I don’t know. Since we consider this an anti-pattern, we don’t test for this scenario.