Occasional "Consistent query buffer is full"

We’ve been seeing quite a few “consistent query buffer is full, cannot accept new consistent queries” errors coming from the history service. I noticed there is a similar topic (Often seeing "consistent query buffer is full, cannot accept new consistent queries" - #2 by Wenquan_Xing), but I am unsure whether it is related, as we are using the Go SDK.

Server: 1.12.0 (hosted on Kubernetes with PostgreSQL as the database)
Go SDK: 1.10.0 (on Go 1.17)

We have also been seeing an increase in *serviceerror.DeadlineExceeded errors (though not significant in absolute terms), so I am wondering if the two are somehow related (or perhaps it is related to “How to best handle mysterious context deadline exceeded/502 errors” instead, though I am unsure whether that applies to Kubernetes).

It looks like too many simultaneous queries were sent to the same workflow instance. Do you call query in a loop from multiple processes?

Sort of. I’m following the approach in Signalling system - Human driven workflows: signalling, then querying occasionally. But I do have quite a few “human driven workflows”, so there can occasionally be a surge in queries. Is this not the way to go?
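For context, the client side looks roughly like this (the workflow ID, signal name, and query name are placeholders for my real ones):

package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	// Connect to the Temporal frontend (default options for brevity).
	c, err := client.NewClient(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// A human action arrives: signal the long-running workflow...
	if err := c.SignalWorkflow(ctx, "order-123", "", "human-decision", "approved"); err != nil {
		log.Fatalln("signal failed:", err)
	}

	// ...and occasionally query it afterwards to read the updated state.
	resp, err := c.QueryWorkflow(ctx, "order-123", "", "current-state")
	if err != nil {
		log.Fatalln("query failed:", err)
	}

	var state string
	if err := resp.Get(&state); err != nil {
		log.Fatalln("failed to decode query result:", err)
	}
	log.Println("current state:", state)
}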

(Worth mentioning that I am aware of the “synchronous proxy” workflow pattern, but I only became aware of it after putting quite a few workflows into production, so that migration won’t happen any time soon)

This message indicates that a single workflow instance received many queries. A surge in queries spread across multiple workflows shouldn’t be an issue, but a surge against a single workflow might cause it.

Ahh, by workflow instance, do you mean a workflow run? If so, is there any way to identify which one it is?

Also, is there a definition of “many”? Would this be like in the thousands?

Currently, the maximum number of buffered queries allowed is 1 (the default of the history.MaxBufferedQueryCount dynamic config value).
Ahh, by workflow instance, do you mean a workflow run?

per namespace & workflow ID & run ID


Also, is there a definition of “many”? Would this be like in the thousands?

Per namespace & workflow ID & run ID, the query QPS should be roughly < 10.
What is the reason for the high-frequency queries?


I was just using an example; in reality, I think I would be doing at most 3 QPS (and in most cases it’d be closer to 1 QPS per namespace, workflow ID, and run ID), so I’m still unsure how I hit that limit.

Nonetheless, I think this gives me good information to investigate further. Do you know if it is possible to isolate the specific workflow?

Currently there is no detailed logging (workflow ID & run ID) for API calls, since the caller should be able to see the error directly.

Sorry, would this be similar to the “consistent query buffer is full” error? I don’t think I see that on the caller side, but I do see the *serviceerror.DeadlineExceeded error at very roughly the same time.
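In case it helps anyone else, this is roughly what I added on the caller side so the error is logged together with the workflow ID and run ID (the helper name and query name are just my own sketch):

package temporalclient

import (
	"context"
	"errors"
	"log"

	"go.temporal.io/api/serviceerror"
	"go.temporal.io/sdk/client"
)

// queryWithLogging runs a query and logs the workflow ID and run ID together
// with the class of error returned, so a problematic run can be isolated
// from the client-side logs.
func queryWithLogging(ctx context.Context, c client.Client, workflowID, runID string) error {
	_, err := c.QueryWorkflow(ctx, workflowID, runID, "current-state")
	if err == nil {
		return nil
	}

	var deadline *serviceerror.DeadlineExceeded
	var exhausted *serviceerror.ResourceExhausted
	switch {
	case errors.As(err, &deadline):
		log.Printf("query deadline exceeded: workflow_id=%s run_id=%s: %v", workflowID, runID, err)
	case errors.As(err, &exhausted):
		log.Printf("query hit a resource limit: workflow_id=%s run_id=%s: %v", workflowID, runID, err)
	default:
		log.Printf("query failed: workflow_id=%s run_id=%s: %v", workflowID, runID, err)
	}
	return err
}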

It seems the existing error type is not entirely correct; I created an issue for tracking.

ref: Split resource limit exceed error into user facing & internal facing error types · Issue #1966 · temporalio/temporal · GitHub


Updates for anyone who comes across this:

  • We never really got to the bottom of this
  • We reduced its prevalence through a combination of less querying, less signalling, shorter timeouts, and retries (see the sketch after this list)
  • We reduced it further by tweaking the visibility store’s max and idle connections
  • Deploying 1.13.x did not make a noticeable difference
  • Upgrading the SDK to v1.11.x did not make a noticeable difference either
  • In the end, what made the biggest difference was upgrading the server to v1.14.1 and the SDK to v1.12.0 (pretty much no occurrences since). That could be a red herring, but either way, problem solved
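For the “shorter timeouts and retries” bullet, the wrapper we put around QueryWorkflow was roughly shaped like this (the attempt count, per-attempt timeout, and backoff are illustrative, not recommendations):

package temporalclient

import (
	"context"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/converter"
)

// queryWithRetry is a sketch of the wrapper we used: each attempt gets a
// short timeout, and failed attempts are retried a few times with a fixed
// backoff before giving up.
func queryWithRetry(ctx context.Context, c client.Client, workflowID, queryType string) (converter.EncodedValue, error) {
	const (
		attempts   = 3
		perAttempt = 2 * time.Second
		backoff    = 500 * time.Millisecond
	)

	var lastErr error
	for i := 0; i < attempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		resp, err := c.QueryWorkflow(attemptCtx, workflowID, "", queryType)
		cancel()
		if err == nil {
			return resp, nil
		}
		lastErr = err
		time.Sleep(backoff)
	}
	return nil, lastErr
}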

We are seeing this error intermittently as well. We are on 1.19.1 and have set the following in our dynamic config:

history.MaxBufferedQueryCount:
- value: 4
  constraints: {}

How do we know what to set this value to, and whether any other settings also need adjusting? Worth noting, all the services are running as separate containers and have plenty of spare CPU/memory.

There shouldn’t be any risk in setting it to 5, which should cover most use cases imo.

hmm, tried it at 5 and still seeing these (in frontend & history services). :thinking:

Do you make a lot of queries to the same workflow instance?

Yes, we may have a handful of nodes polling a single workflow ID at the same time.

You can try increasing the buffer or just ignoring the intermittent errors.

I would reevaluate your design to avoid such frequent queries. In the future, we plan to add the ability to wait for state changes using a long poll to improve the experience in such scenarios.

OK, thanks! I came across a similar issue here (Best practices for providing UI updates on workflow status?) where Redis streams are proposed for reporting details back to a UI (rather than having multiple nodes long-polling Temporal). Does this seem like a reasonable short-term solution to the problem we are running into?
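Roughly what I have in mind, as a sketch (the stream name, payload fields, and go-redis usage are my own assumptions): the workflow reports its state through an activity that appends to a Redis stream, and the UI backend tails that stream instead of every node querying Temporal.

package activities

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// StatusUpdate is a hypothetical payload describing a workflow's current state.
type StatusUpdate struct {
	WorkflowID string
	State      string
}

// Activities holds the dependencies for the status-publishing activity.
type Activities struct {
	RDB *redis.Client
}

// PublishStatus appends the latest workflow state to a Redis stream that the
// UI backend consumes, so the UI no longer has to query the workflow directly.
func (a *Activities) PublishStatus(ctx context.Context, update StatusUpdate) error {
	return a.RDB.XAdd(ctx, &redis.XAddArgs{
		Stream: "workflow-status", // assumed stream name
		Values: map[string]interface{}{
			"workflow_id": update.WorkflowID,
			"state":       update.State,
		},
	}).Err()
}

The workflow would call this activity whenever its state changes, so the UI side only ever reads from Redis.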

This sounds like a sensible workaround.