Occasional "Consistent query buffer is full"

We’ve been seeing quite a few “consistent query buffer is full, cannot accept new consistent queries” errors coming from the history service. I noticed there is a similar topic (Often seeing "consistent query buffer is full, cannot accept new consistent queries" - #2 by Wenquan_Xing), but I am unsure if it is related, as we are using the Go SDK.

Server: 1.12.0 (hosted on Kubernetes with PostgreSQL as the database)
Go SDK: 1.10.0 (on Go 1.17)

We have also been seeing an increase in *serviceerror.DeadlineExceeded errors (though not significant in absolute terms), so I am wondering if this is somehow related (or perhaps related to How to best handle mysterious context deadline exceeded/502 errors instead, but I'm unsure if that applies to Kubernetes).

It looks like too many simultaneous queries were sent to the same workflow instance. Do you call query in a loop from multiple processes?

Sort of. I’m following Signalling system - Human driven workflows by signalling, then querying occasionally. But I do have quite a few “human driven workflows”, so there can occasionally be a surge in queries. Is this not the way to go?

(Worth mentioning that I am aware of the “synchronous proxy” workflow pattern, but I only became aware of it after putting quite a few workflows into production, so that migration won’t happen any time soon)

This message indicates that a single workflow instance received many queries. So the surge in queries to multiple workflows shouldn’t be an issue. But the surge to the single workflow might cause it.

Ahh, by workflow instance, do you mean a workflow run? Is there any way to identify which one it is if so?

Also, is there a definition of “many”? Would this be like in the thousands?

Currently, the maximum number of buffered queries is 1.

> Ahh, by workflow instance, do you mean a workflow run?

Per namespace & workflow ID & run ID.

> Also, is there a definition of “many”? Would this be like in the thousands?

Per namespace & workflow ID & run ID, query QPS should be roughly < 10.
What is the reason for the high-frequency queries?


That was just an example; in reality I think I would be doing at most 3 QPS (and in most cases more like 1 QPS per namespace, workflow ID, and run ID), so I’m still unsure how I hit that limit.

Nonetheless, I think this gives me good information to investigate further. Do you know if it is possible to isolate the specific workflow?

Currently there is no detailed logging (workflow ID & run ID) for API calls, since the caller should be able to see the error directly.

Sorry, would this be similar to the “consistent query buffer is full” error? Because I don’t think I see that, but I do see the *serviceerror.DeadlineExceeded error very roughly around the same time.

It seems the existing error type is not entirely correct; I'll create an issue for tracking.

ref: Split resource limit exceed error into user facing & internal facing error types · Issue #1966 · temporalio/temporal · GitHub


Updates for anyone who comes across this:

  • We never really got to the bottom of this
  • We reduced the prevalence of it through a combination of less querying, less signalling, shorter timeouts, and retries
  • We reduced it further by tweaking the visibility store max and idle connections
  • Deploying 1.13.x did not make a noticeable difference
  • Upgrading the SDK to v1.11.x did not make a noticeable difference
  • In the end, what made the biggest difference was upgrading the server to v1.14.1 (pretty much no occurrences since) and the SDK to v1.12.0 – this could be a red herring, but nonetheless, problem solved