Occasional "Consistent query buffer is full"

mrsaints · September 21, 2021, 10:39am

We’ve been seeing quite a few “consistent query buffer is full, cannot accept new consistent queries” coming from the history service. I noticed there is a similar topic (Often seeing "consistent query buffer is full, cannot accept new consistent queries" - #2 by Wenquan_Xing), but I am unsure if it is related as we are using the Go SDK.

Server: 1.12.0 (hosted on Kubernetes with Postgresql as database)
Go SDK: 1.10.0 (on Go 1.17)

We have also been seeing an increase in *serviceerror.DeadlineExceeded (though not significant in absolute terms), so I am wondering if this is somehow related (or perhaps related to How to best handle mysterious context deadline exceeded/502 errors instead, but unsure if that applies to Kubernetes).

maxim · September 22, 2021, 5:17pm

It looks like too many simultaneous queries were sent to the same workflow instance. Do you call query in a loop from multiple processes?

mrsaints · September 22, 2021, 7:15pm

Sort of. I’m following Signalling system - Human driven workflows by signalling, then querying occasionally. But I do have quite a few “human driven workflows”, so there can occasionally be a surge in queries. Is this not the way to go?

(Worth mentioning that I am aware of the “synchronous proxy” workflow pattern, but I only became aware of it after putting quite a few workflows into production, so that migration won’t happen any time soon)

maxim · September 22, 2021, 7:17pm

This message indicates that a single workflow instance received many queries. So the surge in queries to multiple workflows shouldn’t be an issue. But the surge to the single workflow might cause it.

mrsaints · September 22, 2021, 7:19pm

Ahh, by workflow instance, do you mean a workflow run? Is there any way to identify which one it is if so?

Also, is there a definition of “many”? Would this be like in the thousands?

Wenquan_Xing · September 22, 2021, 9:10pm

Currently, the maximum number of buffered queries allowed is 1.
Ref:

https://github.com/temporalio/temporal/blob/58393935e2abc1973ec148bdc1338d16c286398b/service/history/historyEngine.go#L795

          		}
          		req.Execution.RunId = msResp.Execution.RunId
          		return e.queryDirectlyThroughMatching(ctx, msResp, request.GetNamespaceId(), req, scope)
          	}
          
          	// If we get here it means query could not be dispatched through matching directly, so it must block
          	// until either an result has been obtained on a workflow task response or until it is safe to dispatch directly through matching.
          	sw := scope.StartTimer(metrics.WorkflowTaskQueryLatency)
          	defer sw.Stop()
          	queryReg := mutableState.GetQueryRegistry()
          	if len(queryReg.GetBufferedIDs()) >= e.config.MaxBufferedQueryCount() {
          		scope.IncCounter(metrics.QueryBufferExceededCount)
          		return nil, consts.ErrConsistentQueryBufferExceeded
          	}
          	queryID, termCh := queryReg.BufferQuery(req.GetQuery())
          	defer queryReg.RemoveQuery(queryID)
          	release(nil)
          	select {
          	case <-termCh:
          		state, err := queryReg.GetTerminationState(queryID)
          		if err != nil {

https://github.com/temporalio/temporal/blob/58393935e2abc1973ec148bdc1338d16c286398b/service/history/configs/config.go#L416

          		ReplicationTaskProcessorErrorRetryWait:               dc.GetDurationPropertyFilteredByShardID(dynamicconfig.ReplicationTaskProcessorErrorRetryWait, 1*time.Second),
          		ReplicationTaskProcessorErrorRetryBackoffCoefficient: dc.GetFloat64PropertyFilteredByShardID(dynamicconfig.ReplicationTaskProcessorErrorRetryBackoffCoefficient, 1.2),
          		ReplicationTaskProcessorErrorRetryMaxInterval:        dc.GetDurationPropertyFilteredByShardID(dynamicconfig.ReplicationTaskProcessorErrorRetryMaxInterval, 5*time.Second),
          		ReplicationTaskProcessorErrorRetryMaxAttempts:        dc.GetIntPropertyFilteredByShardID(dynamicconfig.ReplicationTaskProcessorErrorRetryMaxAttempts, 80),
          		ReplicationTaskProcessorErrorRetryExpiration:         dc.GetDurationPropertyFilteredByShardID(dynamicconfig.ReplicationTaskProcessorErrorRetryExpiration, 5*time.Minute),
          		ReplicationTaskProcessorNoTaskRetryWait:              dc.GetDurationPropertyFilteredByShardID(dynamicconfig.ReplicationTaskProcessorNoTaskInitialWait, 2*time.Second),
          		ReplicationTaskProcessorCleanupInterval:              dc.GetDurationPropertyFilteredByShardID(dynamicconfig.ReplicationTaskProcessorCleanupInterval, 1*time.Minute),
          		ReplicationTaskProcessorCleanupJitterCoefficient:     dc.GetFloat64PropertyFilteredByShardID(dynamicconfig.ReplicationTaskProcessorCleanupJitterCoefficient, 0.15),
          
          		MaxBufferedQueryCount:                 dc.GetIntProperty(dynamicconfig.MaxBufferedQueryCount, 1),
          		MutableStateChecksumGenProbability:    dc.GetIntPropertyFilteredByNamespace(dynamicconfig.MutableStateChecksumGenProbability, 0),
          		MutableStateChecksumVerifyProbability: dc.GetIntPropertyFilteredByNamespace(dynamicconfig.MutableStateChecksumVerifyProbability, 0),
          		MutableStateChecksumInvalidateBefore:  dc.GetFloat64Property(dynamicconfig.MutableStateChecksumInvalidateBefore, 0),
          
          		ReplicationEventsFromCurrentCluster:    dc.GetBoolPropertyFnWithNamespaceFilter(dynamicconfig.ReplicationEventsFromCurrentCluster, false),
          		StandbyTaskReReplicationContextTimeout: dc.GetDurationPropertyFilteredByNamespaceID(dynamicconfig.StandbyTaskReReplicationContextTimeout, 3*time.Minute),
          
          		EnableDropStuckTaskByNamespaceID: dc.GetBoolPropertyFnWithNamespaceIDFilter(dynamicconfig.EnableDropStuckTaskByNamespaceID, false),
          		SkipReapplicationByNamespaceID:   dc.GetBoolPropertyFnWithNamespaceIDFilter(dynamicconfig.SkipReapplicationByNamespaceID, false),

> Ahh, by workflow instance, do you mean a workflow run?

The limit applies per namespace, workflow ID, and run ID.

> Also, is there a definition of “many”? Would this be like in the thousands?

The query rate per namespace, workflow ID, and run ID should probably stay below roughly 10 QPS.
What is the reason for the high-frequency querying?
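A minimal sketch of why even a low query rate can trip this limit: the check quoted above counts queries that are buffered at the same moment, not queries per second. With a limit of 1, a second query that arrives while the first is still buffered is rejected. The `queryBuffer` type and its methods are hypothetical stand-ins for illustration, not the server's actual query registry.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBufferFull reuses the message quoted in this thread. The real server
// returns its own error value (consts.ErrConsistentQueryBufferExceeded).
var errBufferFull = errors.New("consistent query buffer is full, cannot accept new consistent queries")

// queryBuffer is a hypothetical model of one workflow run's buffered queries.
type queryBuffer struct {
	mu       sync.Mutex
	max      int
	buffered map[int]struct{}
	nextID   int
}

func newQueryBuffer(max int) *queryBuffer {
	return &queryBuffer{max: max, buffered: make(map[int]struct{})}
}

// add buffers a query, or rejects it when the buffer already holds max entries.
func (b *queryBuffer) add() (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.buffered) >= b.max {
		return 0, errBufferFull
	}
	b.nextID++
	b.buffered[b.nextID] = struct{}{}
	return b.nextID, nil
}

// remove frees the slot once the query has been answered.
func (b *queryBuffer) remove(id int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.buffered, id)
}

func main() {
	buf := newQueryBuffer(1) // the default shown in config.go
	id, _ := buf.add()       // first query is buffered
	_, err := buf.add()      // second overlapping query on the same run
	fmt.Println(err)
	buf.remove(id) // first query completes
	_, err = buf.add()
	fmt.Println(err)
}
```

In this model, two callers polling the same run once per second can still collide if their queries happen to overlap while one is waiting in the buffer.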

mrsaints · September 22, 2021, 9:33pm

That was just an example; in reality, I think I would be doing at most 3 qps (though in most cases it would be more like 1 qps per namespace, workflow ID, and run ID), so I’m still unsure how I hit that limit.

Nonetheless, I think this gives me good information to investigate further. Do you know if it is possible to isolate the specific workflow?

Wenquan_Xing · September 22, 2021, 10:35pm

Currently there is no detailed logging (workflow ID and run ID) for API calls, since the caller should be able to see the error directly.

mrsaints · September 22, 2021, 10:51pm

Sorry, would this be similar to the “consistent query buffer is full” error? Because I don’t think I see that, but I do see the *serviceerror.DeadlineExceeded error very roughly around the same time.

Wenquan_Xing · September 23, 2021, 3:42am

It seems that the existing error type is not entirely correct. I created an issue for tracking.

ref: Split resource limit exceed error into user facing & internal facing error types · Issue #1966 · temporalio/temporal · GitHub

mrsaints · January 10, 2022, 10:13pm

Updates for anyone who comes across this:

We never really got to the bottom of this.
We reduced its prevalence through a combination of less querying, less signalling, shorter timeouts, and retries.
We reduced it further by tweaking the visibility store's max and idle connections.
Deploying server 1.13.x made no noticeable difference.
Upgrading the SDK to v1.11.x made no noticeable difference.
In the end, what made the biggest difference was upgrading the server to v1.14.1 (pretty much no occurrences since) and the SDK to v1.12.0. This could be a red herring; nonetheless, problem solved.
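For readers who want to try the "retries" part of the list above, here is a rough sketch of a client-side retry with exponential backoff. It is not taken from the poster's code. The `query` callback is a hypothetical stand-in for a real workflow query call, and matching on the error text is an assumption made for illustration; check which error type your SDK version actually returns.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// isBufferFull matches on the message quoted in this thread.
// Text matching is an assumption for this sketch, not an SDK contract.
func isBufferFull(err error) bool {
	return err != nil && strings.Contains(err.Error(), "consistent query buffer is full")
}

// queryWithRetry calls query up to attempts times, doubling the wait after
// each buffer-full error. Any other error is returned immediately.
func queryWithRetry(ctx context.Context, attempts int, wait time.Duration,
	query func(context.Context) (string, error)) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		res, err := query(ctx)
		if err == nil {
			return res, nil
		}
		if !isBufferFull(err) {
			return "", err
		}
		lastErr = err
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(wait):
			wait *= 2
		}
	}
	return "", lastErr
}

// runDemo uses a fake query that fails twice with the buffer-full message
// and then succeeds, and reports the result and the number of calls made.
func runDemo() (string, int, error) {
	calls := 0
	fakeQuery := func(ctx context.Context) (string, error) {
		calls++
		if calls < 3 {
			return "", errors.New("consistent query buffer is full, cannot accept new consistent queries")
		}
		return "approved", nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	res, err := queryWithRetry(ctx, 5, 10*time.Millisecond, fakeQuery)
	return res, calls, err
}

func main() {
	res, calls, err := runDemo()
	fmt.Println(res, calls, err)
}
```

Keeping the attempt count small and the context deadline short matches the "shorter timeouts" item: a retry loop with long waits would only add more overlapping queries against the same run.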

RichRM · February 20, 2023, 8:35pm

We are seeing this error intermittently as well. We are on 1.19.1 and have set the following in our dynamic config:

history.MaxBufferedQueryCount:
- value: 4
  constraints: {}

How do we know what to set this value to, and/or if any other vars also need adjusting? Worth noting, all the services are running as separate containers and have plenty of spare CPU/memory.

tihomir · February 20, 2023, 9:26pm

There shouldn’t be any risk in setting it to 5, which should cover most use cases imo.

RichRM · February 21, 2023, 12:10am

Hmm, I tried it at 5 and am still seeing these (in the frontend and history services).

maxim · February 21, 2023, 12:49am

Do you make a lot of queries to the same workflow instance?

Steve_Farthing · February 21, 2023, 5:36pm

Yes, we may have a handful of nodes polling a single workflow ID at the same time.

maxim · February 21, 2023, 5:53pm

You can try increasing the buffer or just ignore intermittent errors.

I would reevaluate your design to avoid such frequent queries. In the future, we plan to add the ability to wait for state changes using a long poll to improve the experience in such scenarios.
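One possible way to cut query frequency without changing the workflow is to share a single query result among all callers inside a process for a short time. The sketch below is an assumption-laden illustration, not a Temporal API: `cachedQuery` and `fetch` are hypothetical names, and `fetch` stands in for the real workflow query call.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cachedQuery shares one query result among all callers in this process
// for the duration of ttl. fetch is a hypothetical stand-in for the query.
type cachedQuery struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetch   func() (string, error)
	value   string
	fetched time.Time
	valid   bool
}

// get returns the cached value if it is still fresh. Otherwise it runs fetch
// once while holding the lock, so concurrent callers wait instead of issuing
// their own queries.
func (c *cachedQuery) get() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.valid && time.Since(c.fetched) < c.ttl {
		return c.value, nil
	}
	v, err := c.fetch()
	if err != nil {
		return "", err // errors are not cached
	}
	c.value, c.fetched, c.valid = v, time.Now(), true
	return v, nil
}

// runDemo starts n concurrent readers and reports how many real queries ran.
func runDemo(n int) int {
	calls := 0
	c := &cachedQuery{ttl: time.Minute, fetch: func() (string, error) {
		calls++ // only touched while c.mu is held
		return "pending", nil
	}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.get()
		}()
	}
	wg.Wait()
	return calls
}

func main() {
	fmt.Println(runDemo(20))
}
```

This only coalesces queries within one process. With a handful of nodes, each node would still issue its own query per TTL window, so overlap between nodes remains possible, and readers see state that may be up to one TTL old.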

Steve_Farthing · February 21, 2023, 6:17pm

OK, thanks! I came across a similar issue here (Best practices for providing UI updates on workflow status?) and I see Redis Streams are proposed for reporting details back to a UI (rather than having multiple nodes long polling Temporal). Does this seem like a reasonable short-term solution to the problem we are running into?

maxim · February 21, 2023, 6:30pm

This sounds like a sensible workaround.
