Error: query directly though matching on non-sticky failed

We’re seeing a constant rate of this error from the history service for one specific workflow, and I’m not sure A) how we’re causing it and B) how to fix it. Any insight would be appreciated!

The workflow runs in an app called buoy and is queried by a separate app called orca using the following code:

// workflowClientProvider is part of our own SDK to auto-handle XDC routing
val workflowClient = workflowClientProvider.get("namespace")

val stub = workflowClient.newWorkflowStub(
  RolloutWorkflow::class.java,
  rolloutId,
  Optional.empty()
)

// the query method that's having a bad time
return stub.nextStep

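For context, the workflow interface presumably declares a query method backing `stub.nextStep` (Kotlin resolves the property access to a Java getter). A minimal sketch of what that interface might look like, assuming a hypothetical `run` workflow method and return type:

```kotlin
import io.temporal.workflow.QueryMethod
import io.temporal.workflow.WorkflowInterface
import io.temporal.workflow.WorkflowMethod

@WorkflowInterface
interface RolloutWorkflow {
    // Hypothetical entry point; the real signature isn't shown in this thread.
    @WorkflowMethod
    fun run(rolloutId: String)

    // `stub.nextStep` above compiles to a call to this getter,
    // which Temporal dispatches as a query against the running workflow.
    @QueryMethod
    fun getNextStep(): String
}
```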
Additional information:

Temporal Server: 1.13.1
Java SDK: 1.7.1
error: context canceled
logging-call-at: historyEngine.go:941
service: history

go.temporal.io/server/common/log.(*zapLogger).Error
	/go/pkg/mod/go.temporal.io/server@v1.13.1/common/log/zap_logger.go:142
go.temporal.io/server/service/history.(*historyEngineImpl).queryDirectlyThroughMatching
	/go/pkg/mod/go.temporal.io/server@v1.13.1/service/history/historyEngine.go:941
go.temporal.io/server/service/history.(*historyEngineImpl).QueryWorkflow
	/go/pkg/mod/go.temporal.io/server@v1.13.1/service/history/historyEngine.go:790
go.temporal.io/server/service/history.(*Handler).QueryWorkflow.func1
	/go/pkg/mod/go.temporal.io/server@v1.13.1/service/history/handler.go:968
go.temporal.io/server/common/backoff.RetryContext
	/go/pkg/mod/go.temporal.io/server@v1.13.1/common/backoff/retry.go:125
go.temporal.io/server/service/history.(*Handler).QueryWorkflow
	/go/pkg/mod/go.temporal.io/server@v1.13.1/service/history/handler.go:966
go.temporal.io/server/api/historyservice/v1._HistoryService_QueryWorkflow_Handler.func1
	/go/pkg/mod/go.temporal.io/server@v1.13.1/api/historyservice/v1/service.pb.go:1401
go.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept
	/go/pkg/mod/go.temporal.io/server@v1.13.1/common/rpc/interceptor/rate_limit.go:83
google.golang.org/grpc.chainUnaryInterceptors.func1.1
	/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1113
go.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).Intercept
	/go/pkg/mod/go.temporal.io/server@v1.13.1/common/rpc/interceptor/telemetry.go:108
google.golang.org/grpc.chainUnaryInterceptors.func1.1
	/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/metrics.NewServerMetricsTrailerPropagatorInterceptor.func1
	/go/pkg/mod/go.temporal.io/server@v1.13.1/common/metrics/grpc.go:113
google.golang.org/grpc.chainUnaryInterceptors.func1.1
	/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1
	/go/pkg/mod/go.temporal.io/server@v1.13.1/common/metrics/grpc.go:66
google.golang.org/grpc.chainUnaryInterceptors.func1.1
	/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
go.temporal.io/server/common/rpc.ServiceErrorInterceptor
	/go/pkg/mod/go.temporal.io/server@v1.13.1/common/rpc/grpc.go:131
google.golang.org/grpc.chainUnaryInterceptors.func1.1
	/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1116
google.golang.org/grpc.chainUnaryInterceptors.func1
	/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1118
go.temporal.io/server/api/historyservice/v1._HistoryService_QueryWorkflow_Handler
	/go/pkg/mod/go.temporal.io/server@v1.13.1/api/historyservice/v1/service.pb.go:1403
google.golang.org/grpc.(*Server).processUnaryRPC
	/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1279
google.golang.org/grpc.(*Server).handleStream
	/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:1608
google.golang.org/grpc.(*Server).serveStreams.func1.2
	/go/pkg/mod/google.golang.org/grpc@v1.41.0/server.go:923

What QPS is “orca” sending to this workflow?
Similar issues have been reported when too many queries are sent to a single workflow, for example here.

Very low: it spikes to ~0.7 RPS, but the average is 0.1 RPS.

I wouldn’t expect that this rate would cause problems.

Could you provide the error logged on historyEngine.go:941?

Also, can you confirm that the task queue used by the query still has worker(s) polling it? If so, can you check whether they are logging any errors?
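One way to check for active pollers is to describe the task queue with tctl (the namespace and task queue names here are placeholders; use your own):

```shell
# Lists the workers currently polling the task queue, with their identities
# and last-access times. No pollers listed means queries have nowhere to go.
tctl --namespace your-namespace taskqueue describe --taskqueue your-task-queue
```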

The error logged was provided in the original post. It turned out the workflow had a bug that caused it to generate ~1000 events per minute. Once we fixed that so the event history stopped blowing up, these errors went away.

Hi @RobZienert, can you share the solution you applied to fix this?
Thanks

The developer had removed a Workflow.sleep that was rate-limiting a while (true) { activity.foo() } loop. Adding the sleep back restored the workflow to a reasonable event rate.
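The fix, roughly sketched (the activity name and one-minute interval are placeholders, not the actual code):

```kotlin
import io.temporal.workflow.Workflow
import java.time.Duration

// Inside the workflow implementation: the unthrottled loop looked roughly like
//   while (true) { activity.foo() }
// which schedules activities back-to-back and grows history at ~1000 events/min.
// Re-adding a durable timer between iterations bounds the event rate:
while (true) {
    activity.foo()
    // Workflow.sleep is a durable Temporal timer, safe across worker restarts;
    // one minute is a placeholder for whatever rate limit fits the use case.
    Workflow.sleep(Duration.ofMinutes(1))
}
```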