Temporal-Server v1.22.0 Errors

Hi,

I'm observing a couple of common errors with Temporal Server v1.22.0. DB: CockroachDB, configured shards: 512. internal-frontend is also enabled, along with auth.

May I know what is causing these issues?

1. Matching

{"level":"error","ts":"2023-09-28T17:12:33.460Z","msg":"service failures","operation":"GetTaskQueueUserData","wf-namespace":"temporal-system","error":"task queue closed","logging-call-at":"telemetry.go:328","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/temporal/common/rpc/interceptor/telemetry.go:328\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept\n\t/temporal/common/rpc/interceptor/telemetry.go:169\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.temporal.io/server/service.GrpcServerOptionsProvider.getUnaryInterceptors.NewServerMetricsTrailerPropagatorInterceptor.func4\n\t/temporal/common/metrics/grpc.go:113\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.temporal.io/server/service.GrpcServerOptionsProvider.getUnaryInterceptors.NewServerMetricsContextInjectorInterceptor.func3\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/temporal/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:344\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:145\ngoogle.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1170\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_GetTaskQueueUserData_Handler\n\t/temporal/api/matchingservice/v1/service.pb.go:660\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/temporal/vendor/google.golang.org/grpc/server.go:1360\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/temporal/vendor/google.golang.org/grpc/server.go:1737\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/temporal/vendor/google.golang.org/grpc/server.go:982"}

2. History

{"level":"error","ts":"2023-09-28T17:07:43.031Z","msg":"Unable to process new range","shard-id":344,"address":"100.127.123.67:7234","component":"timer-queue-processor","error":"shard status unknown","logging-call-at":"queue_base.go:316","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*queueBase).processNewRange\n\t/temporal/service/history/queues/queue_base.go:316\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/temporal/service/history/queues/queue_scheduled.go:218"}

Hi @jaffarsadik, I'm using v1.22 as well and seeing similar errors.

Did you manage to find out the cause of these issues? We are seeing similar things when load testing Temporal with Cassandra (3 nodes, the temporal database has a replication factor of 1, LOCAL_QUORUM consistency).

This seems to cause some workflows to not be picked up by the worker, even though the worker is connected and has already finished all previous workflows. Restarting the worker doesn't help. The stuck workflows are only picked up after the StartToClose timeout fires.
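In case it helps, here's a minimal Go SDK sketch of how we understand the timeouts to interact with this symptom, assuming the StartToClose in question is the activity StartToCloseTimeout; the activity name, queue, and durations are illustrative, not from the original post:

// Sketch only: shows which activity timeouts govern when a lost task is re-dispatched.
package app

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func MyWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		// If a dispatched activity task goes missing, the server only re-schedules it
		// after this timeout fires, which matches the "picked up after StartToClose
		// timed out" behaviour described above.
		StartToCloseTimeout: 30 * time.Second,
		// Optional: a ScheduleToStartTimeout surfaces tasks that sit in the queue
		// without ever being matched to a poller much sooner.
		ScheduleToStartTimeout: 10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    5,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	// "ProcessOrder" is a placeholder activity name.
	return workflow.ExecuteActivity(ctx, "ProcessOrder").Get(ctx, nil)
}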

Matching:

{"level":"error","ts":"2023-11-16T04:13:53.163Z","msg":"service failures","operation":"GetTaskQueueUserData","wf-namespace":"temporal-system","error":"task queue closed","logging-call-at":"telemetry.go:328","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:328\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:169\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/metrics.NewServerMetricsTrailerPropagatorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:113\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.42.0/interceptor.go:344\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/home/builder/temporal/common/rpc/grpc.go:145\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1170\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_GetTaskQueueUserData_Handler\n\t/home/builder/temporal/api/matchingservice/v1/service.pb.go:660\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1360\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1737\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:982"}

History: I also got a few other errors:

{"level":"error","ts":"2023-11-16T03:06:16.856Z","msg":"Unable to process new range","shard-id":3111,"address":"10.42.0.120:7234","component":"timer-queue-processor","error":"shard status unknown","logging-call-at":"queue_base.go:316","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*queueBase).processNewRange\n\t/home/builder/temporal/service/history/queues/queue_base.go:316\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/home/builder/temporal/service/history/queues/queue_scheduled.go:218"}
{"level":"info","ts":"2023-11-16T03:58:53.445Z","msg":"matching client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:219"}
{"level":"error","ts":"2023-11-16T03:06:09.240Z","msg":"Error acquiring shard","shard-id":1242,"address":"10.42.0.120:7234","error":"Failed to update shard.  previous_range_id: 1, columns: (range_id=2)","is-retryable":false,"logging-call-at":"context_impl.go:1874","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard.func2.1\n\t/home/builder/temporal/service/history/shard/context_impl.go:1874\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard.func2\n\t/home/builder/temporal/service/history/shard/context_impl.go:1881\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:153\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:120\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard\n\t/home/builder/temporal/service/history/shard/context_impl.go:1889"}
{"level":"error","ts":"2023-11-16T03:06:09.240Z","msg":"Couldn't acquire shard","shard-id":1242,"address":"10.42.0.120:7234","error":"Failed to update shard.  previous_range_id: 1, columns: (range_id=2)","logging-call-at":"context_impl.go:1892","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard\n\t/home/builder/temporal/service/history/shard/context_impl.go:1892"}

Did you all get anywhere with this one? I’m seeing these too after upgrading from 1.20.4 → 1.22.2

I got an answer to this one in the Temporal Slack:

It's harmless; it'll be fixed by "Don't unload task queue while child is polling user data" by dnr (Pull Request #5227, temporalio/temporal on GitHub), in 1.23.0 or 1.23.1.

@davidn @maxim @tihomir

Does this fix errors on history component as well?

We are also facing a similar issue in the history component; below are our observations from it:

{"level":"error","ts":"2024-02-12T14:49:36.841Z","msg":"Unable to process new range","shard-id":5945,"address":"10.2.12.39:7234","component":"timer-queue-processor","error":"shard status unknown","logging-call-at":"queue_base.go:316"}

{"level":"error","ts":"2024-02-13T17:37:53.515Z","msg":"Queue reader unable to retrieve tasks","shard-id":1476,"address":"10.2.12.39:7234","component":"transfer-queue-processor","queue-reader-id":0,"error":"operation GetTransferTasks encountered gocql: no response received from cassandra within timeout period","logging-call-at":"reader.go:470"}

Hey! Any updates here? We are now running Temporal Server 1.22.3 across two different persistence backends and encountering the same problems. Does the update from that PR fix the issue?

This seems to cause some workflows to not be picked up by the worker, even though the worker is connected and has already finished all previous workflows. Restarting the worker doesn't help. The stuck workflows are only picked up after the StartToClose timeout fires.
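For what it's worth, this is roughly how we check whether the workers really are polling the queue while a workflow is stuck; a sketch against the Go SDK client, where the endpoint, namespace, and "my-task-queue" are placeholders for your deployment:

// Sketch only: lists the pollers the matching service currently sees for a task queue,
// which helps tell a worker-side problem apart from a server-side one.
package main

import (
	"context"
	"fmt"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend:7233", // placeholder
		Namespace: "default",                // placeholder
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	resp, err := c.DescribeTaskQueue(context.Background(), "my-task-queue",
		enumspb.TASK_QUEUE_TYPE_WORKFLOW)
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range resp.GetPollers() {
		// Identity and last access time of each worker polling the queue.
		fmt.Println(p.GetIdentity(), p.GetLastAccessTime())
	}
}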

You’ll need to upgrade to 1.23.x. I don’t recall which version it was fixed in, but the problem is gone for us and I was able to turn logging back on.