Temporal-Server v1.22.0 Errors

Hi,

I'm observing a couple of recurring errors with Temporal Server v1.22.0. Setup: CockroachDB as the persistence store, 512 history shards, internal-frontend enabled, and auth enabled.
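For reference, this is roughly how that setup maps onto the server's static config. It's only a sketch with assumed values (hosts, ports, and credentials are placeholders, not my actual file); the shard count and the overall shape match what I described, and CockroachDB is reached through the Postgres SQL plugin:

```yaml
# Sketch only: placeholder hosts/credentials, visibility datastore omitted for brevity.
persistence:
  numHistoryShards: 512            # fixed at first deployment, cannot be changed afterwards
  defaultStore: default
  visibilityStore: visibility
  datastores:
    default:
      sql:
        pluginName: "postgres12"   # CockroachDB is configured via the Postgres plugin
        databaseName: "temporal"
        connectAddr: "cockroachdb:26257"
        connectProtocol: "tcp"
        user: "temporal"
        password: "changeme"
        maxConns: 20

services:
  internal-frontend:               # dedicated internal frontend for server-internal traffic when auth is on
    rpc:
      grpcPort: 7236
      membershipPort: 6936
      bindOnIP: "0.0.0.0"

global:
  authorization:                   # auth enabled with the built-in authorizer/claim mapper
    authorizer: "default"
    claimMapper: "default"
```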

May I know what is causing these issues?

1. Matching

{"level":"error","ts":"2023-09-28T17:12:33.460Z","msg":"service failures","operation":"GetTaskQueueUserData","wf-namespace":"temporal-system","error":"task queue closed","logging-call-at":"telemetry.go:328","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/temporal/common/rpc/interceptor/telemetry.go:328\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept\n\t/temporal/common/rpc/interceptor/telemetry.go:169\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.temporal.io/server/service.GrpcServerOptionsProvider.getUnaryInterceptors.NewServerMetricsTrailerPropagatorInterceptor.func4\n\t/temporal/common/metrics/grpc.go:113\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.temporal.io/server/service.GrpcServerOptionsProvider.getUnaryInterceptors.NewServerMetricsContextInjectorInterceptor.func3\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/temporal/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:344\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:145\ngoogle.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1170\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_GetTaskQueueUserData_Handler\n\t/temporal/api/matchingservice/v1/service.pb.go:660\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/temporal/vendor/google.golang.org/grpc/server.go:1360\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/temporal/vendor/google.golang.org/grpc/server.go:1737\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/temporal/vendor/google.golang.org/grpc/server.go:982"}

2. History

{"level":"error","ts":"2023-09-28T17:07:43.031Z","msg":"Unable to process new range","shard-id":344,"address":"100.127.123.67:7234","component":"timer-queue-processor","error":"shard status unknown","logging-call-at":"queue_base.go:316","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*queueBase).processNewRange\n\t/temporal/service/history/queues/queue_base.go:316\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/temporal/service/history/queues/queue_scheduled.go:218"}

Hi @jaffarsadik, I'm using v1.22 as well and seeing similar errors.

Did you manage to find the cause of these issues? We are seeing similar errors when load testing Temporal with Cassandra (3 nodes, replication factor 1 for the Temporal keyspaces, LOCAL_QUORUM consistency).
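For comparison, this is roughly how our Cassandra persistence is wired up. Again only a sketch with placeholder hosts and credentials; the keyspaces were created with replication factor 1, and the consistency block just mirrors the LOCAL_QUORUM / LOCAL_SERIAL settings mentioned above (which are also the server defaults):

```yaml
# Sketch only: placeholder hosts/credentials, visibility datastore omitted for brevity.
persistence:
  defaultStore: default
  visibilityStore: visibility
  datastores:
    default:
      cassandra:
        hosts: "cassandra-0,cassandra-1,cassandra-2"   # 3-node cluster
        port: 9042
        keyspace: "temporal"                           # keyspace created with replication factor 1
        user: "temporal"
        password: "changeme"
        consistency:
          default:
            consistency: "LOCAL_QUORUM"
            serialConsistency: "LOCAL_SERIAL"
```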

This seems to cause some workflows to not be picked up by the worker, even though the worker is connected and has already finished all previous workflows. Restarting the worker doesn't help. The stuck workflows are only picked up again after the StartToClose timeout expires.

Matching:

{"level":"error","ts":"2023-11-16T04:13:53.163Z","msg":"service failures","operation":"GetTaskQueueUserData","wf-namespace":"temporal-system","error":"task queue closed","logging-call-at":"telemetry.go:328","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:328\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:169\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/metrics.NewServerMetricsTrailerPropagatorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:113\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.42.0/interceptor.go:344\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/home/builder/temporal/common/rpc/grpc.go:145\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1170\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_GetTaskQueueUserData_Handler\n\t/home/builder/temporal/api/matchingservice/v1/service.pb.go:660\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1360\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1737\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:982"}

History: got a few other errors:

{"level":"error","ts":"2023-11-16T03:06:16.856Z","msg":"Unable to process new range","shard-id":3111,"address":"10.42.0.120:7234","component":"timer-queue-processor","error":"shard status unknown","logging-call-at":"queue_base.go:316","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*queueBase).processNewRange\n\t/home/builder/temporal/service/history/queues/queue_base.go:316\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/home/builder/temporal/service/history/queues/queue_scheduled.go:218"}
{"level":"info","ts":"2023-11-16T03:58:53.445Z","msg":"matching client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:219"}
{"level":"error","ts":"2023-11-16T03:06:09.240Z","msg":"Error acquiring shard","shard-id":1242,"address":"10.42.0.120:7234","error":"Failed to update shard.  previous_range_id: 1, columns: (range_id=2)","is-retryable":false,"logging-call-at":"context_impl.go:1874","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard.func2.1\n\t/home/builder/temporal/service/history/shard/context_impl.go:1874\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard.func2\n\t/home/builder/temporal/service/history/shard/context_impl.go:1881\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:153\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:120\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard\n\t/home/builder/temporal/service/history/shard/context_impl.go:1889"}
{"level":"error","ts":"2023-11-16T03:06:09.240Z","msg":"Couldn't acquire shard","shard-id":1242,"address":"10.42.0.120:7234","error":"Failed to update shard.  previous_range_id: 1, columns: (range_id=2)","logging-call-at":"context_impl.go:1892","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard\n\t/home/builder/temporal/service/history/shard/context_impl.go:1892"}