Temporal-Server v1.22.0 Errors

Hi,

I observed a couple of recurring errors with Temporal Server v1.22.0. The database is CockroachDB, with 512 configured shards. internal-frontend is also enabled, along with auth.

May I know what is causing these issues?

1. Matching

{"level":"error","ts":"2023-09-28T17:12:33.460Z","msg":"service failures","operation":"GetTaskQueueUserData","wf-namespace":"temporal-system","error":"task queue closed","logging-call-at":"telemetry.go:328","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/temporal/common/rpc/interceptor/telemetry.go:328\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept\n\t/temporal/common/rpc/interceptor/telemetry.go:169\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.temporal.io/server/service.GrpcServerOptionsProvider.getUnaryInterceptors.NewServerMetricsTrailerPropagatorInterceptor.func4\n\t/temporal/common/metrics/grpc.go:113\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.temporal.io/server/service.GrpcServerOptionsProvider.getUnaryInterceptors.NewServerMetricsContextInjectorInterceptor.func3\n\t/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/temporal/vendor/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go:344\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1179\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/temporal/common/rpc/grpc.go:145\ngoogle.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1\n\t/temporal/vendor/google.golang.org/grpc/server.go:1170\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_GetTaskQueueUserData_Handler\n\t/temporal/api/matchingservice/v1/service.pb.go:660\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/temporal/vendor/google.golang.org/grpc/server.go:1360\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/temporal/vendor/google.golang.org/grpc/server.go:1737\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/temporal/vendor/google.golang.org/grpc/server.go:982"}

2. History

{"level":"error","ts":"2023-09-28T17:07:43.031Z","msg":"Unable to process new range","shard-id":344,"address":"100.127.123.67:7234","component":"timer-queue-processor","error":"shard status unknown","logging-call-at":"queue_base.go:316","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*queueBase).processNewRange\n\t/temporal/service/history/queues/queue_base.go:316\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/temporal/service/history/queues/queue_scheduled.go:218"}

Hi @jaffarsadik, I'm using v1.22 as well and seeing similar errors.

Did you manage to find out the cause of these issues? We are seeing similar things when load testing Temporal with Cassandra (3 nodes, the temporal database's replication factor is 1, LOCAL_QUORUM consistency).

This seems to cause some workflows to not be picked up by the worker, even though the worker is connected and has already finished all previous workflows. Restarting the worker doesn't help. The stuck workflows are only picked up after the StartToClose timeout expires.
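For context, a minimal Go SDK sketch of where that StartToClose timeout lives, assuming it is the activity StartToCloseTimeout being referred to (the activity name and the 30-second value are placeholders, not taken from our setup):

```go
package sample

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// SampleWorkflow shows where the StartToClose timeout mentioned above is set.
// Once a dispatched activity task exceeds StartToCloseTimeout, the server
// times it out and schedules a retry, which is when a "stuck" task would be
// handed to a polling worker again.
func SampleWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second, // placeholder value
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, "SomeActivity").Get(ctx, nil)
}
```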

Matching:

{"level":"error","ts":"2023-11-16T04:13:53.163Z","msg":"service failures","operation":"GetTaskQueueUserData","wf-namespace":"temporal-system","error":"task queue closed","logging-call-at":"telemetry.go:328","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:328\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:169\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/metrics.NewServerMetricsTrailerPropagatorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:113\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.42.0/interceptor.go:344\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/home/builder/temporal/common/rpc/grpc.go:145\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1170\ngo.temporal.io/server/api/matchingservice/v1._MatchingService_GetTaskQueueUserData_Handler\n\t/home/builder/temporal/api/matchingservice/v1/service.pb.go:660\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1360\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1737\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:982"}

History: we also got a few other errors:

{"level":"error","ts":"2023-11-16T03:06:16.856Z","msg":"Unable to process new range","shard-id":3111,"address":"10.42.0.120:7234","component":"timer-queue-processor","error":"shard status unknown","logging-call-at":"queue_base.go:316","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/queues.(*queueBase).processNewRange\n\t/home/builder/temporal/service/history/queues/queue_base.go:316\ngo.temporal.io/server/service/history/queues.(*scheduledQueue).processEventLoop\n\t/home/builder/temporal/service/history/queues/queue_scheduled.go:218"}
{"level":"info","ts":"2023-11-16T03:58:53.445Z","msg":"matching client encountered error","service":"history","error":"Workflow is busy.","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:219"}
{"level":"error","ts":"2023-11-16T03:06:09.240Z","msg":"Error acquiring shard","shard-id":1242,"address":"10.42.0.120:7234","error":"Failed to update shard.  previous_range_id: 1, columns: (range_id=2)","is-retryable":false,"logging-call-at":"context_impl.go:1874","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard.func2.1\n\t/home/builder/temporal/service/history/shard/context_impl.go:1874\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard.func2\n\t/home/builder/temporal/service/history/shard/context_impl.go:1881\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:153\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:120\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard\n\t/home/builder/temporal/service/history/shard/context_impl.go:1889"}
{"level":"error","ts":"2023-11-16T03:06:09.240Z","msg":"Couldn't acquire shard","shard-id":1242,"address":"10.42.0.120:7234","error":"Failed to update shard.  previous_range_id: 1, columns: (range_id=2)","logging-call-at":"context_impl.go:1892","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/service/history/shard.(*ContextImpl).acquireShard\n\t/home/builder/temporal/service/history/shard/context_impl.go:1892"}

Did you all get anywhere with this one? I’m seeing these too after upgrading from 1.20.4 → 1.22.2

I got an answer to this one in Temporal Slack:

It's harmless; it will be fixed by "Don't unload task queue while child is polling user data" (temporalio/temporal Pull Request #5227 on GitHub), in 1.23.0 or 1.23.1.

@davidn @maxim @tihomir

Does this fix the errors on the history component as well?

We also face a similar issue on the history component; below are our observations from it:

{"level":"error","ts":"2024-02-12T14:49:36.841Z","msg":"Unable to process new range","shard-id":5945,"address":"10.2.12.39:7234","component":"timer-queue-processor","error":"shard status unknown","logging-call-at":"queue_base.go:316"}

{"level":"error","ts":"2024-02-13T17:37:53.515Z","msg":"Queue reader unable to retrieve tasks","shard-id":1476,"address":"10.2.12.39:7234","component":"transfer-queue-processor","queue-reader-id":0,"error":"operation GetTransferTasks encountered gocql: no response received from cassandra within timeout period","logging-call-at":"reader.go:470"}
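For reference, the "no response received from cassandra within timeout period" text is the gocql client-side request timeout surfacing through the Cassandra persistence layer. A minimal standalone gocql sketch of the settings involved (hosts, keyspace, and timeout value are placeholders; Temporal derives its own equivalent from the server's persistence config, not from code like this):

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder hosts and keyspace for illustration only.
	cluster := gocql.NewCluster("cassandra-1", "cassandra-2", "cassandra-3")
	cluster.Keyspace = "temporal"
	cluster.Consistency = gocql.LocalQuorum
	// Timeout is the per-request limit; exceeding it produces
	// "gocql: no response received from cassandra within timeout period".
	cluster.Timeout = 10 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("failed to create session: %v", err)
	}
	defer session.Close()
}
```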

Hey! Any updates here? We are now running Temporal Server 1.22.3 over two different persistence backends and encountering the same problems. Does the update from that PR fix the issue?

This seems to cause some workflows to not be picked up by the worker, even though the worker is connected and has already finished all previous workflows. Restarting the worker doesn't help. The stuck workflows are only picked up after the StartToClose timeout expires.