Temporal 1.20.3 "unable to initialize cassandra session" with cql-proxy and astraDb at startup on random pods

Hello,
I’m running Temporal 1.20.3 with cql-proxy as a sidecar and Astra DB, on an EKS cluster (Kubernetes 1.23) spread across multiple AZs.
Intermittently, when a new pod of any Temporal component starts, it goes into a crash loop with the message “unable to initialize cassandra session”. After several restarts it comes up normally, but sometimes it takes hundreds of restarts.

This is the log output:

“TEMPORAL_CLI_ADDRESS is not set, setting it to 100.64.12.237:7233\n”
“2023/07/07 08:49:06 Loading config; env=docker,zone=,configDir=config\n”
“2023/07/07 08:49:06 Loading config files=[config/docker.yaml]\n”
{“log”:“{"level":"info","ts":"2023-07-07T08:49:06.477Z","msg":"Build info.","git-time":"2023-05-15T23:50:55.000Z","git-revision":"45d22540323e59e4cd3fd62139b73409f1264fb3","git-modified":true,"go-arch":"amd64","go-os":"linux","go-version":"go1.20.4","cgo-enabled":false,"server-version":"1.20.3","debug-mode":false,"logging-call-at":"main.go:143"}\n”,“stream”:“stdout”,“time”:“2023-07-07T08:49:06.477756462Z”}
{“log”:“{"level":"warn","ts":"2023-07-07T08:49:06.478Z","msg":"Not using any authorizer and flag --allow-no-auth not detected. Future versions will require using the flag --allow-no-auth if you do not want to set an authorizer.","logging-call-at":"main.go:173"}\n”,“stream”:“stdout”,“time”:“2023-07-07T08:49:06.478709634Z”}
{“log”:“{"level":"fatal","ts":"2023-07-07T08:49:06.734Z","msg":"unable to initialize cassandra session","component":"metadata-initializer","error":"no connections were made when creating the session","logging-call-at":"factory.go:66","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/home/builder/temporal/common/log/zap_logger.go:174\ngo.temporal.io/server/common/persistence/cassandra.NewFactory\n\t/home/builder/temporal/common/persistence/cassandra/factory.go:66\ngo.temporal.io/server/common/persistence/client.DataStoreFactoryProvider\n\t/home/builder/temporal/common/persistence/client/store.go:82\ngo.temporal.io/server/temporal.ApplyClusterMetadataConfigProvider\n\t/home/builder/temporal/temporal/fx.go:621\nreflect.Value.call\n\t/usr/local/go/src/reflect/value.go:586\nreflect.Value.Call\n\t/usr/local/go/src/reflect/value.go:370\ngo.uber.org/dig.defaultInvoker\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/container.go:220\ngo.uber.org/dig.(*constructorNode).Call\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/constructor.go:154\ngo.uber.org/dig.paramSingle.Build\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:288\ngo.uber.org/dig.paramObjectField.Build\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:485\ngo.uber.org/dig.paramObject.Build\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:412\ngo.uber.org/dig.paramList.BuildList\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:151\ngo.uber.org/dig.(*constructorNode).Call\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/constructor.go:145\ngo.uber.org/dig.paramGroupedSlice.callGroupProviders\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:612\ngo.uber.org/dig.paramGroupedSlice.Build\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:642\ngo.uber.org/dig.paramObjectField.Build\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:485\ngo.uber.org/dig.paramObject.Build\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:412\ngo.uber.org/dig.paramList.BuildList\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:151\ngo.uber.org/dig.(*constructorN
ode).Call\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/constructor.go:145\ngo.uber.org/dig.paramSingle.Build\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:288\ngo.uber.org/dig.paramList.BuildList\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:151\ngo.uber.org/dig.(*constructorNode).Call\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/constructor.go:145\ngo.uber.org/dig.paramSingle.Build\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:288\ngo.uber.org/dig.paramList.BuildList\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/param.go:151\ngo.uber.org/dig.(*Scope).Invoke\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/invoke.go:85\ngo.uber.org/dig.(*Container).Invoke\n\t/go/pkg/mod/go.uber.org/dig@v1.15.0/invoke.go:46\ngo.uber.org/fx.runInvoke\n\t/go/pkg/mod/go.uber.org/fx@v1.18.2/invoke.go:108\ngo.uber.org/fx.(*module).executeInvoke\n\t/go/pkg/mod/go.uber.org/fx@v1.18.2/module.go:186\ngo.uber.org/fx.(*module).executeInvokes\n\t/go/pkg/mod/go.uber.org/fx@v1.18.2/module.go:172\ngo.uber.org/fx.New\n\t/go/pkg/mod/go.uber.org/fx@v1.18.2/app.go:530\ngo.temporal.io/server/temporal.NewServerFx\n\t/home/builder/temporal/temporal/fx.go:135\ngo.temporal.io/server/temporal.NewServer\n\t/home/builder/temporal/temporal/server.go:69\nmain.buildCLI.func2\n\t/home/builder/temporal/cmd/server/main.go:184\ngithub.com/urfave/cli/v2.(*Command).Run\n\t/go/pkg/mod/github.com/urfave/cli/v2@v2.4.0/command.go:163\ngithub.com/urfave/cli/v2.(*App).RunContext\n\t/go/pkg/mod/github.com/urfave/cli/v2@v2.4.0/app.go:313\ngithub.com/urfave/cli/v2.(*App).Run\n\t/go/pkg/mod/github.com/urfave/cli/v2@v2.4.0/app.go:224\nmain.main\n\t/home/builder/temporal/cmd/server/main.go:54\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}\n”,“stream”:“stdout”,“time”:“2023-07-07T08:49:06.73526558Z”}

I also made sure that Temporal starts only after cql-proxy reports ready, using the following command:

“until curl --silent --fail http://localhost:8000/readiness 2>&1 > /dev/null; do echo waiting for cql-proxy to start; sleep 1; done; ./entrypoint.sh”
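A bounded variant of this wait loop might help rule out the case where the HTTP readiness endpoint answers before the CQL listener accepts connections. The following is only a sketch: `wait_for` is a hypothetical helper, and the commented usage assumes that cql-proxy listens for CQL on localhost:9042 (not stated above) and that `nc` with `-z` support is available in the image.

```shell
#!/bin/sh
# wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds, giving up
# after MAX_ATTEMPTS tries. Output of COMMAND itself is discarded.
wait_for() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "waiting ($attempt/$max_attempts): $*"
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Possible usage as the container command (port 9042 and nc are assumptions):
#   wait_for 60 1 curl --silent --fail http://localhost:8000/readiness &&
#   wait_for 60 1 nc -z localhost 9042 &&
#   exec ./entrypoint.sh
```

Bounding the attempts makes a proxy that never becomes ready show up as an explicit failure instead of an endless wait. If the crash loop still occurs after both checks pass, the cause is probably not the startup ordering.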

I also tried port-forwarding to the cql-proxy of an affected pod, and I can connect normally with cqlsh and run queries.

I didn’t see this behavior before upgrading to 1.20; I was previously on 1.17.
For now, when I notice the problem, I keep killing the pod until it starts properly.

Do you have any idea what the problem could be? Could it be related to the Kubernetes node? The affected pods all seem to be on the same node.

Thanks.

Here is another log with a different message:

TEMPORAL_CLI_ADDRESS is not set, setting it to 100.64.2.192:7233
2023/07/07 10:59:21 Loading config; env=docker,zone=,configDir=config
2023/07/07 10:59:21 Loading config files=[config/docker.yaml]
{“level”:“info”,“ts”:“2023-07-07T10:59:21.988Z”,“msg”:“Build info.”,“git-time”:“2023-05-15T23:50:55.000Z”,“git-revision”:“45d22540323e59e4cd3fd62139b73409f1264fb3”,“git-modified”:true,“go-arch”:“amd64”,“go-os”:“linux”,“go-version”:“go1.20.4”,“cgo-enabled”:false,“server-version”:“1.20.3”,“debug-mode”:false,“logging-call-at”:“main.go:143”}
{“level”:“info”,“ts”:“2023-07-07T10:59:21.989Z”,“msg”:“dynamic config changed for the key: matching.numtaskqueuereadpartitions oldValue: nil newValue: { constraints: {} value: 3 }”,“logging-call-at”:“file_based_client.go:275”}
{“level”:“info”,“ts”:“2023-07-07T10:59:21.989Z”,“msg”:“dynamic config changed for the key: matching.numtaskqueuereadpartitions oldValue: nil newValue: { constraints: {{Namespace:test}} value: 6 }”,“logging-call-at”:“file_based_client.go:275”}
{“level”:“info”,“ts”:“2023-07-07T10:59:21.989Z”,“msg”:“dynamic config changed for the key: matching.numtaskqueuewritepartitions oldValue: nil newValue: { constraints: {} value: 3 }”,“logging-call-at”:“file_based_client.go:275”}
{“level”:“info”,“ts”:“2023-07-07T10:59:21.989Z”,“msg”:“dynamic config changed for the key: matching.numtaskqueuewritepartitions oldValue: nil newValue: { constraints: {{Namespace:test}} value: 6 }”,“logging-call-at”:“file_based_client.go:275”}
{“level”:“info”,“ts”:“2023-07-07T10:59:21.989Z”,“msg”:“Updated dynamic config”,“logging-call-at”:“file_based_client.go:195”}
{“level”:“warn”,“ts”:“2023-07-07T10:59:21.989Z”,“msg”:“Not using any authorizer and flag --allow-no-auth not detected. Future versions will require using the flag --allow-no-auth if you do not want to set an authorizer.”,“logging-call-at”:“main.go:173”}
[Fx] PROVIDE *pprof.PProfInitializerImpl <= go.temporal.io/server/common/pprof.NewInitializer()
[Fx] PROVIDE *temporal.ServerImpl <= go.temporal.io/server/temporal.NewServerFxImpl()
[Fx] PROVIDE temporal.Server <= go.temporal.io/server/temporal.glob..func9()
[Fx] SUPPLY temporal.ServerOption
[Fx] PROVIDE *temporal.serverOptions <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE chan interface {} <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE *config.Config <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE *config.PProf <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE log.Config <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE resource.ServiceNames <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE resource.NamespaceLogger <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE resolver.ServiceResolver <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE client.AbstractDataStoreFactory <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE searchattribute.Mapper <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE grpc.UnaryServerInterceptor <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE authorization.Authorizer <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE authorization.ClaimMapper <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE authorization.JWTAudienceMapper <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE log.Logger <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE client.FactoryProvider <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE dynamicconfig.Client <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE *dynamicconfig.Collection <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE encryption.TLSConfigProvider <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE *client.Config <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE client.Client <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE metrics.Handler <= go.temporal.io/server/temporal.ServerOptionsProvider()
[Fx] PROVIDE trace.SpanExporter <= go.temporal.io/server/temporal.glob..func2()
[Fx] PROVIDE client.FactoryProviderFn <= go.temporal.io/server/temporal.PersistenceFactoryProvider()
[Fx] PROVIDE *temporal.ServicesMetadata[group = “services”] <= go.temporal.io/server/temporal.HistoryServiceProvider()
[Fx] PROVIDE *temporal.ServicesMetadata[group = “services”] <= go.temporal.io/server/temporal.MatchingServiceProvider()
[Fx] PROVIDE *temporal.ServicesMetadata[group = “services”] <= go.temporal.io/server/temporal.FrontendServiceProvider()
[Fx] PROVIDE *temporal.ServicesMetadata[group = “services”] <= go.temporal.io/server/temporal.InternalFrontendServiceProvider()
[Fx] PROVIDE *temporal.ServicesMetadata[group = “services”] <= go.temporal.io/server/temporal.WorkerServiceProvider()
[Fx] PROVIDE *cluster.Config <= go.temporal.io/server/temporal.ApplyClusterMetadataConfigProvider()
[Fx] PROVIDE config.Persistence <= go.temporal.io/server/temporal.ApplyClusterMetadataConfigProvider()
[Fx] PROVIDE fx.Lifecycle <= go.uber.org/fx.New.func1()
[Fx] PROVIDE fx.Shutdowner <= go.uber.org/fx.(*App).shutdowner-fm()
[Fx] PROVIDE fx.DotGraph <= go.uber.org/fx.(*App).dotGraph-fm()
[Fx] ERROR Failed to initialize custom logger: could not build arguments for function “go.uber.org/fx”.(*App).constructCustomLogger.func2
/go/pkg/mod/go.uber.org/fx@v1.18.2/app.go:414:
failed to build fxevent.Logger:
could not build arguments for function “go.temporal.io/server/temporal”.glob…func8
/home/builder/temporal/temporal/fx.go:1025:
failed to build log.Logger:
received non-nil error from function “go.temporal.io/server/temporal”.ServerOptionsProvider
/home/builder/temporal/temporal/fx.go:159:
cassandra schema version compatibility check failed: no connections were made when creating the session
Unable to create server. Error: could not build arguments for function “go.uber.org/fx”.(*App).constructCustomLogger.func2 (/go/pkg/mod/go.uber.org/fx@v1.18.2/app.go:414): failed to build fxevent.Logger: could not build arguments for function “go.temporal.io/server/temporal”.glob…func8 (/home/builder/temporal/temporal/fx.go:1025): failed to build log.Logger: received non-nil error from function “go.temporal.io/server/temporal”.ServerOptionsProvider (/home/builder/temporal/temporal/fx.go:159): cassandra schema version compatibility check failed: no connections were made when creating the session.

Thanks.

Same here with Temporal 1.20.4 on a Kubernetes cluster.