Temporal-service 1.14.4 upgraded, frontend & worker pods are crashing

Hi,

I have upgraded to the latest version of the Temporal service (1.14.4) and am trying to deploy it on k8s with the existing database (CockroachDB). I have taken the necessary steps on the database and cleaned up the cluster-metadata and cluster-membership tables. Please let me know what could be causing the issue.

clusterMetadata:
  enableGlobalNamespace: false
  failoverVersionIncrement: 10
  masterClusterName: "active"
  currentClusterName: "active"
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "frontend"
      rpcAddress: "frontend:7233"

I am getting the below two errors in the frontend pod.

Error-1

{"level":"error","ts":"2022-02-01T18:09:34.181Z","msg":"Supplied configuration key/value mismatches persisted cluster metadata. Continuing with the persisted value as this value cannot be changed once initialized.","component":"metadata-initializer","key":"clusterInformation.RPCAddress","ignored-value":{"Enabled":true,"InitialFailoverVersion":1,"RPCAddress":"100.127.24.223:7233"},"value":{"Enabled":true,"InitialFailoverVersion":1,"RPCAddress":"100.127.92.163:7233"},"logging-call-at":"fx.go:713","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/temporal.loadClusterInformationFromStore\n\t/temporal/temporal/fx.go:713\ngo.temporal.io/server/temporal.ApplyClusterMetadataConfigProvider\n\t/temporal/temporal/fx.go:669\nreflect.Value.call\n\t/usr/local/go/src/reflect/value.go:476\nreflect.Value.Call\n\t/usr/local/go/src/reflect/value.go:337\ngo.uber.org/dig.defaultInvoker\n\t/temporal/vendor/go.uber.org/dig/dig.go:439\ngo.uber.org/dig.(*node).Call\n\t/temporal/vendor/go.uber.org/dig/dig.go:912\ngo.uber.org/dig.paramSingle.Build\n\t/temporal/vendor/go.uber.org/dig/param.go:240\ngo.uber.org/dig.paramObjectField.Build\n\t/temporal/vendor/go.uber.org/dig/param.go:396\ngo.uber.org/dig.paramObject.Build\n\t/temporal/vendor/go.uber.org/dig/param.go:323\ngo.uber.org/dig.paramList.BuildList\n\t/temporal/vendor/go.uber.org/dig/param.go:196\ngo.uber.org/dig.(*node).Call\n\t/temporal/vendor/go.uber.org/dig/dig.go:903\ngo.uber.org/dig.paramGroupedSlice.Build\n\t/temporal/vendor/go.uber.org/dig/param.go:458\ngo.uber.org/dig.paramObjectField.Build\n\t/temporal/vendor/go.uber.org/dig/param.go:396\ngo.uber.org/dig.paramObject.Build\n\t/temporal/vendor/go.uber.org/dig/param.go:323\ngo.uber.org/dig.paramList.BuildList\n\t/temporal/vendor/go.uber.org/dig/param.go:196\ngo.uber.org/dig.(*node).Call\n\t/temporal/vendor/go.uber.org/dig/dig.go:903\ngo.uber.org/dig.paramSingle.Build\n\t/temporal/vendor/go.uber.org/dig/param.go:240\ngo.uber.org/dig.paramList.BuildList\n\t/temporal/vendor/go.uber.org/dig/param.go:196\ngo.uber.org/dig.(*node).Call\n\t/temporal/vendor/go.uber.org/dig/dig.go:903\ngo.uber.org/dig.paramSingle.Build\n\t/temporal/vendor/go.uber.org/dig/param.go:240\ngo.uber.org/dig.paramList.BuildList\n\t/temporal/vendor/go.uber.org/dig/param.go:196\ngo.uber.org/dig.(*Container).Invoke\n\t/temporal/vendor/go.uber.org/dig/dig.go:587\ngo.uber.org/fx.(*App).executeInvoke\n\t/temporal/vendor/go.uber.org/fx/app.go:873\ngo.uber.org/fx.(*App).executeInvokes\n\t/temporal/vendor/go.uber.org/fx/app.go:846\ngo.uber.org/fx.New\n\t/temporal/vendor/go.uber.org/fx/app.go:594\ngo.temporal.io/server/temporal.NewServerFx\n\t/temporal/temporal/fx.go:97\ngo.temporal.io/server/temporal.NewServer\n\t/temporal/temporal/server.go:58\nmain.buildCLI.func2\n\t/temporal/cmd/server/main.go:163\ngithub.com/urfave/cli/v2.(*Command).Run\n\t/temporal/vendor/github.com/urfave/cli/v2/command.go:163\ngithub.com).RunContext\n\t/temporal/vendor/github.com/urfave/cli/v2/app.go:313\ngithub.com).Run\n\t/temporal/vendor/github.com/urfave/cli/v2/app.go:224\nmain.main\n\t/temporal/cmd/server/main.go:51\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

Error-2

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1ce6034]
goroutine 1 [running]:
go.temporal.io/server/common/rpc/encryption.newServerTLSConfig(0x292f638, 0xc0006343f0, 0x291b970, 0xc000e4fb70, 0xc0000eaad0, 0x292f3f8, 0xc0009a73c0, 0x8, 0x2540be400, 0x0)
/temporal/common/rpc/encryption/localStoreTlsProvider.go:255 +0x134
go.temporal.io/server/common/rpc/encryption.(*localStoreTlsProvider).GetFrontendServerConfig.func1(0xc000258cc0, 0x252f671, 0xd)
/temporal/common/rpc/encryption/localStoreTlsProvider.go:162 +0x7a
go.temporal.io/server/common/rpc/encryption.(*localStoreTlsProvider).getOrCreateConfig(0xc000258cc0, 0xc000258d38, 0xc000d51de8, 0x254c001, 0x0, 0x0, 0x0)
/temporal/common/rpc/encryption/localStoreTlsProvider.go:232 +0xd3
go.temporal.io/server/common/rpc/encryption.(*localStoreTlsProvider).GetFrontendServerConfig(0xc000258cc0, 0xc00088eee8, 0x2341820, 0xc00088ef00)
/temporal/common/rpc/encryption/localStoreTlsProvider.go:159 +0x72
go.temporal.io/server/common/rpc.(*RPCFactory).GetFrontendGRPCServerOptions(0xc0001841e0, 0x0, 0x0, 0x2540be400, 0xc000d51ed0, 0x40375a)
/temporal/common/rpc/rpc.go:74 +0x67
go.temporal.io/server/service/frontend.GrpcServerOptionsProvider(0x292f3f8, 0xc0009a73c0, 0xc001001200, 0x29335c8, 0xc0001841e0, 0xc000d0a1e0, 0xc000d68750, 0xc000140500, 0xc000d68c30, 0xc000140600, …)
/temporal/service/frontend/fx.go:152 +0x1f0

Hi,
The first error should be at the warning level; we have already updated it in our latest master. I think the service crash is due to the second error, the panic. This error is commonly triggered by an invalid TLS config. Did you recently update the TLS config or enable TLS? Could you share your TLS config?
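
To illustrate how a partial TLS config can end in a nil pointer dereference, here is a minimal sketch. It is not Temporal's actual code: `serverTLS`, `clientTLS`, `groupTLS`, `unsafeCertFile`, and `serverConfig` are hypothetical names, and the assumption is simply that a server-side code path reads a server section that was never configured.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// serverTLS, clientTLS and groupTLS are hypothetical stand-ins for a
// service's TLS settings; they are not Temporal's real config types.
type serverTLS struct {
	CertFile string
	KeyFile  string
}

type clientTLS struct {
	RootCAFiles []string
}

type groupTLS struct {
	Server *serverTLS // nil when the YAML has no server section
	Client *clientTLS
}

// unsafeCertFile reads the server section without checking it, so it
// panics with a nil pointer dereference for a client-only config.
func unsafeCertFile(g groupTLS) string {
	return g.Server.CertFile
}

// serverConfig treats a missing server certificate as "server-side TLS is
// not enabled" and returns (nil, nil) instead of dereferencing nil.
func serverConfig(g groupTLS) (*tls.Config, error) {
	if g.Server == nil || g.Server.CertFile == "" || g.Server.KeyFile == "" {
		return nil, nil
	}
	cert, err := tls.LoadX509KeyPair(g.Server.CertFile, g.Server.KeyFile)
	if err != nil {
		return nil, fmt.Errorf("loading server certificate: %w", err)
	}
	return &tls.Config{Certificates: []tls.Certificate{cert}}, nil
}

func main() {
	// Only client root CAs are set, as in a client-only TLS config.
	clientOnly := groupTLS{Client: &clientTLS{RootCAFiles: []string{"ca.pem"}}}
	cfg, err := serverConfig(clientOnly)
	fmt.Println(cfg == nil, err) // prints: true <nil>
}
```

Calling `unsafeCertFile(clientOnly)` instead would panic with the same "invalid memory address or nil pointer dereference" message seen in the trace.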

Hi Yux,
Yesterday I took the master code.
First error - can you please share the PR (the code that fixes it)?
Second error -
I have not updated the TLS config.

datastores:
  gcdb-default:
    sql:
      user: "##GCDB_USER##"
      password: ""
      pluginName: "postgres"
      databaseName: "##GCDB_DATABASE_NAME##"
      connectAddr: "##GCDB_CONNECT_ADDR##"
      connectProtocol: "tcp"
      tls:
        enabled: true
        caFile: "##CA_CERT_FILE##"
        enableHostVerification: false
  gcdb-visibility:
    sql:
      user: "##GCDB_USER##"
      password: ""
      pluginName: "postgres"
      databaseName: "##GCDB_VISIBILITY_DATABASE_NAME##"
      connectAddr: "##GCDB_CONNECT_ADDR##"
      connectProtocol: "tcp"
      tls:
        enabled: true
        caFile: "##CA_CERT_FILE##"
        enableHostVerification: false
tls:
  frontend:
    client:
      rootCaFiles:
        - "/etc/temporal/certs/entrust.cer"

file location:

bash-4.2$ ls -ltr /etc/temporal/certs/entrust.cer
lrwxrwxrwx. 1 root root 18 Feb 2 01:30 /etc/temporal/certs/entrust.cer → …data/entrust.cer

logs:

{"level":"info","ts":"2022-02-02T01:34:18.107Z","msg":"loading CA certs from","tls-cert-files":["/etc/temporal/certs/entrust.cer"],"logging-call-at":"localStoreCertProvider.go:462"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1ce6034]
goroutine 1 [running]:
go.temporal.io/server/common/rpc/encryption.newServerTLSConfig(0x292f638, 0xc0004ba380, 0x291b970, 0xc000504350, 0xc000ab80d0, 0x292f3f8, 0xc0000ba2f0, 0x8, 0x2540be400, 0x0)
/temporal/common/rpc/encryption/localStoreTlsProvider.go:255 +0x134

Do you have any cert under the server section? I just created an issue for this: Invalid TLS config may crash service during service start · Issue #2448 · temporalio/temporal · GitHub

We don’t have a server section.
temporal-service v1.12.0 works with the client rootCaFiles as shown above.
@yux - It seems the TLS code changed in v1.13.0; the same issue reproduces in v1.13.0.

@SergeyBykov do you have a timeline for fixing this issue, or a beta release if it is already fixed?

@jaffarsadik I submitted Treat enablement of TLS separately for server and client config by sergeybykov · Pull Request #2501 · temporalio/temporal · GitHub that I believe should fix this.

Hi @SergeyBykov

I still see the same issue after configuring forceTLS: true.
Are there any code changes for this in v1.14.4?

tls:
  frontend:
    client:
      forceTLS: true
      rootCaFiles:
        - "##CA_CERT_FILE##"

@jaffarsadik The PR I submitted yesterday is still open.

I applied the code from your PR to my source tree, updated my development.yaml file, and started the server. The frontend no longer has any issue, but the worker pod keeps crashing with the error below.

Note: I would recommend updating the documentation for the new fields introduced in the YAML config.

@SergeyBykov - I am also trying to deploy v1.13.0 and see the worker pods crashing with this error.

13:31:59 Loading config files=[config/base.yaml config/development.yaml]
[Fx] SUPPLY *resource.BootstrapParams
[Fx] SUPPLY chan struct {}
[Fx] PROVIDE *client.factoryImpl <= go.temporal.io/server/common/persistence/client.NewFactoryImplProvider()
[Fx] PROVIDE client.Factory <= go.temporal.io/server/common/persistence/client.BindFactory()
[Fx] PROVIDE resource.SnTaggedLogger <= go.temporal.io/server/common/resource.SnTaggedLoggerProvider()
[Fx] PROVIDE resource.ThrottledLogger <= go.temporal.io/server/common/resource.ThrottledLoggerProvider()
[Fx] PROVIDE *config.Persistence <= go.temporal.io/server/common/resource.PersistenceConfigProvider()
[Fx] PROVIDE tally.Scope <= go.temporal.io/server/common/resource.MetricsScopeProvider()
[Fx] PROVIDE resource.HostName <= go.temporal.io/server/common/resource.HostNameProvider()
[Fx] PROVIDE resource.ServiceName <= go.temporal.io/server/common/resource.ServiceNameProvider()
[Fx] PROVIDE cluster.Metadata <= go.temporal.io/server/common/resource.ClusterMetadataProvider()
[Fx] PROVIDE *config.ClusterMetadata <= go.temporal.io/server/common/resource.ClusterMetadataConfigProvider()
[Fx] PROVIDE clock.TimeSource <= go.temporal.io/server/common/resource.TimeSourceProvider()
[Fx] PROVIDE persistence.ClusterMetadataManager <= go.temporal.io/server/common/resource.ClusterMetadataManagerProvider()
[Fx] PROVIDE resolver.ServiceResolver <= go.temporal.io/server/common/resource.PersistenceServiceResolverProvider()
[Fx] PROVIDE client.AbstractDataStoreFactory <= go.temporal.io/server/common/resource.AbstractDatastoreFactoryProvider()
[Fx] PROVIDE client.ClusterName <= go.temporal.io/server/common/resource.ClusterNameProvider()
[Fx] PROVIDE metrics.Client <= go.temporal.io/server/common/resource.MetricsClientProvider()
[Fx] PROVIDE searchattribute.Provider <= go.temporal.io/server/common/resource.SearchAttributeProviderProvider()
[Fx] PROVIDE searchattribute.Manager <= go.temporal.io/server/common/resource.SearchAttributeManagerProvider()
[Fx] PROVIDE searchattribute.Mapper <= go.temporal.io/server/common/resource.SearchAttributeMapperProvider()
[Fx] PROVIDE persistence.MetadataManager <= go.temporal.io/server/common/resource.MetadataManagerProvider()
[Fx] PROVIDE namespace.Registry <= go.temporal.io/server/common/resource.NamespaceCacheProvider()
[Fx] PROVIDE serialization.Serializer <= go.temporal.io/server/common/persistence/serialization.NewSerializer()
[Fx] PROVIDE archiver.ArchivalMetadata <= go.temporal.io/server/common/resource.ArchivalMetadataProvider()
[Fx] PROVIDE provider.ArchiverProvider <= go.temporal.io/server/common/resource.ArchiverProviderProvider()
[Fx] PROVIDE *archiver.HistoryBootstrapContainer <= go.temporal.io/server/common/resource.HistoryBootstrapContainerProvider()
[Fx] PROVIDE *archiver.VisibilityBootstrapContainer <= go.temporal.io/server/common/resource.VisibilityBootstrapContainerProvider()
[Fx] PROVIDE client.Bean <= go.temporal.io/server/common/resource.PersistenceBeanProvider()
[Fx] PROVIDE resource.MembershipMonitorFactory <= go.temporal.io/server/common/resource.MembershipFactoryProvider()
[Fx] PROVIDE membership.Monitor <= go.temporal.io/server/common/resource.MembershipMonitorProvider()
[Fx] PROVIDE client.FactoryProvider <= go.temporal.io/server/common/resource.ClientFactoryProvider()
[Fx] PROVIDE client.Bean <= go.temporal.io/server/common/resource.ClientBeanProvider()
[Fx] PROVIDE client.Client <= go.temporal.io/server/common/resource.SdkClientProvider()
[Fx] PROVIDE workflowservice.WorkflowServiceClient <= go.temporal.io/server/common/resource.FrontedClientProvider()
[Fx] PROVIDE *client.FaultInjectionDataStoreFactory <= go.temporal.io/server/common/resource.PersistenceFaultInjectionFactoryProvider()
[Fx] PROVIDE net.Listener <= go.temporal.io/server/common/resource.GrpcListenerProvider()
[Fx] PROVIDE resource.InstanceID <= go.temporal.io/server/common/resource.InstanceIDProvider()
[Fx] PROVIDE *tchannel.Channel <= go.temporal.io/server/common/resource.RingpopChannelProvider()
[Fx] PROVIDE *metrics.RuntimeMetricsReporter <= go.temporal.io/server/common/resource.RuntimeMetricsReporterProvider()
[Fx] PROVIDE resource.Resource <= go.temporal.io/server/common/resource.NewFromDI()
[Fx] PROVIDE log.Logger <= go.temporal.io/server/service/worker.ParamsExpandProvider()
[Fx] PROVIDE client.Client <= go.temporal.io/server/service/worker.ParamsExpandProvider()
[Fx] PROVIDE common.RPCFactory <= go.temporal.io/server/service/worker.ParamsExpandProvider()
[Fx] PROVIDE dynamicconfig.Client <= go.temporal.io/server/service/worker.ParamsExpandProvider()
[Fx] PROVIDE *dynamicconfig.Collection <= go.temporal.io/server/common/dynamicconfig.NewCollection()
[Fx] PROVIDE resource.ThrottledLoggerRpsFn <= go.temporal.io/server/service/worker.ThrottledLoggerRpsFnProvider()
[Fx] PROVIDE *worker.Config <= go.temporal.io/server/service/worker.NewConfig()
[Fx] PROVIDE client.PersistenceMaxQps <= go.temporal.io/server/service/worker.PersistenceMaxQpsProvider()
[Fx] PROVIDE *worker.Service <= go.temporal.io/server/service/worker.NewService()
[Fx] PROVIDE fx.Lifecycle <= go.uber.org/fx.New.func1()
[Fx] PROVIDE fx.Shutdowner <= go.uber.org/fx.(*App).shutdowner-fm()
[Fx] PROVIDE fx.DotGraph <= go.uber.org/fx.(*App).dotGraph-fm()
[Fx] INVOKE go.temporal.io/server/common/resource.RegisterBootstrapContainer()
[Fx] INVOKE go.temporal.io/server/service/worker.ServiceLifetimeHooks()
[Fx] HOOK OnStart go.temporal.io/server/service/worker.ServiceLifetimeHooks.func1() executing (caller: go.temporal.io/server/service/worker.ServiceLifetimeHooks)
[Fx] HOOK OnStart go.temporal.io/server/service/worker.ServiceLifetimeHooks.func1() called by go.temporal.io/server/service/worker.ServiceLifetimeHooks ran successfully in 9.677µs
[Fx] RUNNING
{"level":"fatal","ts":"2022-02-16T13:32:11.561Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:233","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:150\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:233\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:153\ngo.temporal.io/server/service/worker.ServiceLifetimeHooks.func1.1\n\t/temporal/service/worker/fx.go:80"}

Version: v1.14.4

{"level":"fatal","ts":"2022-02-16T10:00:05.242Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:233","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:150\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:233\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:153\ngo.temporal.io/server/service/worker.ServiceLifetimeHooks.func1.1\n\t/temporal/service/worker/fx.go:80"}

@jaffarsadik Do you use SystemWorker to configure the worker role’s TLS?

I don’t have such a configuration.

global:
  tls:
    frontend:
      client:
        rootCaFiles:
          - "##CA_CERT_FILE##"

Looks like we are still missing docs for SystemWorker. I’ll check on that. It’s a separate setting for explicitly configuring workers independently from the frontend configuration settings. There’s an example of using it here.

@SergeyBykov - I have configured the TLS settings and am getting the same error.
All 4 Temporal services are running on the same cluster and namespace.

tls:
  frontend:
    client:
      forceTLS: true
      rootCaFiles:
        - "##CA_CERT_FILE##"
  systemWorker:
    certFile: /etc/temporal/certs/dev/cluster-internode.pem
    keyFile: /etc/temporal/certs/dev/cluster-internode.key
    client:
      serverName: "frontend:7233"
      rootCaFiles:
        - "##CA_CERT_FILE##"

Error:

{"level":"fatal","ts":"2022-02-23T16:41:16.136Z","msg":"error starting scanner","service":"worker","error":"context deadline exceeded","logging-call-at":"service.go:432","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Fatal\n\t/temporal/common/log/zap_logger.go:150\ngo.temporal.io/server/service/worker.(*Service).startScanner\n\t/temporal/service/worker/service.go:432\ngo.temporal.io/server/service/worker.(*Service).Start\n\t/temporal/service/worker/service.go:340\ngo.temporal.io/server/service/worker.ServiceLifetimeHooks.func1.1\n\t/temporal/service/worker/fx.go:79"}

@SergeyBykov - any updates on the above?

@maxim - can you please help me here? I am stuck on this.

@jaffarsadik can you pull this PR to see if it will resolve the issue?

@jaffarsadik this PR has already been merged and will be part of the 1.16 release. It would help if you could confirm that it solves your problem. Thanks.