Error while fetching cluster metadata

We are trying to deploy Temporal using the Helm chart, with MySQL as its persistence store. We’ve successfully deployed the chart both in development and on an AWS cluster, but on an Azure cluster we are seeing the following error from all services, sending every pod into CrashLoopBackOff.

{"level":"warn","ts":"2022-08-22T23:36:08.421Z","msg":"Failed to save cluster metadata.","component":"metadata-initializer","error":"proto: ClusterMetadata: wiretype end group for non-group","cluster-name":"active","logging-call-at":"fx.go:628"}

Unable to start server. Error: could not build arguments for function "go.temporal.io/server/temporal".ServerLifetimeHooks (/home/builder/temporal/temporal/fx.go:738): failed to build temporal.Server: could not build arguments for function "go.temporal.io/server/temporal".glob..func1 (/home/builder/temporal/temporal/server_impl.go:65): failed to build *temporal.ServerImpl: could not build arguments for function "go.temporal.io/server/temporal".NewServerFxImpl (/home/builder/temporal/temporal/server_impl.go:69): could not build value group *temporal.ServicesMetadata[group="services"]: could not build arguments for function "go.temporal.io/server/temporal".HistoryServiceProvider (/home/builder/temporal/temporal/fx.go:334): failed to build config.Persistence: received non-nil error from function "go.temporal.io/server/temporal".ApplyClusterMetadataConfigProvider (/home/builder/temporal/temporal/fx.go:563): error while fetching cluster metadata: proto: ClusterMetadata: wiretype end group for non-group

Given that the same Helm chart works in one Kubernetes environment but not another, I’m guessing there must be some configuration issue, though it’s hard to tell from the error which configuration that might be.

Are you deploying the same server version, and if so, which one is it?
With the 1.14 server release, cluster metadata was moved and is now loaded from dynamic config rather than static config; I’m wondering if that could be the case here.

Yes, the same server is deployed in all environments. We are using the latest Helm chart, 1.17.4.

Can you share some details on how the dynamic config for cluster metadata works? We are not using multi-cluster replication. I see that there’s a cluster_metadata table, but it’s empty in all of our environments.

I spent a bit more time debugging this issue and managed to reproduce it in a local environment by doing the following:

  1. Exported the temporal and temporal_visibility DBs from the problematic cluster and imported them into a local DB instance.
  2. Deployed the same Temporal Helm chart to a local kube cluster and pointed it at the copied databases (a byte-for-byte check of the copied blob is sketched below).
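
In case it helps anyone reproduce this, a quick way to confirm the import carried the stored cluster metadata blob over byte-for-byte is a small Go sketch along these lines. The DSNs are placeholders, and the cluster_metadata_info table and column names are taken from the v1.17 MySQL schema, so treat those as assumptions:

    package main

    import (
        "bytes"
        "database/sql"
        "fmt"
        "log"

        _ "github.com/go-sql-driver/mysql"
    )

    // readBlob fetches the serialized cluster metadata blob for one cluster name.
    func readBlob(dsn, clusterName string) []byte {
        db, err := sql.Open("mysql", dsn)
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        var data []byte
        err = db.QueryRow(
            "SELECT data FROM cluster_metadata_info WHERE cluster_name = ?",
            clusterName,
        ).Scan(&data)
        if err != nil {
            log.Fatal(err)
        }
        return data
    }

    func main() {
        // Placeholder DSNs: the problematic source DB and the local copy.
        src := readBlob("user:pass@tcp(azure-mysql-host:3306)/temporal", "active")
        cpy := readBlob("user:pass@tcp(127.0.0.1:3306)/temporal", "active")
        fmt.Printf("source: %d bytes, copy: %d bytes, identical: %v\n",
            len(src), len(cpy), bytes.Equal(src, cpy))
    }

If the bytes match, the corruption traveled with the data rather than being introduced by the export/import step.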

My understanding from the error is that the cluster metadata cannot be unmarshaled properly and is possibly corrupted. I was able to resolve the issue in my local environment by deleting the single row in temporal.cluster_metadata_info. The row was recreated by the application and the server started normally. I tried doing the same on the Azure cluster, but the issue persisted: the row was recreated, yet the data still appears to be corrupted.
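
To make “corrupted” concrete, here is a minimal sketch of the decode check: it reads the row and attempts the same proto3 unmarshal the server performs. It assumes the v1.17 MySQL schema and the server’s gogo-generated ClusterMetadata type from go.temporal.io/server/api/persistence/v1 (which carries its own Unmarshal method); the DSN and cluster name are placeholders:

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/go-sql-driver/mysql"
        persistencespb "go.temporal.io/server/api/persistence/v1"
    )

    func main() {
        // Placeholder DSN pointing at the copied temporal database.
        db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/temporal")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        var data []byte
        var encoding string
        err = db.QueryRow(
            "SELECT data, data_encoding FROM cluster_metadata_info WHERE cluster_name = ?",
            "active",
        ).Scan(&data, &encoding)
        if err != nil {
            log.Fatal(err)
        }

        // A hex dump of the first bytes makes gross corruption (e.g. charset
        // or encoding mangling during storage) easy to spot by eye.
        n := len(data)
        if n > 16 {
            n = 16
        }
        fmt.Printf("encoding=%s len=%d prefix=% x\n", encoding, len(data), data[:n])

        // Decode the blob the way the server does; on a corrupt row this fails
        // with the same "wiretype end group for non-group" error as the logs.
        var md persistencespb.ClusterMetadata
        if err := md.Unmarshal(data); err != nil {
            log.Fatalf("blob does not decode as ClusterMetadata: %v", err)
        }
        fmt.Printf("decoded cluster metadata: %+v\n", &md)
    }

Running the decode outside the server makes it easier to tell whether the bad bytes are in the row itself or are introduced somewhere in the read path (driver, connection charset, and so on).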

Any ideas on what might be causing the cluster metadata to be corrupted?