Segmentation violation during temporal upgrades

Hello, we’re trying to upgrade our Temporal services to the latest version.
We started at 1.0 and everything went fine up to 1.5.7.

At that point we could not upgrade with Helm anymore. The containers are stuck in a crash loop due to these panics:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17f760a]

goroutine 1 [running]:
go.temporal.io/server/common/persistence.(*clusterMetadataManagerImpl).GetClusterMetadata(0xc0000d2390, 0xc000687280, 0x1, 0xc0006fcc20)
	/temporal/common/persistence/clusterMetadataStore.go:115 +0xaa
go.temporal.io/server/common/persistence.(*clusterMetadataManagerImpl).SaveClusterMetadata(0xc0000d2390, 0xc000687280, 0x217a8c0, 0x373439332d383401, 0xc000687280)
	/temporal/common/persistence/clusterMetadataStore.go:123 +0x8d
go.temporal.io/server/common/persistence.(*clusterMetadataRateLimitedPersistenceClient).SaveClusterMetadata(0xc0000d23c0, 0xc000687280, 0xc0000d23c0, 0x0, 0x0)
	/temporal/common/persistence/persistenceRateLimitedClients.go:1008 +0x5f
go.temporal.io/server/temporal.(*Server).immutableClusterMetadataInitialization(0xc000074840, 0xc00008a510, 0x0, 0x0)
	/temporal/temporal/server.go:408 +0x339
go.temporal.io/server/temporal.(*Server).Start(0xc000074840, 0xc000074840, 0x5)
	/temporal/temporal/server.go:138 +0x805
main.buildCLI.func2(0xc000686980, 0x0, 0x0)
	/temporal/cmd/server/main.go:139 +0x5f7
github.com/urfave/cli/v2.(*Command).Run(0xc000228b40, 0xc000686780, 0x0, 0x0)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/command.go:163 +0x4ed
github.com/urfave/cli/v2.(*App).RunContext(0xc0001111e0, 0x28303a0, 0xc000124010, 0xc000130000, 0x7, 0x7, 0x0, 0x0)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:313 +0x81f
github.com/urfave/cli/v2.(*App).Run(...)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:224
main.main()
	/temporal/cmd/server/main.go:47 +0x66

The panics are the same on the services I checked, matching and frontend.

We have tried upgrading to both 1.6.5 and 1.7.1, but both share this issue.
Do you have any pointers as to what could be wrong?

seems that there is no cluster metadata record in the DB

did you run the server with version <= 1.5.x for some time?
the cluster metadata record should be initialized once the server has been running for a while

The version deployed at the time was 1.5.7, and before that 1.4.4; both were working fine with workflows running.

can you try

select * from cluster_metadata;

Unfortunately we had to drop the database to restore functionality on the test environment since we were not able to roll back from our 1.7.1 attempt.

On a clean database we were able to upgrade to 1.6.5.
We were rehearsing the upgrade for our production cluster which is still 1.0 (like this test one was).

Is there something we should look for in the query when we get there on production? Or do you have any idea how to prevent this from happening (e.g. running the previous version longer or inserting some metadata if it’s not there)?

internally, we have CI/CD pipelines tracking the compatibility of any 2 consecutive minor versions.

i guess the setup you have (the one which failed) did not run long enough, meaning
you did not wait long enough for 1.n.x to write a record to the cluster metadata table before moving to 1.n+1.x

specifically, there was a migration of the cluster metadata table starting in either 1.1.x or 1.2.x, which uses 2 new fields, data and data_encoding, instead of the old ones. the server will auto-backfill the data and data_encoding fields if it can run long enough.

i would suggest first upgrading from 1.0.x to 1.3.x, then waiting, then 1.4.x, 1.5.x, 1.6.x, 1.7.x, 1.8.x
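
for example, one hop could look roughly like this (a sketch, not exact commands; it assumes the public Temporal Helm chart, where the server version is set via server.image.tag, and a MySQL persistence store with a database named temporal; adapt it to your chart values and database):

# example hop: 1.0.x -> 1.3.x (the version numbers are illustrative)
helm upgrade temporal . --reuse-values --set server.image.tag=1.3.2

# before moving to the next hop, confirm the server has backfilled the new columns
mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" \
  -e "select data, data_encoding from cluster_metadata;" temporal
# only continue to 1.4.x once this returns a non-empty, non-NULL row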

Do you have any suggestions for what is “long enough”? Is it deterministic?
We had a number of hours between the versions that succeeded, and then only ~10-15 minutes, with new workflows running, between the versions that failed.

which version are you currently on?

select data, data_encoding from cluster_metadata;

does ^ return anything?

Our test setup (which we upgraded to 1.6.5 after dropping the database, as described before) returns

metadata_partition,immutable_data,immutable_data_encoding,data,data_encoding,version
0,0x0A066163746976651080041A2436346639613036302D336132322D346432622D383430382D323264373461303931376330,Proto3,0x0A066163746976651080041A2436346639613036302D336132322D346432622D383430382D323264373461303931376330,Proto3,1

It has been on the 1.6.5 version for the past 5 days.

Our production server, which is still on 1.0.0 (aiming for the latest 1.8.1 if we manage to learn how to prevent the issue we encountered on test), does not have these columns.

metadata_partition,immutable_data,immutable_data_encoding
0,0x0A066163746976651004,Proto3

Can we somehow verify the next upgrade will not cause the segmentation violation based on what is at the time in this table?

We have attempted the upgrade in our production environment from version 1.5.7 to 1.6.5 and got the same panic.
There were no DB migrations in this step, and the data in the cluster_metadata table still matches what I wrote before, no change since 1.0.0.
At the moment we cannot proceed with updating.

I am not sure if this is relevant, but now when I check the metadata with tctl I only get a few fields; before, I am sure it returned addresses and other information as well…

bash-5.0# tctl admin cluster metadata
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0"
  },
  "serverVersion": "1.5.7",
  "clusterName": "active",
  "historyShardCount": 4
}

UPDATE:

the data in the cluster_metadata table still matches what I wrote before, no change since 1.0.0.

you need to wait for the DB to propagate the data, meaning

select data, data_encoding from cluster_metadata;

in your cluster should return something before upgrading.
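
for example (a rough sketch, assuming a MySQL persistence store and a database named temporal; use the equivalent check for your store):

# wait until the backfilled row shows up before starting the next upgrade
until mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" -N \
  -e "select data_encoding from cluster_metadata where data is not null;" temporal | grep -q .; do
  echo "cluster_metadata not backfilled yet, waiting..."
  sleep 30
done
echo "data and data_encoding are populated, safe to upgrade"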


about your test setup

Our test setup, which we upgraded to 1.6.5

does this successfully upgrade to 1.7.x?


@m_p
the information shown here looks really weird, can you join our public slack so we can talk more easily?

Talked offline with customer

Supplied configuration key/value mismatches persisted ImmutableClusterMetadata. Continuing with the persisted value as this value cannot be changed once initialized.

the issue was caused by changing the number of history service shards, which makes the server-side auto-migration of DB data (which includes the number of history shards) fail

the solution is to make sure the number of history shards stays immutable (not changed compared to what the DB says) and let the server do the rest
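
for reference, with the public Helm chart that roughly means keeping the shard count pinned to the value the cluster was created with (4 in this thread, per the tctl output above). a sketch, assuming the chart exposes server.config.numHistoryShards and server.image.tag; check your own values file:

# numHistoryShards must match what the cluster was first initialized with (4 here,
# as reported by tctl admin cluster metadata); changing it makes the supplied config
# mismatch the persisted ImmutableClusterMetadata and the auto-migration fails
helm upgrade temporal . --reuse-values \
  --set server.config.numHistoryShards=4 \
  --set server.image.tag=1.6.5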