Segmentation violation during temporal upgrades

Hello, we’re trying to upgrade our Temporal services to the latest version.
We started at 1.0 and everything went fine up to 1.5.7.

At that point we could not upgrade with Helm anymore. The containers are stuck in a crash loop due to these panics:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17f760a]

goroutine 1 [running]:
go.temporal.io/server/common/persistence.(*clusterMetadataManagerImpl).GetClusterMetadata(0xc0000d2390, 0xc000687280, 0x1, 0xc0006fcc20)
	/temporal/common/persistence/clusterMetadataStore.go:115 +0xaa
go.temporal.io/server/common/persistence.(*clusterMetadataManagerImpl).SaveClusterMetadata(0xc0000d2390, 0xc000687280, 0x217a8c0, 0x373439332d383401, 0xc000687280)
	/temporal/common/persistence/clusterMetadataStore.go:123 +0x8d
go.temporal.io/server/common/persistence.(*clusterMetadataRateLimitedPersistenceClient).SaveClusterMetadata(0xc0000d23c0, 0xc000687280, 0xc0000d23c0, 0x0, 0x0)
	/temporal/common/persistence/persistenceRateLimitedClients.go:1008 +0x5f
go.temporal.io/server/temporal.(*Server).immutableClusterMetadataInitialization(0xc000074840, 0xc00008a510, 0x0, 0x0)
	/temporal/temporal/server.go:408 +0x339
go.temporal.io/server/temporal.(*Server).Start(0xc000074840, 0xc000074840, 0x5)
	/temporal/temporal/server.go:138 +0x805
main.buildCLI.func2(0xc000686980, 0x0, 0x0)
	/temporal/cmd/server/main.go:139 +0x5f7
github.com/urfave/cli/v2.(*Command).Run(0xc000228b40, 0xc000686780, 0x0, 0x0)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/command.go:163 +0x4ed
github.com/urfave/cli/v2.(*App).RunContext(0xc0001111e0, 0x28303a0, 0xc000124010, 0xc000130000, 0x7, 0x7, 0x0, 0x0)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:313 +0x81f
github.com/urfave/cli/v2.(*App).Run(...)
	/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:224
main.main()
	/temporal/cmd/server/main.go:47 +0x66

The panics are the same on the services I checked, matching and frontend.

We have tried upgrading to both 1.6.5 and 1.7.1, but both share this issue.
Do you have any pointers as to what could be wrong?

seems that there is no cluster metadata record in the DB

did you run the server with version <= 1.5.x for some time?
the cluster metadata record should be initialized once the server has been running for a while

The version deployed at the time was 1.5.7, and before that 1.4.4; both were working fine with workflows running.

can you try

select * from cluster_metadata;

Unfortunately we had to drop the database to restore functionality on the test environment since we were not able to roll back from our 1.7.1 attempt.

On a clean database we were able to upgrade to 1.6.5.
We were rehearsing the upgrade for our production cluster which is still 1.0 (like this test one was).

Is there something we should look for in the query when we get there on production? Or do you have any idea how to prevent this from happening (e.g. running the previous version longer or inserting some metadata if it’s not there)?

internally, we have CI/CD pipelines tracking the compatibility of any 2 consecutive minor versions.

i guess the setup you have (the one which failed) did not run long enough, meaning
you did not wait long enough for 1.n.x to write a record to the cluster metadata table before moving to 1.n+1.x

specifically, there was a migration of the cluster metadata table starting in either 1.1.x or 1.2.x, which uses 2 new fields, data and data_encoding, instead of the old ones. the server will auto-backfill the data and data_encoding fields if it can run long enough.

i would suggest first upgrading from 1.0.x to 1.3.x, then waiting, then 1.4.x, 1.5.x, 1.6.x, 1.7.x, 1.8.x
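
for example, one hop could look roughly like this (a sketch, not exact commands; it assumes the public Temporal Helm chart, where the server version is set via server.image.tag, and a MySQL persistence store with a database named temporal; adapt it to your chart values and database):

# example hop: 1.0.x -> 1.3.x (the version numbers are illustrative)
helm upgrade temporal . --reuse-values --set server.image.tag=1.3.2

# before moving to the next hop, confirm the server has backfilled the new columns
mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" \
  -e "select data, data_encoding from cluster_metadata;" temporal
# only continue to 1.4.x once this returns a non-empty, non-NULL row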

Do you have any suggestions for what is “long enough”? Is it deterministic?
We had a number of hours between the versions that succeeded, and then only ~10-15 minutes, with new workflows running, between the versions that failed.

which version are you currently on?

select data, data_encoding from cluster_metadata;

does ^ return anything?

Our test setup (which we upgraded to 1.6.5 after dropping the database, as described before) returns

metadata_partition,immutable_data,immutable_data_encoding,data,data_encoding,version
0,0x0A066163746976651080041A2436346639613036302D336132322D346432622D383430382D323264373461303931376330,Proto3,0x0A066163746976651080041A2436346639613036302D336132322D346432622D383430382D323264373461303931376330,Proto3,1

It has been on the 1.6.5 version for the past 5 days.

Our production server, which is still on 1.0.0 (aiming for the latest 1.8.1 if we manage to learn how to prevent the issue we encountered on test), does not have these columns.

metadata_partition,immutable_data,immutable_data_encoding
0,0x0A066163746976651004,Proto3

Can we somehow verify the next upgrade will not cause the segmentation violation based on what is at the time in this table?

We have attempted the upgrade in our production environment from version 1.5.7 to 1.6.5 and got the same panic.
There were no DB migrations in this step, and the data in the cluster_metadata table still matches what I wrote before, no change since 1.0.0.
At the moment we cannot proceed with updating.

I am not sure if this is relevant, but now when I check the metadata with tctl I only get a few fields; before, I am sure it returned addresses and other information as well…

bash-5.0# tctl admin cluster metadata
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0"
  },
  "serverVersion": "1.5.7",
  "clusterName": "active",
  "historyShardCount": 4
}

UPDATE:

the data in the cluster_metadata table still matches what I wrote before, no change since 1.0.0.

you need to wait for the DB to propagate the data, meaning

select data, data_encoding from cluster_metadata;

in your cluster should return something before upgrading.
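
for example (a rough sketch, assuming a MySQL persistence store and a database named temporal; use the equivalent check for your store):

# wait until the backfilled row shows up before starting the next upgrade
until mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" -N \
  -e "select data_encoding from cluster_metadata where data is not null;" temporal | grep -q .; do
  echo "cluster_metadata not backfilled yet, waiting..."
  sleep 30
done
echo "data and data_encoding are populated, safe to upgrade"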


about your test setup

Our test setup, which we upgraded to 1.6.5

does this successfully upgrade to 1.7.x?


@m_p
the information shown here looks really weird, can you join our public slack so we can talk more easily?

Talked offline with customer

Supplied configuration key/value mismatches persisted ImmutableClusterMetadata. Continuing with the persisted value as this value cannot be changed once initialized.

the issue was caused by changing the number of history service shards, which makes the server-side auto-migration of DB data (which includes the number of history shards) fail

the solution is to make sure the number of history shards stays immutable (not changed compared to what the DB says) and let the server do the rest
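
for reference, with the public Helm chart that roughly means keeping the shard count pinned to the value the cluster was created with (4 in this thread, per the tctl output above). a sketch, assuming the chart exposes server.config.numHistoryShards and server.image.tag; check your own values file:

# numHistoryShards must match what the cluster was first initialized with (4 here,
# as reported by tctl admin cluster metadata); changing it makes the supplied config
# mismatch the persisted ImmutableClusterMetadata and the auto-migration fails
helm upgrade temporal . --reuse-values \
  --set server.config.numHistoryShards=4 \
  --set server.image.tag=1.6.5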