Unfortunately, we had to drop the database to restore functionality in the test environment, since we were not able to roll back from our 1.7.1 attempt.
On a clean database we were able to upgrade to 1.6.5.
We were rehearsing the upgrade for our production cluster, which is still on 1.0 (as this test one was).
Is there something we should look for in the query when we get to that point on production? Or do you have any idea how to prevent this from happening (e.g. running the previous version longer, or inserting some metadata if it’s not there)?
Internally, we have CI/CD pipelines tracking the compatibility of any two consecutive minor versions.
I guess the setup you have (the one that failed) did not run long enough, meaning you did not wait long enough for 1.n.x to write a record to the cluster metadata table before moving to 1.n+1.x.
Specifically, there was a migration of the cluster metadata table starting in either 1.1.x or 1.2.x, which uses two new fields, data and data_encoding, instead of the old ones. The server will auto-backfill these two fields if it can run long enough (there is a quick check sketched below).
I would suggest upgrading first from 1.0.x to 1.3.x, then waiting, then going through 1.4.x, 1.5.x, 1.6.x, 1.7.x, and 1.8.x.
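If you want to verify rather than just wait, something along these lines should tell you whether the backfill has happened before each hop. This is only a sketch assuming the MySQL/PostgreSQL persistence schema (the schema update for that version must already be applied, and exact column names can differ between schema versions):

```sql
-- Sketch: check whether the running 1.n.x server has backfilled the new fields
-- before moving on to 1.n+1.x. Assumes the table is named cluster_metadata and
-- the new columns are data / data_encoding.
SELECT metadata_partition,
       data_encoding,               -- should be non-empty (e.g. 'Proto3') once backfilled
       LENGTH(data) AS data_bytes   -- should be > 0 once backfilled
FROM cluster_metadata;
```

Only move on to the next version once every row shows a non-empty data blob.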
Do you have any suggestions for what “long enough” means? Is it deterministic?
We had a number of hours between the versions that succeeded, and then ~10-15 minutes (with new workflows running during that time) between the versions that failed.
It has been on version 1.6.5 for the past 5 days.
Our production server, which is still on 1.0.0 (we are aiming for the latest 1.8.1 if we manage to learn how to prevent the issue we encountered on test), does not have these columns.
We have attempted the upgrade on our production cluster from version 1.5.7 to 1.6.5 and got the same panic.
There were no DB migrations in this step, and the data in the cluster_metadata table still matches what I wrote before; no change since 1.0.0.
At the moment we cannot proceed with updating.
I am not sure if this is relevant, but now when I check the metadata with tctl I only get a few fields; before, I am sure it returned addresses and other information as well…
Supplied configuration key/value mismatches persisted ImmutableClusterMetadata. Continuing with the persisted value as this value cannot be changed once initialized.
The issue is caused by changing the number of history service shards, which makes the server-side auto-migration of the DB data (which includes the number of history shards) fail.
The solution is to make sure the number of history shards stays immutable (unchanged compared to what the DB says) and let the server do the rest.
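Before retrying, you can double-check the persisted record. The shard count is serialized inside the cluster metadata blob, so plain SQL can only confirm the record itself is there and untouched; the decoded value is what the warning above is comparing against. Again just a sketch assuming the MySQL/PostgreSQL schema:

```sql
-- Sketch: pull the originally persisted cluster metadata record. The history
-- shard count is encoded inside the blob, so it is not directly readable here,
-- but the row should exist and should never be rewritten by hand.
SELECT metadata_partition,
       immutable_data_encoding,               -- columns present since 1.0.x
       LENGTH(immutable_data) AS blob_bytes
FROM cluster_metadata;
```

Then make sure persistence.numHistoryShards in the static config is set back to the value the cluster was originally bootstrapped with before starting the new version; that value must never change across upgrades.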