Continuous visibility error spam from finished workflows: "Unable to decode search attributes: invalid search attribute type: Unspecified"

Hello

We’re getting continuous error logs from a couple of hundred completed workflows after upgrading Temporal from 1.13 to 1.18.5, one minor version at a time (1.14, and so on).

    "component": "visibility-queue-processor",
    "error": "Unable to decode search attributes: invalid search attribute type: Unspecified",
    "lifecycle": "ProcessingFailed",
    ... data about the namespace, workflow and run...

We use Elasticsearch for visibility, and the workflows have custom search attributes:

>> tctl cluster get-search-attributes
Search attributes:
+----------------------------+----------+
|            NAME            |   TYPE   |
+----------------------------+----------+
...
| subscription_state         | Keyword  |
| transaction_id             | Keyword  |
| user_id                    | Keyword  |
+----------------------------+----------+

Elasticsearch data:

    ...
    "transaction_id": "nbp.ps2.disconnect-vxse2-test-00038",
    "user_id": "SYSTEM"
  },

I cannot find any error with the custom search attributes or the Elasticsearch index.

I would like to find out what is happening and why, and how I can purge these old workflows today.

I asked about this on Slack earlier as well, without getting any replies. I attached some data to that thread, which might be of interest: Slack

Can you provide more details on how you upgraded?
Each time you upgraded to the next minor version (e.g. 1.13 to 1.14), did you start the Temporal Server?
Was the upgrade from 1.17 to 1.18 the step at which you started to see this error?

Details, of course!

We upgraded one minor at a time, starting the servers and letting them run for 10-15 minutes.

Yes, the errors started occurring after the bump to 1.18 (as far as I can tell; I’m looking back at logs here).
Looking at the History node log, there were a lot of “Error updating ack level for shard”, “Error updating timer ack level for shard” and “Critical attempts processing workflow task” warnings. Perhaps not related.

We don’t have any schedules, so I ran the Elasticsearch upgrade script after updating.

We use the secondary Elasticsearch index as well, and it was not updated with the custom search attributes (adding a search attribute with tctl does not touch it), so we added those to it manually afterwards.

Just to confirm, you enabled dual writes to the secondary Elasticsearch index (dynamic config system.enableWriteToSecondaryAdvancedVisibility = true) after the upgrade, right?

If so, you have to run tctl admin cluster add-search-attributes --index <YOUR_SECONDARY_INDEX> --skip-schema-update for your custom search attributes. Temporal Server keeps a record of the custom search attributes per index, and the tctl command to add search attributes makes sure that they are registered. Since you already added them to the ES index, the --skip-schema-update flag skips the schema update on the index itself.

Thank you - adding the search attributes to the secondary index with tctl did solve this issue.

tctl admin cluster add-search-attributes \
  --index SEC_INDX --skip-schema-update \
  -n transaction_id -t Keyword

tctl admin cluster add-search-attributes \
  --index SEC_INDX --skip-schema-update \
  -n subscription_state -t Keyword

tctl admin cluster add-search-attributes \
  --index SEC_INDX --skip-schema-update \
  -n user_id -t Keyword

The errors have now stopped showing up in the logs!
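
For anyone with a longer list of attributes: the commands above differ only in the attribute name, so they could be wrapped in a small loop. This is only a sketch, assuming every attribute is of type Keyword as in this thread. The function name register_keyword_attributes and the SEC_INDEX and TCTL variables are hypothetical helpers, not part of tctl, and SEC_INDX is a placeholder index name. By default the script only prints the commands (dry run).

```shell
#!/bin/sh
# Sketch: register several Keyword search attributes on a secondary index.
# SEC_INDEX and TCTL are hypothetical variables introduced for this example.
SEC_INDEX="${SEC_INDEX:-SEC_INDX}"   # placeholder; replace with your secondary index name
TCTL="${TCTL:-echo tctl}"            # dry run by default; set TCTL=tctl to execute for real

# Hypothetical helper: issues one add-search-attributes call per attribute name,
# using only the flags shown earlier in this thread.
register_keyword_attributes() {
  for name in "$@"; do
    $TCTL admin cluster add-search-attributes \
      --index "$SEC_INDEX" --skip-schema-update \
      -n "$name" -t Keyword
  done
}

register_keyword_attributes transaction_id subscription_state user_id
```

Reviewing the printed commands first and then re-running with TCTL=tctl keeps the registration step easy to check before it touches the cluster.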

I would like to request that the Temporal Cluster deployment guide in the Temporal documentation be extended to include information on secondary ES index support.

Thank you.