Getting `GetOrCreateShard: failed to get ShardID` in temporal-history, and Deadline Exceeded

Today we got a bunch of `GetOrCreateShard: failed to get ShardID` errors in the temporal-history logs (and by “bunch”, I mean around 40 thousand of them). At the same time, we could not create new workflows and got Deadline Exceeded on the Temporal client.

We are using Temporal 1.18 with 512 shards. (That might be too many, as we have just a few workers so far since it’s a testing environment, but, eh. It was the default.)

This is exactly what is described here:

Context deadline exceeded issue - Community Support - Temporal

error.","error":"GetVisibilityTasks operation failed. Select failed.
error.","error":"UpdateShard failed. Failed to start transaction.
error.","error":"GetOrCreateShard: failed to get ShardID 177

We are now trying to figure out what is wrong.
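With tens of thousands of errors, it helped to see which message dominates. Below is a minimal sketch for tallying them, assuming the temporal-history logs are JSON lines with an `error` field (as in the excerpt above). The function name `count_errors` is my own, and numbers are replaced with `N` so that different shard IDs group together.

```python
import json
import re
from collections import Counter


def count_errors(lines):
    """Count log entries per error message, skipping lines that are not JSON objects."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. stack traces or truncated lines
        if not isinstance(entry, dict):
            continue
        error = entry.get("error")
        if not error:
            continue  # entry has no error field (info logs, etc.)
        # Use only the first line and replace digits, so "ShardID 177"
        # and "ShardID 12" are counted under the same key.
        first_line = str(error).splitlines()[0]
        counts[re.sub(r"\d+", "N", first_line)] += 1
    return counts
```

To use it on a saved log file (the file name is just an example): `count_errors(open("temporal-history.log")).most_common(10)`.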

We are using Postgres for the visibility store and Elasticsearch (ES) for advanced visibility features.

The main question I have: are those errors caused by ES or by Postgres? I cannot figure that out.

Ah, looking at the Elasticsearch logs, Elasticsearch is in some weird half-broken state.

So for anyone else looking at this: it was caused by Elasticsearch.
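For anyone who wants to check this faster than by reading ES logs, here is a rough sketch that asks Elasticsearch for its cluster health via the `_cluster/health` endpoint. The base URL is an assumption (adjust it to wherever your ES is reachable, e.g. through a port-forward), and the helper names are mine.

```python
import json
import urllib.request


def fetch_cluster_health(base_url, timeout=5.0):
    """GET <base_url>/_cluster/health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(health):
    """Return a list of human-readable problems; an empty list means nothing obvious."""
    problems = []
    status = health.get("status", "unknown")
    if status == "red":
        problems.append("status is red: at least one primary shard is unassigned")
    elif status == "yellow":
        problems.append("status is yellow: some replica shards are unassigned")
    elif status != "green":
        problems.append(f"unexpected status: {status}")
    if health.get("timed_out"):
        problems.append("the health request timed out inside the cluster")
    return problems


# Example (assumed address, not from our setup):
# print(summarize_health(fetch_cluster_health("http://localhost:9200")))
```

Note that yellow is normal on a single-node ES, since replicas have nowhere to go; red is the one that really hurts.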

Looking more into it

It was definitely caused by the Helm chart. We are using this Helm chart:

temporalio/helm-charts: Temporal Helm charts (github.com)

Which has this written there:

The only portions of the helm chart that are production ready are the parts that configure and manage Temporal Server itself—not Cassandra, Elasticsearch, Prometheus, or Grafana.

But we were using the ES created by the Temporal Helm chart. And looking at the other cases of this error, people were also using Helm charts, so it is 100% something with that. We will be moving off the ES deployed by the Temporal Helm chart.

Just replying to my own old thread.

It was probably just caused by Elasticsearch running out of memory.

We (hopefully) fixed it by giving it more memory and adjusting the replica settings.
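If you want to confirm the memory theory on your own cluster, a small sketch like the one below can flag nodes with high JVM heap usage using the `_nodes/stats/jvm` endpoint. The 85% threshold and the base URL are assumptions I picked for illustration, not values from our setup.

```python
import json
import urllib.request


def fetch_jvm_stats(base_url, timeout=5.0):
    """GET <base_url>/_nodes/stats/jvm and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats/jvm", timeout=timeout) as resp:
        return json.load(resp)


def nodes_over_heap_threshold(stats, threshold_percent=85):
    """Return (node name, heap_used_percent) pairs at or above the threshold, highest first."""
    hot = []
    for node_id, node in stats.get("nodes", {}).items():
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap is not None and heap >= threshold_percent:
            # Fall back to the node ID if the node has no name field.
            hot.append((node.get("name", node_id), heap))
    return sorted(hot, key=lambda item: item[1], reverse=True)


# Example (assumed address):
# print(nodes_over_heap_threshold(fetch_jvm_stats("http://localhost:9200")))
```

A node that sits near the top of its heap for long stretches is a good hint that ES needs more memory.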