Getting `GetOrCreateShard: failed to get ShardID` in temporal-history, and Deadline Exceeded

Today we got a bunch of `GetOrCreateShard: failed to get ShardID` errors in the temporal-history logs (and by “bunch”, I mean around 40 thousand of them), and at the same time we could not create new workflows and got Deadline Exceeded on the Temporal client.

We are using Temporal 1.18 with 512 shards. (That might be too many, as we only have a few workers so far since it’s a testing environment, but it was the default.)
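
For reference, the shard count is set in the Helm chart values, roughly like this (a minimal sketch assuming the temporalio/helm-charts layout; verify the key names against your chart version):

```yaml
# values.yaml for temporalio/helm-charts (sketch; keys may differ between chart versions)
server:
  config:
    numHistoryShards: 512   # the chart default; cannot be changed once the cluster is created
```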

It is exactly what is described here:

Context deadline exceeded issue - Community Support - Temporal

`"error":"GetVisibilityTasks operation failed. Select failed."`
`"error":"UpdateShard failed. Failed to start transaction."`
`"error":"GetOrCreateShard: failed to get ShardID 177"`

We are now trying to figure out what is wrong.

We are using Postgres for the standard visibility store and Elasticsearch for the advanced visibility features.
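
For context, the split looks roughly like this in the Temporal server persistence config (a sketch only; the hosts, credentials and index name below are placeholders, not our actual values):

```yaml
# Temporal server persistence config (sketch): standard visibility in Postgres,
# advanced visibility in Elasticsearch. All endpoints and credentials are placeholders.
persistence:
  visibilityStore: visibility              # standard visibility, backed by Postgres
  advancedVisibilityStore: es-visibility   # advanced visibility, backed by Elasticsearch
  datastores:
    visibility:
      sql:
        pluginName: "postgres"
        databaseName: "temporal_visibility"
        connectAddr: "postgres:5432"        # placeholder
        connectProtocol: "tcp"
        user: "temporal"
        password: "..."                     # placeholder
    es-visibility:
      elasticsearch:
        version: "v7"
        url:
          scheme: "http"
          host: "elasticsearch-master:9200" # placeholder
        indices:
          visibility: temporal_visibility_v1_dev
```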

The main question I have: are those errors caused by Elasticsearch or by Postgres? I cannot figure that out.

Ah, looking at the Elasticsearch logs, it is in some weird half-broken state.

So, for anyone else looking at this: it was caused by Elasticsearch.

Looking more into it

It was definitely caused by the Helm chart. We are using this Helm chart:

temporalio/helm-charts: Temporal Helm charts (github.com)

It has this written in its README:

The only portions of the helm chart that are production ready are the parts that configure and manage Temporal Server itself—not Cassandra, Elasticsearch, Prometheus, or Grafana.

But we were using the Elasticsearch created by the Temporal Helm chart. And looking at other reports of this error, people were also using the Helm chart, so it is 100% something with that. We will be moving off the Elasticsearch that the Temporal Helm chart created.

Just replying to my own old thread.

It was probably just caused by Elasticsearch running out of memory.

We (hopefully) fixed it by giving it more memory and configuring the replicas properly.

@karelbilek We are also facing the same issue. Can you please elaborate a little more on how you resolved it? We are running Temporal on self-managed clusters, and we are also using the Elasticsearch created by our Helm charts.
Thanks!

Unfortunately I don’t remember, as I don’t work there anymore.

We solved it by setting up a separate, beefier Elasticsearch cluster on Kubernetes, with more memory and more than one replica.
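
If it helps, the sizing changes were roughly along these lines in the Helm values (a sketch only, assuming the elastic/elasticsearch chart; the numbers are illustrative, not our exact settings):

```yaml
# elastic/elasticsearch chart values (illustrative sizing, not our exact settings)
replicas: 3                   # more than one ES node
minimumMasterNodes: 2
esJavaOpts: "-Xms2g -Xmx2g"   # JVM heap; keep it well below the container memory limit
resources:
  requests:
    cpu: "1000m"
    memory: "4Gi"
  limits:
    cpu: "2000m"
    memory: "4Gi"
```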

I remember that by FAR the most costly part of our whole Kubernetes setup - by memory at least, but I think by other measures too - was Temporal’s Elasticsearch. It cost more than the entire rest of the application.

(We were thinking about moving off Elasticsearch advanced visibility and onto the Postgres-based one, but we never figured that out, as it was not yet ready in May 2023.)
