Getting `GetOrCreateShard: failed to get ShardID` in temporal-history, and Deadline Exceeded

karelbilek · May 5, 2023, 2:47pm

Today we got a bunch of `GetOrCreateShard: failed to get ShardID` errors in the temporal-history logs (and by “bunch”, I mean around 40 thousand of them). At the same time, we could not create new workflows and got Deadline Exceeded on the Temporal client.

We are using Temporal 1.18 with 512 shards. (That might be too many, as we have just a few workers so far in this testing environment, but, eh. It was the default.)

Exactly what is described here

Context deadline exceeded issue - Community Support - Temporal

error.","error":"GetVisibilityTasks operation failed. Select failed.
error.","error":"UpdateShard failed. Failed to start transaction.
error.","error":"GetOrCreateShard: failed to get ShardID 177
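For anyone triaging a similar flood of errors, here is a minimal sketch of tallying them by failing operation and collecting the affected shard IDs. It assumes the history service emits one JSON object per line with an `error` field; the sample lines, the `msg` text, and the `summarize` helper are illustrative assumptions, not copied from real logs.

```python
import json
import re
from collections import Counter

# Hypothetical sample lines; real temporal-history logs carry more fields.
SAMPLE_LOGS = [
    '{"level":"error","msg":"internal error.","error":"GetVisibilityTasks operation failed. Select failed."}',
    '{"level":"error","msg":"internal error.","error":"UpdateShard failed. Failed to start transaction."}',
    '{"level":"error","msg":"internal error.","error":"GetOrCreateShard: failed to get ShardID 177"}',
    '{"level":"error","msg":"internal error.","error":"GetOrCreateShard: failed to get ShardID 42"}',
    "not a JSON line",
]

SHARD_ID = re.compile(r"ShardID (\d+)")


def summarize(lines):
    """Count errors per operation name and collect the shard IDs mentioned."""
    operations = Counter()
    shard_ids = set()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log record
        if not isinstance(record, dict):
            continue
        err = record.get("error")
        if not isinstance(err, str) or not err:
            continue
        # Treat the first token of the error text as the operation name.
        operations[re.split(r"[\s:.]", err, maxsplit=1)[0]] += 1
        match = SHARD_ID.search(err)
        if match:
            shard_ids.add(int(match.group(1)))
    return operations, sorted(shard_ids)


operations, shard_ids = summarize(SAMPLE_LOGS)
for name, count in operations.most_common():
    print(f"{name}: {count}")
print("shards affected:", shard_ids)
```

Seeing which operations dominate, and whether the failures hit a few shards or all of them, helps narrow down which backing store to look at first.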

We are now trying to figure out what is wrong.

We are using Postgres for the visibility store and Elasticsearch (ES) for advanced visibility features.

The main question I have: are those errors caused by ES or by Postgres? I cannot figure that out.

karelbilek · May 5, 2023, 3:10pm

Ah, looking at the Elasticsearch logs, the cluster is in some weird half-broken state.

So for anyone else looking at this: it was caused by Elasticsearch.

karelbilek · May 5, 2023, 8:25pm

Looking more into it:

It was definitely caused by the Helm chart. We are using this Helm chart:

temporalio/helm-charts: Temporal Helm charts (github.com)

Which has this written there:

The only portions of the helm chart that are production ready are the parts that configure and manage Temporal Server itself—not Cassandra, Elasticsearch, Prometheus, or Grafana.

But we were using the ES created by the Temporal Helm chart. Looking at the other reports of this error, people were also using Helm charts, so it is 100% something with that. We will be moving away from the ES created by the Temporal Helm chart.

karelbilek · October 18, 2023, 10:34am

Just replying to my own old thread.

It was probably just caused by Elasticsearch running out of memory.

We (hopefully) fixed it by giving it more memory and setting the replicas better.
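For anyone checking whether their Elasticsearch is in a similar state, here is a rough sketch that inspects the responses of two standard Elasticsearch endpoints, `GET /_cluster/health` and `GET /_nodes/stats/jvm`. The URL, the 85% heap threshold, and the helper names (`fetch_json`, `diagnose`, `check_cluster`) are assumptions for illustration; adjust them to your setup, and add authentication if your cluster requires it.

```python
import json
from urllib.request import urlopen

ES_URL = "http://localhost:9200"  # assumption: replace with your ES address


def fetch_json(path, base_url=ES_URL, timeout=5):
    """GET a JSON document from the Elasticsearch REST API."""
    with urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def diagnose(health, node_stats, heap_threshold=85):
    """Return a list of human-readable problems found in the two responses."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        problems.append(f"{unassigned} unassigned shards")
    for node in node_stats.get("nodes", {}).values():
        heap = node["jvm"]["mem"]["heap_used_percent"]
        if heap >= heap_threshold:  # threshold is an arbitrary assumption
            problems.append(f"node {node.get('name', '?')} heap at {heap}%")
    return problems


def check_cluster(base_url=ES_URL):
    """Fetch both endpoints from a live cluster and diagnose them."""
    health = fetch_json("/_cluster/health", base_url)
    stats = fetch_json("/_nodes/stats/jvm", base_url)
    return diagnose(health, stats)
```

Calling `check_cluster()` against a reachable cluster returns an empty list when nothing looks off. A non-green status, unassigned shards, or heap usage that stays high are all hints that the cluster needs more memory or a different replica setup.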

Sakshi1 · June 20, 2024, 7:03am

@karelbilek We are also facing the same issue. Can you please elaborate a little more on how you resolved it? We are running Temporal on self-managed clusters. We are also using the Elasticsearch created by our Helm charts.
Thanks!

karelbilek · June 20, 2024, 9:00am

Unfortunately I don’t remember, as I don’t work there anymore.

We solved it by setting up separate Elasticsearch k8s clusters that were more “beefy”, with more memory and more than one replica.

I remember that by FAR the most costly thing in all our Kubernetes setup (by memory at least, but I think by other measures too) was then Temporal's Elasticsearch. It cost more than the entire rest of the application.

(We were thinking about moving away from Elasticsearch for Advanced Visibility and switching to the Postgres one, but we never figured all that out, as it was not yet ready in May 2023.)
