Today we got a large number of "GetOrCreateShard: failed to get ShardID" errors in the temporal-history logs (by "large number" I mean around 40 thousand of them). At the same time, we could not create new workflows and got "Deadline Exceeded" on the Temporal client.
We are using Temporal 1.18 with 512 shards. (That might be too many, since we only have a few workers so far in this testing environment, but, eh, it was the default.)
It is exactly what is described here:
"Context deadline exceeded issue" (Community Support, Temporal)
error.","error":"GetVisibilityTasks operation failed. Select failed.
error.","error":"UpdateShard failed. Failed to start transaction.
error.","error":"GetOrCreateShard: failed to get ShardID 177
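A rough way to see which operation accounts for most of the errors is to tally the log lines by the leading part of their error message. This is only a sketch, not something from Temporal itself: it assumes the history service writes one JSON object per line with an "error" field (as the fragments above suggest), and the function name `tally_errors` and the log file name are made up for illustration.

```python
import json
from collections import Counter


def tally_errors(lines):
    """Count log lines by the leading part of their "error" field."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        err = entry.get("error")
        if not isinstance(err, str):
            continue  # skip entries without an error message
        # Keep the text before the first "." or ":",
        # e.g. "UpdateShard failed" or "GetOrCreateShard".
        key = err.split(".")[0].split(":")[0].strip()
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical file name; point this at the real history log.
    with open("temporal-history.log", encoding="utf-8") as f:
        for name, count in tally_errors(f).most_common():
            print(count, name)
```

If the counts are dominated by operations such as UpdateShard or GetOrCreateShard, that at least tells us which calls to look at first.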
We are now trying to figure out what is wrong.
We are using Postgres for the visibility store and Elasticsearch (ES) for advanced visibility features.
My main question: are those errors caused by ES or by Postgres? I cannot figure that out.