Frontend error: "shard status unknown"

I'm getting a lot of these errors on the frontend service while stress testing via maru.

{"level":"info","ts":"2023-01-17T09:28:02.626Z","msg":"history client encountered error","service":"frontend","error":"shard status unknown","service-error-type":"serviceerror.Unavailable","logging-call-at":"metric_client.go:90"}

There are also similar errors that say a timeout occurred during StartWorkflowTask etc.
Does this indicate history nodes need to be scaled up? I can see workflows being executed despite these errors…

I would check the stability of your history hosts during the load test. The server has a restarts counter metric that you could look at, as well as service_errors_resource_exhausted, which you can filter by operations (RpsLimit, ConcurrentLimit, SystemOverloaded). Monitoring your history service CPU utilization would be good as well, to know whether you are giving your history hosts enough resources.

This error is typically due to shard(s) getting unloaded (one reason could be a history host going down) and then having to be rebalanced across the other available history hosts until a new shard owner is determined.

I can see workflows being executed despite these errors…

Yeah, in most cases this should not affect your workflow executions (they would be able to make progress once a new shard owner is determined), but you could see increased latencies in persistence during that time.