Operation updateShard encounter timeout

Hi,

We are facing workflow timeouts and are unable to spawn child workflows, and sometimes even activities. Our history nodes are logging the errors below. We would appreciate any suggestions.

{“level”:“error”,“ts”:“2022-06-16T18:24:47.075Z”,“msg”:“Fail to process task”,“service”:“history”,“shard-id”:4001,“address”:“x.x.x.x:7234”,“shard-item”:“0xc00c4ff000”,“component”:“transfer-queue-processor”,“cluster-name”:“active”,“shard-id”:4001,“queue-task-id”:32505900,“queue-task-visibility-timestamp”:1655403666644602000,“xdc-failover-version”:0,“queue-task-type”:“TransferStartChildExecution”,“wf-namespace-id”:“fd9975c9-6798-406c-978d-d9ac0af87dd1”,“wf-id”:“EF632AA469B04BB8829137BC00AF64CA@AVABCgA”,“wf-run-id”:“22b66829-0cda-4992-ac97-78eedbde3894”,“error”:“operation UpdateShard encounter Operation timed out - received only 1 responses.”,“lifecycle”:“ProcessingFailed”,“logging-call-at”:“taskProcessor.go:326”,“stacktrace”:“go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:326\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:212\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161”}

What is your configured Cassandra replication factor?
Do you see any errors in the DB logs?
Can you also check that system time is in sync across your Cassandra nodes?

3 nodes, rf=1.
On checking the Cassandra logs, we see high latency at both LOCAL_SERIAL and LOCAL_QUORUM. Our DBA has also raised red flags on:

  1. Usage of secondary indexes.
  2. Queue-like workload, as this is an anti-pattern in Cassandra.

We had a node down 2 days ago, and our history services were failing to start, throwing connection timeouts.

  1. Secondary indexes are used only for visibility. For high-throughput use cases, don’t use Cassandra for visibility. Use the Elasticsearch integration instead.
  2. Temporal implements queues using immutable, append-only tables. This works fine. The anti-pattern is a naive implementation that mutates messages.

3 nodes, rf=1

You should set the replication factor to 3 to prevent data loss.

Which replication strategy is used? NetworkTopologyStrategy?

We had a node down 2 days ago, and our history services were failing to start, throwing connection timeouts.

I think this too can be related to the replication factor used. With a replication factor of 3 you should be able to cope with one node being down/unavailable for most operations.

@tihomir

Not quite following this statement. If RF=1, it means we should be able to cope with one node being down/unavailable, right? If this is not true for Temporal, can you please point me to the doc with the recommended settings? I believe we are using the helm chart for deployment (https://gecgithub01.walmart.com/Ingestion-Services/item-ingestion-syncapi/pull/93).

Thanks @maxim. We will try the Elasticsearch integration and will also respond to our DBA about the queue workload.

Not quite following this statement. If RF=1, it means we should be able to cope with one node being down/unavailable, right?

I don’t think this is the case; see for example here.
A replication factor of 3 is recommended for production; more info here.
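
To make the arithmetic concrete: with RF=1 each row is stored on exactly one node, so when that node is down its data cannot be read or written. A quorum operation needs floor(RF/2) + 1 replicas to respond, which leaves room for a failed replica only once RF is 3 or more. Here is a minimal Python sketch of that calculation; the function names are my own for illustration and are not part of Temporal or Cassandra:

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum read or write."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 2, 3, 5):
        print(
            f"RF={rf}: quorum={quorum(rf)}, "
            f"tolerated failures={tolerated_replica_failures(rf)}"
        )
```

For RF=1 the quorum is 1 and no replica failure is tolerated; for RF=3 the quorum is 2 and one replica can be down.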

@tihomir We updated the RF to 3 but still see this issue.
One more update from our DBA, which we were unaware of: there is a backup cluster in another region, and some of the reads/writes are happening cross-region. The DBA is asking us to look at the driver configuration. Can you please point us to the relevant config and explain how to stop these cross-region requests?

We had to migrate the Temporal server to a new C* cluster, as multiple C* clusters are not recommended for Temporal. After this migration we no longer see these issues.