hi,
We are facing Workflow timeouts and are unable to spawn child workflows, and sometimes even activities. Our history nodes are logging the errors below. Appreciate any suggestions.
{"level":"error","ts":"2022-06-16T18:24:47.075Z","msg":"Fail to process task","service":"history","shard-id":4001,"address":"x.x.x.x:7234","shard-item":"0xc00c4ff000","component":"transfer-queue-processor","cluster-name":"active","shard-id":4001,"queue-task-id":32505900,"queue-task-visibility-timestamp":1655403666644602000,"xdc-failover-version":0,"queue-task-type":"TransferStartChildExecution","wf-namespace-id":"fd9975c9-6798-406c-978d-d9ac0af87dd1","wf-id":"EF632AA469B04BB8829137BC00AF64CA@AVABCgA","wf-run-id":"22b66829-0cda-4992-ac97-78eedbde3894","error":"operation UpdateShard encounter Operation timed out - received only 1 responses.","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:326","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:326\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:212\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
What is your configured cassandra replication factor?
Do you see any errors in db logs?
Can you also check that system time is in sync between your cassandra nodes?
3 nodes, rf=1.
On checking the Cassandra logs, we see high latency at both LOCAL_SERIAL and LOCAL_QUORUM. Our DBA has also raised red flags on:
- usage of secondary indexes
- queue-like workload, as this is an anti-pattern in Cassandra.
We had a node go down 2 days back and our history services would not start up, throwing connection timeouts.
3 nodes, rf=1
You should set the replication factor to 3 to prevent data loss.
What is the strategy used? NetworkTopologyStrategy?
We had a node go down 2 days back and our history services would not start up, throwing connection timeouts.
I think this too can be related to the replication factor used. If you use RF=3, you should be able to cope with one node being down/unavailable for most operations.
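If it helps, here is a minimal sketch of raising the Temporal keyspaces to RF=3 with NetworkTopologyStrategy through the gocql driver (Go, since that is what the server itself is written in). The keyspace names `temporal`/`temporal_visibility`, the contact point, and the DC name `dc1` are assumptions for your environment; you can just as well run the same ALTER KEYSPACE statements from cqlsh, and you should follow up with a full repair so existing data actually gets replicated.

```go
// Hypothetical sketch: raise the Temporal keyspaces to RF=3 using
// NetworkTopologyStrategy. Keyspace names, contact point, and the DC
// name "dc1" are placeholders; adjust them to your cluster.
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("x.x.x.x") // any reachable contact point
	cluster.Consistency = gocql.Quorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()

	for _, ks := range []string{"temporal", "temporal_visibility"} {
		stmt := fmt.Sprintf(
			"ALTER KEYSPACE %s WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}", ks)
		if err := session.Query(stmt).Exec(); err != nil {
			log.Fatalf("alter %s: %v", ks, err)
		}
	}
	// ALTER KEYSPACE only changes the metadata; run `nodetool repair --full`
	// on each node afterwards so the additional replicas are streamed.
}
```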
@tihomir
Not quite following this statement. If RF=1, it means we should be able to cope with one node being down/unavailable, right? If this is not true for Temporal, can you please point me to the doc with recommended settings? I believe we are using the Helm chart for deployment (https://gecgithub01.walmart.com/Ingestion-Services/item-ingestion-syncapi/pull/93).
Thanks @maxim. Will try the Elasticsearch integration and will also respond back to our DBA on the queue workload.
Not quite following this statement. If RF=1, it means we should be able to cope with one node being down/unavailable, right?
I don't think this is the case; see for example here.
A replication factor of 3 is recommended for production; more info here.
@tihomir We updated the RF to 3 but still see this issue.
One more update from the DBA that we were unaware of: there is a backup cluster in another region, and some of the writes/reads are happening cross-region. The DBA is asking us to look at the driver configuration. Can you please point us to the relevant config and how to stop these cross-region requests?
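For reference, cross-DC routing is determined by the Cassandra driver's host selection policy. Temporal's persistence layer is built on gocql, and, if I remember correctly, the server's cassandra persistence config exposes a datacenter setting that feeds into a DC-aware policy. Below is a minimal sketch of what local-DC pinning looks like at the driver level, not Temporal's actual configuration; the contact point and the DC name `dc1` are placeholders.

```go
// Sketch only: pin a gocql session to a single local datacenter so that
// reads/writes (including LWTs such as shard updates) never leave it.
// "x.x.x.x" and "dc1" are placeholders for your own contact point and DC name.
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("x.x.x.x")
	cluster.Consistency = gocql.LocalQuorum       // regular reads/writes stay in the local DC
	cluster.SerialConsistency = gocql.LocalSerial // conditional (LWT) writes stay local too
	cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
		gocql.DCAwareRoundRobinPolicy("dc1"), // only coordinators in dc1 are used
	)

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()
	// Queries issued through this session are coordinated only by dc1 nodes.
}
```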
We had to migrate the Temporal server to a new C* cluster, as multiple C* clusters are not recommended for Temporal. After this migration we no longer see these issues.