hi,
We are facing Workflow timeouts and are unable to spawn child workflows, and sometimes even activities. Our history nodes are logging the errors below. Appreciate any suggestions.
{"level":"error","ts":"2022-06-16T18:24:47.075Z","msg":"Fail to process task","service":"history","shard-id":4001,"address":"x.x.x.x:7234","shard-item":"0xc00c4ff000","component":"transfer-queue-processor","cluster-name":"active","shard-id":4001,"queue-task-id":32505900,"queue-task-visibility-timestamp":1655403666644602000,"xdc-failover-version":0,"queue-task-type":"TransferStartChildExecution","wf-namespace-id":"fd9975c9-6798-406c-978d-d9ac0af87dd1","wf-id":"EF632AA469B04BB8829137BC00AF64CA@AVABCgA","wf-run-id":"22b66829-0cda-4992-ac97-78eedbde3894","error":"operation UpdateShard encounter Operation timed out - received only 1 responses.","lifecycle":"ProcessingFailed","logging-call-at":"taskProcessor.go:326","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).handleTaskError\n\t/temporal/service/history/taskProcessor.go:326\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:212\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
What is your configured cassandra replication factor?
Do you see any errors in db logs?
Can you also check that system time is in sync between your cassandra nodes?
3 nodes, rf=1.
On checking the Cassandra logs, we see high latency at both LOCAL_SERIAL and LOCAL_QUORUM. Our DBA has also raised red flags on:
- usage of secondary indexes
- queue-like workload, as this is an anti-pattern in Cassandra.
We had a node go down 2 days back and our history services would not start up, throwing connection timeouts.
3 nodes, rf=1
You should set the replication factor to 3 to prevent data loss.
What is the strategy used? NetworkTopologyStrategy?
We had a node go down 2 days back and our history services would not start up, throwing connection timeouts.
I think this too can be related to the replication factor used. If you use RF=3, you should be able to cope with one node being down/unavailable for most operations.
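If it helps, here is a minimal sketch of raising the Temporal keyspaces to RF=3 with NetworkTopologyStrategy through the gocql driver (Go, since that is what the server itself is written in). The keyspace names `temporal`/`temporal_visibility`, the contact point, and the DC name `dc1` are assumptions for your environment; you can just as well run the same ALTER KEYSPACE statements from cqlsh, and you should follow up with a full repair so existing data actually gets replicated.

```go
// Hypothetical sketch: raise the Temporal keyspaces to RF=3 using
// NetworkTopologyStrategy. Keyspace names, contact point, and the DC
// name "dc1" are placeholders; adjust them to your cluster.
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("x.x.x.x") // any reachable contact point
	cluster.Consistency = gocql.Quorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()

	for _, ks := range []string{"temporal", "temporal_visibility"} {
		stmt := fmt.Sprintf(
			"ALTER KEYSPACE %s WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}", ks)
		if err := session.Query(stmt).Exec(); err != nil {
			log.Fatalf("alter %s: %v", ks, err)
		}
	}
	// ALTER KEYSPACE only changes the metadata; run `nodetool repair --full`
	// on each node afterwards so the additional replicas are streamed.
}
```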
@tihomir
Not quite following this statement. If RF=1, it means we should be able to cope with one node being down/unavailable, right? If this is not true for Temporal, can you please point me to the doc with recommended settings? I believe we are using the Helm chart for deployment (https://gecgithub01.walmart.com/Ingestion-Services/item-ingestion-syncapi/pull/93).
Thanks @maxim. Will try the Elasticsearch integration and will also respond back to our DBA on the queue workload.
Not quite following this statement. If RF=1, it means we should be able to cope with one node being down/unavailable, right?
I don't think this is the case; see for example here.
A replication factor of 3 is recommended for production; more info here.
@tihomir We updated the RF to 3 but still see this issue.
One more update from the DBA that we were unaware of: there is a backup cluster in another region, and some of the writes/reads are happening cross-region. The DBA is asking us to look at the driver configuration. Can you please point us to the relevant config and how to stop these cross-region requests?
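For reference, cross-DC routing is determined by the Cassandra driver's host selection policy. Temporal's persistence layer is built on gocql, and, if I remember correctly, the server's cassandra persistence config exposes a datacenter setting that feeds into a DC-aware policy. Below is a minimal sketch of what local-DC pinning looks like at the driver level, not Temporal's actual configuration; the contact point and the DC name `dc1` are placeholders.

```go
// Sketch only: pin a gocql session to a single local datacenter so that
// reads/writes (including LWTs such as shard updates) never leave it.
// "x.x.x.x" and "dc1" are placeholders for your own contact point and DC name.
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("x.x.x.x")
	cluster.Consistency = gocql.LocalQuorum       // regular reads/writes stay in the local DC
	cluster.SerialConsistency = gocql.LocalSerial // conditional (LWT) writes stay local too
	cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
		gocql.DCAwareRoundRobinPolicy("dc1"), // only coordinators in dc1 are used
	)

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()
	// Queries issued through this session are coordinated only by dc1 nodes.
}
```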
We had to migrate the Temporal server to a new C* cluster, as multiple C* clusters are not recommended for Temporal. After this migration we no longer see these issues.