Intermittent "no response received from cassandra within timeout period" errors

Hi, we are seeing intermittent connectivity errors against Cassandra (C*). We suspect a network issue, although the error is reproducible on two clusters, so it may be down to configuration.

Is there a configuration setting that defines this timeout, or is it a Cassandra default?

Datastore config:

datastores:
    default:
      cassandra:
        hosts: "dev-westeurope-01-primary-dc1-service.temporal-state.svc.cluster.local"
        port: 9042
        user: "{{ .Env.TEMPORAL_STORE_USERNAME }}"
        password: "{{ .Env.TEMPORAL_STORE_PASSWORD }}"
        connectTimeout: 2s
        consistency:
          default:
            consistency: local_quorum
            serialConsistency: local_serial
        datacenter: dc1
        disableInitialHostLookup: true
        keyspace: temporal
        replicationFactor: 3

Errors

{"level":"warn","ts":"2022-03-10T17:20:19.892Z","msg":"Processor unable to retrieve tasks","service":"history","shard-id":1814,"address":"10.1.6.176:7234","component":"visibility-queue-processor","error":"operation GetVisibilityTasks encountered gocql: no response received from cassandra within timeout period","logging-call-at":"queueProcessor.go:248"}
{"level":"error","ts":"2022-03-10T17:22:27.492Z","msg":"Error updating ack level for shard","service":"history","shard-id":958,"address":"10.1.6.176:7234","component":"transfer-queue-processor","cluster-name":"dev-westeurope-01-secondary","error":"operation UpdateShard encountered gocql: no response received from cassandra within timeout period","operation-result":"OperationFailed","logging-call-at":"queueAckMgr.go:225","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/go/pkg/mod/go.temporal.io/server@v1.14.4/common/log/zap_logger.go:142\ngo.temporal.io/server/service/history.(*queueAckMgrImpl).updateQueueAckLevel\n\t/go/pkg/mod/go.temporal.io/server@v1.14.4/service/history/queueAckMgr.go:225\ngo.temporal.io/server/service/history.(*queueProcessorBase).processorPump\n\t/go/pkg/mod/go.temporal.io/server@v1.14.4/service/history/queueProcessor.go:223"}
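A note for anyone triaging similar logs: the text after "encountered" is the message of gocql's client-side ErrTimeoutNoResponse, which the driver returns when it does not get a response within its own request timeout, rather than a timeout reported back by the Cassandra server. To see which operations are affected, a minimal Go sketch like the one below can count these errors per component and operation. The JSON field names come from the log lines above; the type and function names (logEntry, operationOf, countTimeouts) are hypothetical, and the sample input is the two lines above with the timestamps and stack trace trimmed.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// gocqlTimeoutMsg is the message carried by gocql's ErrTimeoutNoResponse.
const gocqlTimeoutMsg = "gocql: no response received from cassandra within timeout period"

// logEntry holds the fields of interest from one JSON log line.
type logEntry struct {
	Component string `json:"component"`
	ShardID   int    `json:"shard-id"`
	Error     string `json:"error"`
}

// operationOf extracts the operation name from an error of the form
// "operation <Name> encountered ...". It returns "" for any other shape.
func operationOf(errMsg string) string {
	fields := strings.Fields(errMsg)
	if len(fields) >= 3 && fields[0] == "operation" && fields[2] == "encountered" {
		return fields[1]
	}
	return ""
}

// countTimeouts scans newline-separated JSON log lines and counts gocql
// client timeouts, keyed by "component/operation".
func countTimeouts(logs string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logs))
	// Lines with stack traces can be long, so raise the scanner's line limit.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // skip blank or non-JSON lines
		}
		if !strings.Contains(e.Error, gocqlTimeoutMsg) {
			continue // some other error, or no error at all
		}
		counts[e.Component+"/"+operationOf(e.Error)]++
	}
	return counts
}

func main() {
	logs := `{"level":"warn","service":"history","shard-id":1814,"component":"visibility-queue-processor","error":"operation GetVisibilityTasks encountered gocql: no response received from cassandra within timeout period"}
{"level":"error","service":"history","shard-id":958,"component":"transfer-queue-processor","error":"operation UpdateShard encountered gocql: no response received from cassandra within timeout period"}`
	for key, n := range countTimeouts(logs) {
		fmt.Printf("%s: %d\n", key, n)
	}
}
```

If the timeouts are spread across unrelated operations and shards, as they are here, that points away from one slow query and towards something affecting the client as a whole.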

Cassandra timeout settings:

slow_query_log_timeout_in_ms: 500
write_request_timeout_in_ms: 2000
truncate_request_timeout_in_ms: 60000
read_request_timeout_in_ms: 5000
request_timeout_in_ms: 10000
cross_node_timeout: false
range_request_timeout_in_ms: 10000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000

I will answer my own question: the CPU limit on the history service was set too low (100m), and this was a cause of the intermittent hang. I will post again if things change, but so far so good.
