UpdateTaskList operation failed with Cadence matching-service

Yesterday we ran into some issues with our Cadence setup. CPU usage on one of our machine instances climbed to 90%, and all inbound workflow executions got stuck in the “Scheduled” state. After checking the logs, we noticed that the matching service was throwing the following error:

  "level": "error",
  "ts": "2021-03-20T14:41:55.130Z",
  "msg": "Operation failed with internal error.",
  "service": "cadence-matching",
  "error": "InternalServiceError{Message: UpdateTaskList operation failed. Error: gocql: no hosts available in the pool}",
  "metric-scope": 34,
  "logging-call-at": "persistenceMetricClients.go:872",
  "stacktrace": "github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:872\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).UpdateTaskList\n\t/cadence/common/persistence/persistenceMetricClients.go:855\ngithub.com/uber/cadence/service/matching.(*taskListDB).UpdateState\n\t/cadence/service/matching/db.go:103\ngithub.com/uber/cadence/service/matching.(*taskReader).persistAckLevel\n\t/cadence/service/matching/taskReader.go:277\ngithub.com/uber/cadence/service/matching.(*taskReader).getTasksPump\n\t/cadence/service/matching/taskReader.go:156"

After restarting the matching service, everything went back to normal, but we’re still trying to figure out what happened. We weren’t running any heavy workload at the moment of this event; it just happened suddenly. Our main suspicion is that the matching service lost connectivity to the Cassandra database during the event and was only able to reconnect after we restarted it. But this is just a hypothesis at this point.
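
If it helps, one way to sanity-check that hypothesis the next time it happens is a small standalone gocql probe run from the same network as the matching service, to see whether the driver can reach the cluster at all. The host and keyspace below are placeholders, not our real config:

  package main

  import (
      "log"
      "time"

      "github.com/gocql/gocql"
  )

  func main() {
      // Placeholder contact point and keyspace -- substitute your own setup.
      cluster := gocql.NewCluster("cassandra.example.internal")
      cluster.Keyspace = "cadence"
      cluster.Timeout = 5 * time.Second

      // CreateSession fails if no host can be reached, which is the same
      // condition behind "gocql: no hosts available in the pool".
      session, err := cluster.CreateSession()
      if err != nil {
          log.Fatalf("cannot reach Cassandra: %v", err)
      }
      defer session.Close()

      // A trivial read from a system table confirms queries round-trip.
      var version string
      if err := session.Query("SELECT release_version FROM system.local").Scan(&version); err != nil {
          log.Fatalf("probe query failed: %v", err)
      }
      log.Printf("Cassandra reachable, release_version=%s", version)
  }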

What might have caused this problem? And is there a way to prevent it from happening in the future? Maybe there’s some dynamic config we’re missing?

PS: Cadence version is 0.18.3

^ This is a known issue in gocql:

The issue is fixed on Temporal master and will be part of the 1.8.x release:
(Not sure about Cadence; it may be worth considering a move to Temporal?)

I am running Temporal 1.9.2, but I still see the issue.

{"level":"error","ts":"2021-06-02T02:54:23.495Z","msg":"Error refreshing namespace cache","service":"worker","error":"gocql: no hosts available in the pool","logging-call-at":"namespaceCache.go:414","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/common/cache.(*namespaceCache).refreshLoop\n\t/temporal/common/cache/namespaceCache.go:414"}
kubectl exec temporaltest-admintools-5799478b5c-5hjrp -- tctl help
   tctl - A command-line tool for Temporal users

   tctl [global options] command [command options] [arguments...]


Is there something else I need to do?

The above error can happen as a transient error. The underlying gocql wrapper will create a new session after seeing this error.
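
In other words, the recovery behaves roughly like the sketch below. This is a simplified, hypothetical illustration of that pattern, not the actual Temporal/Cadence persistence code; it keys off gocql's exported ErrNoConnections sentinel, whose message matches the one in the logs above:

  package main

  import (
      "errors"
      "log"
      "sync"

      "github.com/gocql/gocql"
  )

  // refreshingSession sketches the recovery pattern described above: when a
  // query fails with gocql.ErrNoConnections ("gocql: no hosts available in
  // the pool"), the current session is discarded and a fresh one is created
  // lazily on the next use.
  type refreshingSession struct {
      mu      sync.Mutex
      cluster *gocql.ClusterConfig
      session *gocql.Session
  }

  func (r *refreshingSession) get() (*gocql.Session, error) {
      r.mu.Lock()
      defer r.mu.Unlock()
      if r.session == nil {
          s, err := r.cluster.CreateSession()
          if err != nil {
              return nil, err
          }
          r.session = s
      }
      return r.session, nil
  }

  // execute runs one statement; on the transient "no hosts" error it drops
  // the session so the next call rebuilds it against the live cluster.
  func (r *refreshingSession) execute(stmt string, values ...interface{}) error {
      s, err := r.get()
      if err != nil {
          return err
      }
      err = s.Query(stmt, values...).Exec()
      if errors.Is(err, gocql.ErrNoConnections) {
          r.mu.Lock()
          if r.session == s { // only drop the session that actually failed
              r.session.Close()
              r.session = nil
          }
          r.mu.Unlock()
      }
      return err
  }

  func main() {
      // Placeholder host; substitute your own contact points.
      r := &refreshingSession{cluster: gocql.NewCluster("cassandra.example.internal")}
      if err := r.execute("SELECT release_version FROM system.local"); err != nil {
          log.Printf("query failed (a retry will use a fresh session): %v", err)
      }
  }

So an isolated burst of these errors should heal on the next attempt; if they persist, that usually points to a real connectivity or Cassandra-side problem rather than this transient case.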