Yesterday we ran into some issues with our Cadence setup. CPU usage on one of our machine instances climbed to around 90%, and all inbound workflow executions were stuck in the "Scheduled" state. After checking the logs, we noticed that the matching service was throwing the following error:
{
  "level": "error",
  "ts": "2021-03-20T14:41:55.130Z",
  "msg": "Operation failed with internal error.",
  "service": "cadence-matching",
  "error": "InternalServiceError{Message: UpdateTaskList operation failed. Error: gocql: no hosts available in the pool}",
  "metric-scope": 34,
  "logging-call-at": "persistenceMetricClients.go:872",
  "stacktrace": "github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:872\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).UpdateTaskList\n\t/cadence/common/persistence/persistenceMetricClients.go:855\ngithub.com/uber/cadence/service/matching.(*taskListDB).UpdateState\n\t/cadence/service/matching/db.go:103\ngithub.com/uber/cadence/service/matching.(*taskReader).persistAckLevel\n\t/cadence/service/matching/taskReader.go:277\ngithub.com/uber/cadence/service/matching.(*taskReader).getTasksPump\n\t/cadence/service/matching/taskReader.go:156"
}
After restarting the workflow, everything went back to normal, but we're still trying to figure out what happened. We were not running any heavy workload at the time; it just happened suddenly. Our main suspicion is that the matching service lost connectivity to the Cassandra database during this event and was only able to re-establish the connection after the restart, but that is just a hypothesis at this point.
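As far as we understand, gocql returns "no hosts available in the pool" when every Cassandra host in its connection pool is currently marked down, which would fit the lost-connectivity theory. Below is a minimal standalone sketch of a gocql client (the host addresses, keyspace, and intervals are placeholders, not Cadence's actual connection code) showing the driver settings that govern how quickly it recovers once the hosts become reachable again:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder hosts/keyspace; Cadence reads these from its persistence
	// config, this sketch only mimics a plain gocql client.
	cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2")
	cluster.Keyspace = "cadence"
	cluster.Consistency = gocql.LocalQuorum
	cluster.Timeout = 10 * time.Second

	// If every host ends up marked down (network blip, node restart),
	// queries fail with "gocql: no hosts available in the pool" until a
	// reconnect succeeds. These settings control how the driver retries.
	cluster.ReconnectInterval = 10 * time.Second
	cluster.ReconnectionPolicy = &gocql.ConstantReconnectionPolicy{
		MaxRetries: 10,
		Interval:   8 * time.Second,
	}

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("failed to create session: %v", err)
	}
	defer session.Close()

	// Trivial probe query against a system table to confirm connectivity.
	var release string
	if err := session.Query("SELECT release_version FROM system.local").Scan(&release); err != nil {
		log.Fatalf("probe query failed: %v", err)
	}
	fmt.Println("connected, Cassandra release:", release)
}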
What might have caused this problem, and is there a way to prevent it from happening in the future? Maybe some dynamic config that we're missing?
PS: Cadence version is 0.18.3