Yesterday we ran into some issues with our Cadence setup. CPU usage on one of our machine instances climbed to around 90%, and all inbound workflow executions were stuck in the "Scheduled" state. After checking the logs, we noticed that the matching service was throwing the following error:
{
  "level": "error",
  "ts": "2021-03-20T14:41:55.130Z",
  "msg": "Operation failed with internal error.",
  "service": "cadence-matching",
  "error": "InternalServiceError{Message: UpdateTaskList operation failed. Error: gocql: no hosts available in the pool}",
  "metric-scope": 34,
  "logging-call-at": "persistenceMetricClients.go:872",
  "stacktrace": "github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:872\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).UpdateTaskList\n\t/cadence/common/persistence/persistenceMetricClients.go:855\ngithub.com/uber/cadence/service/matching.(*taskListDB).UpdateState\n\t/cadence/service/matching/db.go:103\ngithub.com/uber/cadence/service/matching.(*taskReader).persistAckLevel\n\t/cadence/service/matching/taskReader.go:277\ngithub.com/uber/cadence/service/matching.(*taskReader).getTasksPump\n\t/cadence/service/matching/taskReader.go:156"
}
After restarting the workflow, everything went back to normal, but we're still trying to figure out what happened. We were not running any heavy workload at the time; it just happened suddenly. Our main suspicion is that the matching service lost connectivity to the Cassandra database during this event and was only able to re-establish the connection after the restart, but that is just a hypothesis at this point.
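As far as we understand, gocql returns "no hosts available in the pool" when every Cassandra host in its connection pool is currently marked down, which would fit the lost-connectivity theory. Below is a minimal standalone sketch of a gocql client (the host addresses, keyspace, and intervals are placeholders, not Cadence's actual connection code) showing the driver settings that govern how quickly it recovers once the hosts become reachable again:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder hosts/keyspace; Cadence reads these from its persistence
	// config, this sketch only mimics a plain gocql client.
	cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2")
	cluster.Keyspace = "cadence"
	cluster.Consistency = gocql.LocalQuorum
	cluster.Timeout = 10 * time.Second

	// If every host ends up marked down (network blip, node restart),
	// queries fail with "gocql: no hosts available in the pool" until a
	// reconnect succeeds. These settings control how the driver retries.
	cluster.ReconnectInterval = 10 * time.Second
	cluster.ReconnectionPolicy = &gocql.ConstantReconnectionPolicy{
		MaxRetries: 10,
		Interval:   8 * time.Second,
	}

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("failed to create session: %v", err)
	}
	defer session.Close()

	// Trivial probe query against a system table to confirm connectivity.
	var release string
	if err := session.Query("SELECT release_version FROM system.local").Scan(&release); err != nil {
		log.Fatalf("probe query failed: %v", err)
	}
	fmt.Println("connected, Cassandra release:", release)
}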
What might have caused this problem, and is there a way to prevent it from happening in the future? Maybe some dynamic config that we're missing?
PS: Cadence version is 0.18.3