MySQL connectivity issues

Hi. We are facing MySQL connectivity issues from the Cadence history k8s pods, with errors such as:

{"level":"error","ts":"2020-07-29T10:25:56.624Z","msg":"Operation failed with internal error.","service":"cadence-history","error":"InternalServiceError{Message: GetTimerTasks operation failed. Select failed. Error: write tcp 10.97.9.0:37288->10.97.1.220:3306: write: broken pipe}","metric-scope":20,"shard-id":663,"logging-call-at":"persistenceMetricClients.go:523","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:523\ngithub.com/uber/cadence/common/persistence.(*workflowExecutionPersistenceClient).GetTimerIndexTasks\n\t/cadence/common/persistence/persistenceMetricClients.go:457\ngithub.com/uber/cadence/service/history.(*timerQueueAckMgrImpl).getTimerTasks\n\t/cadence/service/history/timerQueueAckMgr.go:393\ngithub.com/uber/cadence/service/history.(*timerQueueAckMgrImpl).readTimerTasks\n\t/cadence/service/history/timerQueueAckMgr.go:208\ngithub.com/uber/cadence/service/history.(*timerQueueProcessorBase).readAndFanoutTimerTasks\n\t/cadence/service/history/timerQueueProcessorBase.go:341\ngithub.com/uber/cadence/service/history.(*timerQueueProcessorBase).internalProcessor\n\t/cadence/service/history/timerQueueProcessorBase.go:292\ngithub.com/uber/cadence/service/history.(*timerQueueProcessorBase).processorPump\n\t/cadence/service/history/timerQueueProcessorBase.go:164"}

{"level":"error","ts":"2020-07-29T10:25:56.616Z","msg":"Operation failed with internal error.","service":"cadence-history","error":"InternalServiceError{Message: GetTransferTasks operation failed. Select failed. Error: EOF}","metric-scope":12,"shard-id":607,"logging-call-at":"persistenceMetricClients.go:523","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:523\ngithub.com/uber/cadence/common/persistence.(*workflowExecutionPersistenceClient).GetTransferTasks\n\t/cadence/common/persistence/persistenceMetricClients.go:341\ngithub.com/uber/cadence/service/history.(*transferQueueProcessorBase).readTasks\n\t/cadence/service/history/transferQueueProcessorBase.go:96\ngithub.com/uber/cadence/service/history.(*queueAckMgrImpl).readQueueTasks.func1\n\t/cadence/service/history/queueAckMgr.go:103\ngithub.com/uber/cadence/common/backoff.Retry\n\t/cadence/common/backoff/retry.go:99\ngithub.com/uber/cadence/service/history.(*queueAckMgrImpl).readQueueTasks\n\t/cadence/service/history/queueAckMgr.go:107\ngithub.com/uber/cadence/service/history.(*queueProcessorBase).processBatch\n\t/cadence/service/history/queueProcessor.go:215\ngithub.com/uber/cadence/service/history.(*queueProcessorBase).processorPump\n\t/cadence/service/history/queueProcessor.go:190"}

{"level":"error","ts":"2020-07-29T10:26:01.313Z","msg":"Operation failed with internal error.","service":"cadence-history","error":"InternalServiceError{Message: GetTransferTasks operation failed. Select failed. Error: invalid connection}","metric-scope":12,"shard-id":612,"logging-call-at":"persistenceMetricClients.go:523","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*workflowExecutionPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:523\ngithub.com/uber/cadence/common/persistence.

[mysql] 2020/07/29 10:25:56 packets.go:36: unexpected EOF

After many such errors, MySQL blocks further connections from that host:

"error":"InternalServiceError{Message: GetTransferTasks operation failed. Select failed. Error: Error 1129: Host '10.97.9.0' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'}

This has happened multiple times over the last week. Running mysqladmin flush-hosts clears the block, but the problem comes back.
We haven't configured connection pool limits on any of our services. It looks like the broken pipe errors occur because the SQL driver on the client side is reusing connections that the server has already closed. Has anybody else faced this issue?
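For context, the [mysql] packets.go log line above comes from the go-sql-driver/mysql driver that Cadence's SQL persistence uses through Go's database/sql pool, so the relevant knobs are the standard database/sql ones. A minimal sketch of tuning them is below; the DSN credentials and the specific limits are placeholders, not settings from this deployment:

    package main

    import (
        "database/sql"
        "log"
        "time"

        _ "github.com/go-sql-driver/mysql" // driver emitting the "packets.go: unexpected EOF" log line
    )

    func main() {
        // Placeholder DSN; host/port taken from the log, credentials are made up.
        db, err := sql.Open("mysql", "cadence:password@tcp(10.97.1.220:3306)/cadence")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Bound the pool so each pod keeps a predictable number of connections.
        db.SetMaxOpenConns(20)
        db.SetMaxIdleConns(20)
        // Recycle connections client-side before the server drops them, so the
        // driver does not try to write on a socket MySQL has already closed.
        db.SetConnMaxLifetime(time.Hour)

        if err := db.Ping(); err != nil {
            log.Fatal(err)
        }
    }

With no SetConnMaxLifetime and no idle cap, the pool can hand out a connection the server has since closed, which surfaces as the broken pipe / invalid connection errors shown above.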


Could you check if this topic's solution applies to you?


@Guruprasad_Venkatesh If the topic that @maxim linked above does not help you, can you share a bit more about your topology, i.e. how many nodes and which roles you are running? Additionally, can you share your max_connect_errors setting from MySQL? You can get it by running the following query: SELECT @@GLOBAL.max_connect_errors
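If it is more convenient, the same check can be run from Go with the driver already in use; the DSN is a placeholder:

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/go-sql-driver/mysql"
    )

    func main() {
        // Placeholder DSN; point it at the MySQL instance Cadence uses.
        db, err := sql.Open("mysql", "cadence:password@tcp(10.97.1.220:3306)/cadence")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Read the server-side threshold that triggers Error 1129 host blocking.
        var maxConnectErrors int64
        if err := db.QueryRow("SELECT @@GLOBAL.max_connect_errors").Scan(&maxConnectErrors); err != nil {
            log.Fatal(err)
        }
        fmt.Println("max_connect_errors =", maxConnectErrors)
    }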

@maxim, @shawn: Many thanks. Setting the connection pool parameters fixed the issue; we don't see the invalid connection errors anymore.
