History server context deadline exceeded errors every hour

Hello, we’ve been seeing the following recurrent error on our history servers. It happens every hour, around 18 minutes past the hour:

{"level":"error","ts":"2022-09-29T05:18:16.597Z","msg":"Operation failed with internal error.","error":"GetWorkflowExecution: failed to get request cancel info. Error: Failed to get request cancel info. Error: context deadline exceeded","metric-scope":5,"logging-call-at":"persistenceMetricClients.go:1579","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence.(*metricEmitter).updateErrorMetric\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:1579\ngo.temporal.io/server/common/persistence.(*executionPersistenceClient).GetWorkflowExecution\n\t/home/builder/temporal/common/persistence/persistenceMetricClients.go:247\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetWorkflowExecution.func1\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:228\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/persistence.(*executionRetryablePersistenceClient).GetWorkflowExecution\n\t/home/builder/temporal/common/persistence/persistenceRetryableClients.go:232\ngo.temporal.io/server/service/history/shard.(*ContextImpl).GetWorkflowExecution\n\t/home/builder/temporal/service/history/shard/context_impl.go:902\ngo.temporal.io/server/service/history/workflow.getWorkflowExecution\n\t/home/builder/temporal/service/history/workflow/transaction_impl.go:425\ngo.temporal.io/server/service/history/workflow.(*ContextImpl).LoadMutableState\n\t/home/builder/temporal/service/history/workflow/context.go:270\ngo.temporal.io/server/service/history.LoadMutableStateForTask\n\t/home/builder/temporal/service/history/nDCTaskUtil.go:142\ngo.temporal.io/server/service/history.loadMutableStateForTimerTask\n\t/home/builder/temporal/service/history/nDCTaskUtil.go:123\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).executeActivityTimeoutTask\n\t/home/builder/temporal/service/history/timerQueueActiveTaskExecutor.go:192\ngo.temporal.io/server/service/history.(*timerQueueActiveTaskExecutor).Execute\n\t/home/builder/temporal/service/history/timerQueueActiveTaskExecutor.go:108\ngo.temporal.io/server/service/history/queues.(*executorWrapper).Execute\n\t/home/builder/temporal/service/history/queues/executor_wrapper.go:67\ngo.temporal.io/server/service/history/queues.(*executableImpl).Execute\n\t/home/builder/temporal/service/history/queues/executable.go:201\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:225\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:170\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:171\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:235\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:211"}

Most of the errors are caused by “context deadline exceeded”, but we don’t have any hourly cron job on Temporal.

At the time the errors occur, the MySQL “threads_running” metric spikes.

We tried executing “show full processlist” on MySQL every 5 seconds; around the time the errors occur we could see “START TRANSACTION” statements.
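For reference, a minimal sketch of that kind of polling (Go with the go-sql-driver/mysql driver; the DSN is a placeholder, and it reads information_schema.PROCESSLIST rather than the literal SHOW FULL PROCESSLIST output so the columns are easier to scan). It samples Threads_running and the currently executing statements every 5 seconds so they can be correlated with the hourly errors:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; point it at the Temporal persistence database.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		// Sample the Threads_running status counter that spikes with the errors.
		var name, value string
		if err := db.QueryRow("SHOW GLOBAL STATUS LIKE 'Threads_running'").Scan(&name, &value); err != nil {
			log.Printf("status query failed: %v", err)
			continue
		}
		log.Printf("Threads_running=%s", value)

		// List threads that are executing a statement; INFO holds the statement
		// text (this is where "START TRANSACTION" shows up).
		rows, err := db.Query(`
			SELECT ID, USER, COMMAND, TIME, STATE, INFO
			FROM information_schema.PROCESSLIST
			WHERE INFO IS NOT NULL`)
		if err != nil {
			log.Printf("processlist query failed: %v", err)
			continue
		}
		for rows.Next() {
			var (
				id          int64
				user, cmd   string
				elapsed     int64
				state, info sql.NullString
			)
			if err := rows.Scan(&id, &user, &cmd, &elapsed, &state, &info); err != nil {
				log.Printf("scan failed: %v", err)
				break
			}
			log.Printf("id=%d user=%s cmd=%s time=%ds state=%q stmt=%q",
				id, user, cmd, elapsed, state.String, info.String)
		}
		rows.Close()
	}
}
```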

Is there any cron job that the history service runs every hour that would generate a lot of operations against MySQL?

GetWorkflowExecution: failed to get request cancel info.

I believe this comes from the part of the code that tries to find the workflow executions that your workflows are trying to cancel (send cancel requests for).
It could indicate db corruption; I would check the state of your db and look at logs or anything else that could be going wrong on the db side.
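As a rough illustration of that kind of db check (not from the original thread): a sketch that counts pending cancel-request rows per shard. It assumes the default MySQL execution store schema; the table name request_cancel_info_maps and its shard_id column are assumptions based on that schema, so verify them against your deployed schema before running.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; point it at the Temporal persistence database.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/temporal")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count pending cancel-request rows per shard; an unexpected pile-up on a
	// few shards would point at the workflows feeding the failing reads.
	rows, err := db.Query(`
		SELECT shard_id, COUNT(*) AS pending
		FROM request_cancel_info_maps
		GROUP BY shard_id
		ORDER BY pending DESC
		LIMIT 20`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var shardID, pending int64
		if err := rows.Scan(&shardID, &pending); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("shard %d: %d pending cancel requests\n", shardID, pending)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```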

Also, what server and MySQL versions are you running?

mysql version: 8.0.26-17
temporal version: 1.18.0
temporal was deployed on k8s using the Helm chart

I believe there is no business logic that would try to send a cancel request, because the tasks come from a schedule that runs every 5 minutes.

The problem is solved. It went away after the DBA turned off the binlog archive…