Self-hosting Temporal server (currently v1.12.3) in K8s (EKS) and seeing the following errors on every workflow invocation:
{"level":"error","ts":"2022-02-07T22:48:59.553Z","msg":"Operation failed with an error.","logging-call-at":"visibility_manager_metrics.go:255","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence/visibility.(*visibilityPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/visibility/visibility_manager_metrics.go:255\ngo.temporal.io/server/common/persistence/visibility.(*visibilityPersistenceClient).RecordWorkflowExecutionStarted\n\t/temporal/common/persistence/visibility/visibility_manager_metrics.go:68\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerWrapper).RecordWorkflowExecutionStarted\n\t/temporal/common/persistence/visibility/visibility_manager_dual_write.go:78\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).recordStartExecution\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:262\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processStartOrUpsertExecution\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:188\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).execute\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:114\ngo.temporal.io/server/service/history.(*visibilityQueueProcessorImpl).process\n\t/temporal/service/history/visibilityQueueProcessor.go:316\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:269\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:221\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}
{"level":"error","ts":"2022-02-07T22:48:59.915Z","msg":"Operation failed with an error.","logging-call-at":"visibility_manager_metrics.go:255","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:143\ngo.temporal.io/server/common/persistence/visibility.(*visibilityPersistenceClient).updateErrorMetric\n\t/temporal/common/persistence/visibility/visibility_manager_metrics.go:255\ngo.temporal.io/server/common/persistence/visibility.(*visibilityPersistenceClient).RecordWorkflowExecutionClosed\n\t/temporal/common/persistence/visibility/visibility_manager_metrics.go:81\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerWrapper).RecordWorkflowExecutionClosed\n\t/temporal/common/persistence/visibility/visibility_manager_dual_write.go:94\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).recordCloseExecution\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:421\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processCloseExecution\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:371\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).execute\n\t/temporal/service/history/visibilityQueueTaskExecutor.go:118\ngo.temporal.io/server/service/history.(*visibilityQueueProcessorImpl).process\n\t/temporal/service/history/visibilityQueueProcessor.go:316\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:269\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:221\ngo.temporal.io/server/common/backoff.Retry.func1\n\t/temporal/common/backoff/retry.go:104\ngo.temporal.io/server/common/backoff.RetryContext\n\t/temporal/common/backoff/retry.go:125\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:105\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:248\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:171"}
So all I have to go on here is “Operation failed with an error” and a stack trace.
The stack trace points at updateErrorMetric in visibility_manager_metrics.go:
func (p *visibilityPersistenceClient) updateErrorMetric(scope metrics.Scope, err error) {
    switch err.(type) {
    case *persistence.ConditionFailedError:
        scope.IncCounter(metrics.VisibilityPersistenceConditionFailed)
    case *persistence.TimeoutError:
        scope.IncCounter(metrics.VisibilityPersistenceTimeout)
        scope.IncCounter(metrics.VisibilityPersistenceFailures)
    case *serviceerror.NotFound:
        scope.IncCounter(metrics.VisibilityPersistenceNotFound)
    case *serviceerror.ResourceExhausted:
        scope.IncCounter(metrics.VisibilityPersistenceResourceExhausted)
        scope.IncCounter(metrics.VisibilityPersistenceFailures)
    default:
        p.logger.Error("Operation failed with an error.", tag.Error(err))
        scope.IncCounter(metrics.VisibilityPersistenceFailures)
    }
}
So an error type that none of the explicit cases handle is falling through to the default branch. That branch does pass the error to the logger via tag.Error(err), yet no error field appears in the JSON log lines above, so I still can't tell what the underlying error actually is.
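If I end up patching and rebuilding the server image, I'm considering a local edit along these lines in the default branch, purely to surface the concrete error type (a hypothetical change to my own build, not upstream code; it would also need fmt added to that file's imports):

    default:
        // Hypothetical local edit (my build only): include the concrete Go
        // type of the error in the message so the failing error is
        // identifiable in the logs even if the error tag is dropped.
        p.logger.Error(fmt.Sprintf("Operation failed with an error of type %T.", err), tag.Error(err))
        scope.IncCounter(metrics.VisibilityPersistenceFailures)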
The DB (RDS MySQL) health metrics look fine, and I confirmed that the history service has connectivity to the database: I exec'd into the container, installed a mysql client, and connected to the DB directly (a rough Go equivalent of that check is sketched at the end of this post). Any advice on what I should do next for troubleshooting?
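For reference, the connectivity check I ran amounts to something like this small probe (hypothetical DSN; the real host, user, and database names come from my Temporal config):

    // Minimal MySQL connectivity probe, run from inside the history pod's network.
    // Hypothetical credentials and host; substitute values from the Temporal config.
    package main

    import (
        "database/sql"
        "log"

        _ "github.com/go-sql-driver/mysql" // driver registers itself with database/sql
    )

    func main() {
        db, err := sql.Open("mysql", "temporal:secret@tcp(my-rds-endpoint:3306)/temporal_visibility")
        if err != nil {
            log.Fatalf("open: %v", err)
        }
        defer db.Close()

        // sql.Open is lazy; Ping forces an actual connection to the server.
        if err := db.Ping(); err != nil {
            log.Fatalf("ping: %v", err)
        }
        log.Println("connected to visibility DB OK")
    }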