We are facing this issue sporadically every alternate day, we are unable to replicate this issue. We would like to understand the probable cause or reason behind this. We have tried fixes like finetuning connection pool in history configs. Please give some help to get through this issue. We are running in Kubernetes.
Thanks
Message: Fail to process task
Error: Context deadline exceeded
Component: transfer-queue-processor
Type: TransferWorkflowTask
in most cases that have seen this was related to system overload (db overload)
would check your resource exhausted server metric
sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)
and see if need to lower qps limits in dynamic config
qps dynamic configs based on service type:
frontend.persistenceMaxQPS
history.persistenceMaxQPS
matching.persistenceMaxQPS
Also would check if there is spike of persistence operations / errors at those times when you see log
sum(rate(persistence_requests{}[1m])) by (operation)
sum(rate(persistence_error_with_type[1m])) by (operation)
Thank you for your response, will do check and revert back to you as soon as possible.
