For background, we are running a Temporal cluster on AWS, with all Temporal services as ECS tasks and the frontend service behind an ALB. We support a number of different teams and use cases across Golang, Ruby, and Python.
The cluster is working fine generally; however, there is a roughly 0.2% rate of 502 errors in our Python and Ruby clients. I believe the same failures are being reported as “context deadline exceeded” errors in Golang. Golang reports the deadline-exceeded error much less frequently than the 502s show up in the other languages, but I suspect that is because of the retry policies for gRPC requests buried in the SDK.
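To check whether retries are masking the failures, my plan is to hang a logging interceptor off the Go SDK’s gRPC connection so every failed call surfaces its status code. This is only a sketch: the endpoint is a placeholder, I’m assuming a recent SDK where `client.Dial` and `ConnectionOptions.DialOptions` exist, and depending on where the SDK chains its own retry interceptor this may only see the final attempt rather than each retry.

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

func main() {
	// Log the gRPC status of every failed call so that errors the SDK
	// retries away still show up somewhere.
	logFailures := func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		err := invoker(ctx, method, req, reply, cc, opts...)
		if err != nil {
			st, _ := status.FromError(err)
			log.Printf("gRPC %s failed: code=%s msg=%q", method, st.Code(), st.Message())
		}
		return err
	}

	c, err := client.Dial(client.Options{
		HostPort:  "temporal.example.internal:7233", // placeholder for our ALB endpoint
		Namespace: "default",
		ConnectionOptions: client.ConnectionOptions{
			DialOptions: []grpc.DialOption{grpc.WithChainUnaryInterceptor(logFailures)},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
}
```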
The issue I would like advice on has two dimensions:
I believe these errors result from timeouts in the service-to-service communication between the Temporal services, which drop the connection to the ALB, so the Python and Ruby clients receive a 502 from it. I think the “context deadline exceeded” errors from Golang have the same root cause, because no matter what deadline I set on the context passed with a request, when it fails, the deadline is marked as exceeded at 10s, which is the same as the default service-to-service timeout. Does this explanation make sense?
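For reference, this is roughly how I have been checking the timing: set a deadline well above 10s and log how long the call actually ran before failing. The workflow ID is just a placeholder, and `c` is the Temporal client from the snippet above.

```go
import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

// timeOneCall sets a 30s deadline and logs how long the request actually ran
// before failing. "some-workflow-id" is a placeholder.
func timeOneCall(c client.Client) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	start := time.Now()
	_, err := c.DescribeWorkflowExecution(ctx, "some-workflow-id", "")
	if err != nil {
		// If the server-side 10s service-to-service timeout is what fires,
		// this logs roughly 10s even though our own deadline was 30s.
		log.Printf("failed after %s: %v", time.Since(start), err)
	}
}
```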
I would really like a way to know what caused the “context deadline exceeded” error, to make sure there isn’t something else bad happening. I have tried a few ways of getting more data out of the exception, but it is pretty empty. Is there a better way to parse that error for more information? From my reading of the code, I am concerned that the context is simply being cancelled when this error is encountered, so all metadata from the true error is lost.
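This is about as far as I have gotten with unwrapping the error. I am assuming the serviceerror types from `go.temporal.io/api/serviceerror` and the gRPC status package are the right tools here; in practice the status rarely seems to carry anything beyond the code.

```go
import (
	"context"
	"errors"
	"log"

	"go.temporal.io/api/serviceerror"
	"google.golang.org/grpc/status"
)

// inspect tries to distinguish a client-side deadline from one reported by
// the server, and dumps whatever gRPC status survived on the error.
func inspect(err error) {
	if errors.Is(err, context.DeadlineExceeded) {
		log.Println("client-side context deadline was exceeded")
	}
	var de *serviceerror.DeadlineExceeded
	if errors.As(err, &de) {
		log.Printf("server reported deadline exceeded: %s", de.Message)
	}
	if st, ok := status.FromError(err); ok {
		log.Printf("grpc code=%s message=%q details=%v", st.Code(), st.Message(), st.Details())
	}
}
```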
How do I mitigate the issue in the future? If my assumption about the root cause is correct, I need to scale up Temporal; however, it is hardly pulling any resources as it is now. I am throwing a few more container instances at it to see whether the error rate drops, but ideally this is something I can solve with autoscaling. I do not see any real CPU or memory load on the system, which are the easier things to set autoscaling policies against. Is there a way to tell each container to do more? Or is it genuinely that CPU- and memory-efficient, and I need to autoscale on connections or some other network resource instead?
Any advice appreciated. Thank you!