For background, we are running a Temporal cluster on AWS, with all Temporal services as ECS tasks and the frontend service behind an ALB. We support a number of different teams and use cases across Golang, Ruby, and Python.
The cluster is generally working fine; however, we see a roughly 0.2% rate of 502 errors in our Python and Ruby clients. I believe the same failure is being reported as a “context deadline exceeded” error in Golang. Golang reports the deadline exceeded error much less frequently than the other languages report 502s, but I think this may be a result of the retry policies for gRPC requests buried in the SDK.
The issue I would like advice on has two dimensions:
I believe these errors result from timeouts in the Temporal services’ service-to-service communication, which causes the connection to the ALB to be dropped, so the Python and Ruby clients receive a 502 from it. I think the context deadline exceeded errors from Golang have the same root cause, because no matter what deadline I set on the passed context for a request, when it fails, the deadline exceeded is reported at 10s, which matches the default service-to-service timeout. Does this explanation make sense?
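For reference, this is roughly what the Go-side call looks like (a minimal sketch; the host, task queue, and workflow names are placeholders):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	// Placeholder frontend address behind the ALB.
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend.example.internal:7233"})
	if err != nil {
		log.Fatalf("unable to create client: %v", err)
	}
	defer c.Close()

	// Deliberately generous 60s deadline on the caller side.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	start := time.Now()
	_, err = c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
		ID:        "deadline-repro",
		TaskQueue: "example-task-queue",
	}, "ExampleWorkflow")
	if err != nil {
		// When this fails it reports deadline exceeded ~10s after the request
		// was issued, not at the 60s deadline set above.
		log.Printf("start failed after %s: %v", time.Since(start), err)
	}
}
```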
I would really like a way to know what caused the “context deadline exceeded” error, to make sure there isn’t something else bad happening. I tried a few ways of getting more data out of the error, but it comes back pretty empty. Is there a better way to parse that error to get more info? From my reading of the code, I am concerned that the context is simply being cancelled when this error is encountered, and thus all metadata from the true error is lost.
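For what it’s worth, this is the kind of inspection I have tried; a sketch, assuming the failed call returned err (the helper name is mine):

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/status"
)

// inspectErr dumps whatever the gRPC layer exposes about a failed request.
func inspectErr(err error) {
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Printf("wraps context.DeadlineExceeded: %v\n", err)
	}
	if s, ok := status.FromError(err); ok {
		// For these failures this is just code=DeadlineExceeded with an empty
		// details list, so the underlying server-side cause is not visible here.
		fmt.Printf("grpc code=%s message=%q details=%v\n", s.Code(), s.Message(), s.Details())
	}
}

func main() {
	// Demo call with a plain deadline-exceeded error.
	inspectErr(context.DeadlineExceeded)
}
```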
How do I mitigate the issue in the future? If my assumption about the root cause is correct, I need to scale up Temporal; however, it is hardly using any resources as it is now. I am throwing a few more container instances at it to see if the error rate drops, but in my ideal world this is something I can solve with autoscaling. I do not see any real CPU or memory load on the system, which are the easier things for me to set autoscaling policies against. Is there a way to tell each container to do more? Or is it in reality just highly CPU- and memory-efficient, and I need to autoscale on connections or some other network resource instead?
Context deadline exceeded on the client can be a symptom of many different issues on the server or network layer; it simply means that your request wasn’t replied to in time.
I have access to all server logs. The only suspicious events in the same timeframe are “history size exceeds warn limit”, but they are for a different workflow in a different namespace.
I do not think these errors are the result of not replying in time relative to the passed context. I have been setting high deadlines, e.g. 30-60s, and then validating. The failures come back with a deadline exceeded 10s from the time I initiate the request, regardless of the 30-60s deadline set on the passed context.
I am usually able to reproduce the issue when I blast the system with 1000 workflow starts within a few seconds. I do see spikes in request volume 1-2 minutes ahead of a failure (from about 5k requests/20s to 9k requests/20s), but not at the same time as a failure. There is no visible impact on CPU or memory in that window.
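The reproduction is essentially a burst like this (a sketch; the host, task queue, and workflow names are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend.example.internal:7233"})
	if err != nil {
		log.Fatalf("unable to create client: %v", err)
	}
	defer c.Close()

	var wg sync.WaitGroup
	var mu sync.Mutex
	failures := 0

	// Fire 1000 workflow starts as close together as possible and count failures.
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
			defer cancel()
			_, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
				ID:        fmt.Sprintf("blast-%d", n),
				TaskQueue: "example-task-queue",
			}, "ExampleWorkflow")
			if err != nil {
				mu.Lock()
				failures++
				mu.Unlock()
				log.Printf("start %d failed: %v", n, err)
			}
		}(i)
	}
	wg.Wait()
	log.Printf("%d of 1000 starts failed", failures)
}
```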
Perhaps I should be increasing some limits to allow it to consume more resources? Are there tuning parameters I can adjust to absorb the request volume spikes better?
What was happening was that the Temporal server was closing its connections to the AWS ALB when they aged out, which resulted in clients receiving 502 Bad Gateway errors on their following requests. I adjusted the connection max-age values to be higher than the lifespan I allow my Temporal server instances, which I redeploy nightly. My 502 error rate dropped from 0.1% with the default settings to 0%.
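For anyone who lands here later: the “aging out” is the gRPC server-side keepalive max connection age on the frontend. A sketch of the underlying grpc-go mechanism (not Temporal’s actual server code, and the durations here are illustrative):

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":7233")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Once a connection is older than MaxConnectionAge (plus jitter), the
		// server sends GOAWAY and, after the grace period, closes it. When the
		// peer is an AWS ALB, the next request on that connection can surface
		// to the HTTP clients as a 502.
		MaxConnectionAge:      5 * time.Minute,
		MaxConnectionAgeGrace: 70 * time.Second,
	}))

	// Raising the max connection age above the lifetime of the container
	// (we redeploy nightly) is what removed the 502s in our setup.
	log.Fatal(srv.Serve(lis))
}
```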
If you do something similar, I would recommend watching out for hot spotting. I did not see any hot spotting or connection growth as a result of this change, but mileage may vary depending on your load balancer and traffic.
Correct. Essentially I delayed the connection age recycle to longer than the lifespan of the container. I do not think this is a good idea if you use a more typical deployment pattern such as look-aside load balancing or a k8s deployment, but in my case, deployed behind an AWS ALB, it got rid of my 502 error rate.
I added a hot-spotting check, but I think the ALB is handling that very well, as that alarm has never gone off.
Update the Helm values and then kill the frontend pod so it restarts with the new settings.
I have Linkerd set up on my k8s cluster, so it’s not a lack of gRPC load balancing. For some strange reason the connection is always killed early, during a poll, and the clients can’t recover, leaving them hanging, usually for 10-15 minutes, which is obviously terrible for throughput.
I have fairly small throughput requirements, so I only have one replica of the frontend deployment set up, which means a long max connection age has no real downside for me. If you have more replicas I would find another solution.