How to best handle mysterious context deadline exceeded/502 errors

For background, we are running a Temporal cluster on AWS, with all Temporal services running as ECS tasks and the frontend service behind an ALB. We support a number of different teams and use cases across Golang, Ruby, and Python.

The cluster is generally working fine; however, we see a roughly 0.2% rate of 502 errors in our Python and Ruby clients. I believe the same failure is being reported as a “context deadline exceeded” error in Golang. Golang reports the deadline exceeded error much less frequently than 502s are reported in the other languages, but I suspect that is a result of the retry policies for gRPC requests buried in the SDK.

The issue I would like advice on has two dimensions:

  1. I believe these errors result from timeouts in the Temporal services’ service-to-service communication, leading to a dropped connection to the ALB, so the Python and Ruby clients receive a 502 from it. I think the context deadline exceeded errors from Golang have the same root cause, because no matter what deadline I set on the passed context for a request, when it fails it is marked deadline exceeded at 10s, which matches the default service-to-service communication timeout. Does this explanation make sense?
    I would really like a way to know what caused the “context deadline exceeded” error, to make sure nothing else bad is happening. I tried a few ways of getting more data out of the exception (see the sketch after this list), but it is pretty empty. Is there a better way to parse that error to get more info? From my reading of the code, I am concerned that the context is simply being cancelled when this error is encountered, so all metadata from the true error is lost.

  2. How do I mitigate the issue in the future? If my assumption about the root cause is correct, I need to scale up Temporal; however, it is hardly using any resources as it is. I am throwing a few more container instances at it to see if the error rate drops, but ideally this is something I could solve with autoscaling. I do not see any real CPU or memory load on the system, which are the easier signals for me to set autoscaling policies against. Is there a way to tell each container to do more? Or is Temporal simply so CPU- and memory-efficient that I need to autoscale on connections or some other network resource instead?
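For reference, this is roughly how I am setting the deadline and trying to inspect the failure. This is only a minimal sketch against the Go SDK client; the workflow name and task queue are placeholders:

package example

import (
    "context"
    "errors"
    "log"
    "time"

    "go.temporal.io/sdk/client"
    "google.golang.org/grpc/status"
)

// startAndInspect starts a workflow with an explicit 60s deadline and then
// tries to pull whatever detail it can out of a failure.
func startAndInspect(c client.Client) {
    ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
    defer cancel()

    // "ExampleWorkflow" and the task queue name are placeholders.
    _, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
        TaskQueue: "example-task-queue",
    }, "ExampleWorkflow")
    if err == nil {
        return
    }

    // Check whether the error chain is the local context's deadline expiring.
    if errors.Is(err, context.DeadlineExceeded) {
        log.Printf("local context deadline hit: %v", err)
    }

    // If the error carries a gRPC status (from the frontend or some hop in
    // between), the code, message, and details are all there is to read.
    if s, ok := status.FromError(err); ok {
        log.Printf("gRPC status: code=%s msg=%q details=%v",
            s.Code(), s.Message(), s.Details())
    }
}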

Any advice appreciated. Thank you!


Hey Tristan,

Context deadline exceeded on the client can be a symptom of a lot of different issues in the server or network layer; it simply means that your request was not replied to in time.

  • Do you have access to the server log?
  • Are there any other errors that you can see?
  • Is there any pattern to these errors?

Thank you for the reply.

I have access to all server logs. The only suspicious events in the same timeframe are “history size exceeds warn limit” warnings, but they are for a different workflow in a different namespace.

I do not think these errors are the result of not replying in time relative to the passed context. I have been setting high deadlines, e.g. 30-60s, and then validating. The requests come back with a deadline exceeded at 10s from the time I initiate the request, regardless of the 30-60s deadline set in the passed context.

I can usually reproduce the issue when I blast the system with 1000 workflow starts within a few seconds. I do see spikes in request volume 1-2 minutes ahead of a failure (from about 5k requests/20s to 9k requests/20s), but not at the same time as a failure. There is no visible impact on CPU or memory in that window.
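The burst looks roughly like this. This is a sketch assuming a recent Go SDK (client.Dial); the endpoint, task queue, and workflow name are placeholders:

package main

import (
    "context"
    "log"
    "sync"
    "time"

    "go.temporal.io/sdk/client"
)

func main() {
    // HostPort would point at the ALB; this value is a placeholder.
    c, err := client.Dial(client.Options{HostPort: "temporal.example.internal:7233"})
    if err != nil {
        log.Fatalf("dial: %v", err)
    }
    defer c.Close()

    const total = 1000      // workflow starts in the burst
    const concurrency = 100 // in-flight starts at any one time

    sem := make(chan struct{}, concurrency)
    var wg sync.WaitGroup
    var mu sync.Mutex
    failed := 0

    for i := 0; i < total; i++ {
        wg.Add(1)
        sem <- struct{}{}
        go func() {
            defer wg.Done()
            defer func() { <-sem }()

            ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
            defer cancel()

            // "ExampleWorkflow" and the task queue name are placeholders.
            _, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
                TaskQueue: "example-task-queue",
            }, "ExampleWorkflow")
            if err != nil {
                mu.Lock()
                failed++
                mu.Unlock()
                log.Printf("start failed: %v", err)
            }
        }()
    }

    wg.Wait()
    log.Printf("%d/%d starts failed", failed, total)
}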

Perhaps I should be increasing some limits to allow it to consume more resources? Are there tuning parameters I can adjust to better absorb the request volume spikes?

I was able to root-cause and resolve this issue. It was the result of connection-age-based recycling by the Temporal server, controlled by these parameters:

frontend.keepAliveMaxConnectionAge:
  - value: 5m
frontend.keepAliveMaxConnectionAgeGrace:
  - value: 70s

What was happening was that the Temporal server was closing connections to the AWS ALB when they aged out, which resulted in clients receiving 502 Bad Gateway errors on their subsequent requests. I adjusted these values to be higher than the lifespan I allow my Temporal server instances, which I redeploy nightly. My 502 error rate dropped from 0.1% with the default settings to 0%.

If you do something similar, I would recommend watching out for hot spotting. I did not see any hot spotting or connection growth as a result of this change, but mileage may vary depending on your load balancer and traffic.


Quick check: “I adjusted these values to be higher than the lifespan I allow my Temporal server instances, which I redeploy nightly”

May I know the values you configured, please? Were they in the range of 24 hours, since the instances were recycled/redeployed nightly?

Correct. Essentially, I delayed the connection-age recycle to longer than the lifespan of the container. I do not think this is a good idea if you use a more typical deployment pattern such as look-aside load balancing or a k8s deployment, but in my case, deployed behind an AWS ALB, it got rid of my 502 error rate.

I added a hot spotting check, but I think the ALB is handling that very well, as the alarm has not gone off.

Thanks. It helps.

Thank you so much for this. For anyone facing similar issues in k8s, I had to set this up in Helm:

server:
  dynamicConfig:
    frontend.keepAliveMaxConnectionAge:
      - value: 48h
        constraints: {}

Update the Helm values, then kill the frontend pod so it restarts with the new value.

I have Linkerd set up on my k8s cluster, so it is not a lack of gRPC load balancing. For some strange reason the connection is always killed early, during a poll, and the clients cannot recover, leaving them hanging, usually for 10-15 minutes, which is obviously terrible for throughput.

I have fairly small throughput requirements, so I only run one replica of the frontend deployment, which means a long max connection age has no load-balancing downside for me. If you have more replicas, I would look for another solution.