How to best handle mysterious context deadline exceeded/502 errors

Thank you for the reply.

I have access to all server logs. The only suspicious events in the same timeframe are “history size exceeds warn limit” warnings, but they are for a different workflow in a different namespace.

I do not think these errors are caused by the server failing to reply within the deadline on the context I pass. I have been setting long deadlines (30-60s) and then checking the timing. The requests come back with deadline exceeded about 10s after I initiate them, regardless of the 30-60s deadline set on the context.
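A minimal sketch of that kind of timing check, using only the Go standard library. `timedCall`, `Report`, `fakeStart`, and `runDeadlineDemo` are hypothetical names for illustration, and `fakeStart` merely simulates a shorter timeout applied somewhere downstream; it is not a real client call. The idea is that if a deadline error comes back while the caller's own context is still alive, the timeout was imposed elsewhere.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Report records how one call behaved relative to the caller's deadline.
type Report struct {
	Elapsed time.Duration // wall-clock time the call took
	Budget  time.Duration // time the caller's context allowed at call start
	Err     error         // error returned by the call, if any
	Early   bool          // deadline error arrived before the caller's deadline
}

// timedCall runs call and notes whether a deadline error came back
// while the caller's context had not yet expired.
func timedCall(ctx context.Context, call func(context.Context) error) Report {
	start := time.Now()
	var budget time.Duration
	if dl, ok := ctx.Deadline(); ok {
		budget = dl.Sub(start)
	}
	err := call(ctx)
	r := Report{Elapsed: time.Since(start), Budget: budget, Err: err}
	// ctx.Err() == nil means the caller's deadline has not fired, so the
	// timeout must have come from somewhere other than the passed context.
	r.Early = errors.Is(err, context.DeadlineExceeded) && ctx.Err() == nil
	return r
}

// fakeStart is a hypothetical stand-in for the real start request.
// It simulates a shorter timeout applied downstream of the caller.
func fakeStart(ctx context.Context) error {
	inner, cancel := context.WithTimeout(ctx, 20*time.Millisecond)
	defer cancel()
	<-inner.Done()
	return inner.Err()
}

// runDeadlineDemo calls fakeStart with a generous 2s caller deadline.
func runDeadlineDemo() Report {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return timedCall(ctx, fakeStart)
}

func main() {
	r := runDeadlineDemo()
	fmt.Printf("elapsed=%v budget=%v early=%v err=%v\n",
		r.Elapsed.Round(time.Millisecond), r.Budget.Round(time.Millisecond), r.Early, r.Err)
}
```

With a gRPC-based client the error is likely a status error rather than `context.DeadlineExceeded`, so the `errors.Is` check would probably need to be replaced with a comparison on the gRPC status code.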

I am usually able to reproduce the errors when I blast the system with 1000 workflow starts within a few seconds. I do see spikes in request volume 1-2 minutes ahead of a failure (from about 5k requests/20s to 9k requests/20s), but not at the same time as a failure. There is no visible impact on CPU or memory in that window.
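A rough sketch of a burst like that, again using only the Go standard library. `burst`, `Result`, `simulatedStart`, and `runBurstDemo` are hypothetical names, and `simulatedStart` is a placeholder that fails every 100th call; in a real test it would be swapped for the actual workflow start request.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Result is the outcome of one start request in the burst.
type Result struct {
	ID      int
	Elapsed time.Duration
	Err     error
}

// burst fires n calls concurrently and records per-call latency and error.
func burst(ctx context.Context, n int, start func(context.Context, int) error) []Result {
	results := make([]Result, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			begin := time.Now()
			err := start(ctx, id)
			// Each goroutine writes only to its own slot, so no lock is needed.
			results[id] = Result{ID: id, Elapsed: time.Since(begin), Err: err}
		}(i)
	}
	wg.Wait()
	return results
}

// countFailures returns how many calls in the burst returned an error.
func countFailures(results []Result) int {
	failed := 0
	for _, r := range results {
		if r.Err != nil {
			failed++
		}
	}
	return failed
}

// simulatedStart is a hypothetical placeholder for the real start call.
// It fails every 100th request so the demo has something to count.
func simulatedStart(_ context.Context, id int) error {
	if id%100 == 99 {
		return errors.New("simulated failure")
	}
	return nil
}

// runBurstDemo sends 1000 simulated starts and reports totals.
func runBurstDemo() (total, failed int) {
	results := burst(context.Background(), 1000, simulatedStart)
	return len(results), countFailures(results)
}

func main() {
	total, failed := runBurstDemo()
	fmt.Printf("sent=%d failed=%d\n", total, failed)
}
```

Recording the elapsed time per call makes it possible to see whether the failures cluster around a fixed duration such as 10s.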

Perhaps I should be increasing some limits to allow the server to consume more resources? Are there tuning parameters I can adjust to better absorb the spikes in request volume?