Hello
We’re using Temporal 1.22, self-hosted in containers on Kubernetes, with Postgres as the state store. Workers are written in TypeScript.
We ran into a few situations where the data we want serialized exceeds the typical gRPC payload size limits. Using DataDog/temporal-large-payload-codec (an HTTP service and accompanying Temporal Payload Codec that lets clients persist large payloads outside of workflow histories) as a reference, we implemented something similar, but backed by Redis instead. It works pretty well for payloads between roughly 2 MB and 50 MB, and sometimes even larger.
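For context, our codec looks roughly like the sketch below. This is a minimal illustration, not our exact code: the ioredis client, the 1 MB offload threshold, the key prefix, and the `binary/redis-large-payload` encoding name are all placeholder choices.

```ts
import { randomUUID } from 'node:crypto';
import Redis from 'ioredis';
import type { Payload, PayloadCodec } from '@temporalio/common';

// Illustrative values only -- not our production settings.
const THRESHOLD_BYTES = 1024 * 1024; // offload anything over ~1 MB
const ENCODING = 'binary/redis-large-payload';

export class RedisLargePayloadCodec implements PayloadCodec {
  constructor(private readonly redis: Redis) {}

  async encode(payloads: Payload[]): Promise<Payload[]> {
    return Promise.all(
      payloads.map(async (payload) => {
        if ((payload.data?.byteLength ?? 0) < THRESHOLD_BYTES) return payload;
        // Store the raw payload bytes in Redis and replace them with a pointer key.
        const key = `temporal:large-payload:${randomUUID()}`;
        await this.redis.set(key, Buffer.from(payload.data!));
        return {
          metadata: {
            ...payload.metadata,
            encoding: Buffer.from(ENCODING),
            'original-encoding': payload.metadata?.encoding ?? Buffer.from('binary/plain'),
          },
          data: Buffer.from(key),
        };
      })
    );
  }

  async decode(payloads: Payload[]): Promise<Payload[]> {
    return Promise.all(
      payloads.map(async (payload) => {
        const encoding = payload.metadata?.encoding
          ? Buffer.from(payload.metadata.encoding).toString()
          : undefined;
        if (encoding !== ENCODING) return payload;
        // Fetch the raw bytes back from Redis and restore the original encoding metadata.
        const key = Buffer.from(payload.data!).toString();
        const data = await this.redis.getBuffer(key);
        if (data == null) throw new Error(`large payload missing from Redis: ${key}`);
        const { encoding: _drop, 'original-encoding': originalEncoding, ...rest } = payload.metadata ?? {};
        return {
          metadata: { ...rest, encoding: originalEncoding ?? Buffer.from('binary/plain') },
          data,
        };
      })
    );
  }
}
```

We register it on both the worker and the client via `dataConverter: { payloadCodecs: [new RedisLargePayloadCodec(redis)] }`.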
Beyond that, the system tends to hit a “Workflow task failed; Script execution timed out after 5000ms” error. Not every time, but predictably enough that it causes problems.
My leading theory is that the codec itself is taking so long to transfer the data to (or from) Redis that some kind of worker deadlock detection is being triggered. This leads to a couple of questions:
- Can this 5000ms deadlock detection timeout be increased? That isn’t a long-term solution, but it would at least band-aid the problem.
- Is there some way to heartbeat to the control plane from the codec while it is encoding/decoding, to satisfy the deadlock detection? One thought was that the `heartbeat` function from the `@temporalio/activity` package could accomplish this, but that of course hits an `Activity context not initialized` error, because we’re not inside an activity (rough sketch of that attempt below).
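For reference, this is roughly what we tried inside the codec’s `encode()`; `copyToRedis` is just an illustrative stand-in for our actual transfer logic:

```ts
import { heartbeat } from '@temporalio/activity';

// Stand-in for our actual Redis transfer logic.
declare function copyToRedis(key: string, data: Buffer): Promise<void>;

async function encodeWithHeartbeat(key: string, data: Buffer): Promise<void> {
  // Throws "Activity context not initialized": the codec runs outside any
  // Activity context, so there is no heartbeat to piggyback on.
  heartbeat();
  await copyToRedis(key, data);
}
```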
Any help in tackling this would be greatly appreciated; thanks!