Codec for handling large payloads encounters a script execution timeout after 5000ms

Hello :slight_smile:

We’re using Temporal 1.22, self-hosted in containers on Kubernetes, with Postgres as the state store. Workers are in TypeScript.

We ran into a few situations where the data we ideally want serialized exceeds the typical gRPC payload size limits. Using this project as a reference (GitHub - DataDog/temporal-large-payload-codec: HTTP service and accompanying Temporal Payload Codec which allows Temporal clients to automatically persist large payloads outside of workflow histories.), we implemented something similar, but backed by Redis instead. And it works pretty well for payloads between ~2 MB and ~50 MB, and sometimes even larger.

Beyond that, the system tends to run into a “Workflow task failed; Script execution timed out after 5000ms”. Not every time, naturally, but predictably enough that it causes problems.

My leading theory is that the codec itself is taking so long to transfer the data to (or from) Redis that some kind of worker deadlock detection is being triggered. So, this leads to a couple of questions:

  1. Can this 5000ms deadlock detection be increased? Naturally that isn’t a long-term solution, but it would at least band-aid the problem.

  2. Is there some way to heartbeat to the control plane from the codec while the codec is encoding/decoding, to satisfy the deadlock detection? One thought was that the heartbeat function from the @temporalio/activity package could accomplish this, but that of course hits an `Activity context not initialized` error, because we’re not inside an activity.

Any help in tackling this would be greatly appreciated; thanks!

Hi! Payload codecs execute outside the workflow sandbox, whereas the “script execution timed out after 5000ms” error relates to execution time inside the sandbox, so this can’t be time spent in the codec.
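
As for question 1: as far as I know the TS SDK doesn’t expose a knob to raise the limit, but the `debugMode` worker option disables deadlock detection entirely (it exists to support stepping through workflows in a debugger). A sketch of the wiring, with hypothetical task queue name and workflows path — band-aid only, since a genuinely stuck workflow task can then block the worker thread indefinitely:

```typescript
import { Worker } from '@temporalio/worker';

// debugMode: true turns off the sandbox's deadlock detector
// (the source of the "Script execution timed out after 5000ms" error).
const worker = await Worker.create({
  taskQueue: 'my-task-queue',                    // hypothetical
  workflowsPath: require.resolve('./workflows'), // hypothetical
  debugMode: true,
});
await worker.run();
```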

Can you provide some details on how you implemented this large payloads codec in TS? Do you also have custom payload converters that could be taking a long time to execute?

Thanks for the response. The gist of our codec is:

export const MyCodec = {
  encode: async (payloads: Payload[]): Promise<Payload[]> => {
    return await Promise.all(
      payloads.map(async (payload) => {
        if (!payload.data || payload.data.length <= MINIMUM_SIZE) {
          return payload;
        }
        const key = uuid();
        await redis.set(key, Buffer.from(payload.data), {
          EX: 24 * 60 * 60,
          NX: true,
        });
        // Only `data` is offloaded; the original payload's metadata is
        // dropped, so this assumes json/plain payloads.
        return defaultDataConverter.payloadConverter.toPayload({
          "largepayload:rediskey": key,
        });
      }),
    );
  },
  decode: async (payloads: Payload[]): Promise<Payload[]> => {
    return await Promise.all(
      payloads.map(async (payload): Promise<Payload> => {
        const decoded = defaultDataConverter.payloadConverter.fromPayload(payload) as any;
        if (!decoded || !decoded["largepayload:rediskey"]) {
          return payload;
        }
        const value = await redis.get(decoded["largepayload:rediskey"]);
        if (!value) {
          throw new Error(`large payload missing from Redis: ${decoded["largepayload:rediskey"]}`);
        }
        return {
          data: Buffer.from(value),
          metadata: payload.metadata,
        };
      }),
    );
  },
};

It is pretty much that simple. The DataDog repo spins up a separate HTTP service to handle the encoding/decoding; we just do it directly within the worker code. `redis` in that example comes from the Node.js redis library, and we’re using the defaultDataConverter from @temporalio/common.
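
For completeness, the codec is wired in on both the worker and the client via the `dataConverter` option (task queue name and workflows path below are hypothetical; our actual setup differs slightly):

```typescript
import { Worker } from '@temporalio/worker';
import { Client, Connection } from '@temporalio/client';

// Both sides must share the codec so payloads round-trip correctly.
const dataConverter = { payloadCodecs: [MyCodec] };

const worker = await Worker.create({
  taskQueue: 'my-task-queue',                    // hypothetical
  workflowsPath: require.resolve('./workflows'), // hypothetical
  dataConverter,
});

const client = new Client({
  connection: await Connection.connect(),
  dataConverter,
});
```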

This timeout coming from within the workflow sandbox rather than the codec encoding/decoding makes sense to me. I’m fairly confident that occurrences of the timeout correlate with the size of the payloads being encoded/decoded. We’ve encountered the timeout in situations where the workflow is as simple as:

const { MyActivity } = proxyActivities<typeof activities>({...});
export async function MyWorkflow(input) {
  return MyActivity(input);
}

Put another way: a workflow that is literally a single activity call (why we have a workflow like this is unrelated and more the result of a past architecture decision; suffice to say that parent workflows do stuff with the data returned from this). The activity executes and returns 60–100 MB of data; the codec encoder runs and successfully proxies the payload to Redis; then the workflow history indicates that a new MyWorkflow task is scheduled, starts, and hits the 5000ms timeout.

So my assumption is that whatever is timing out relates specifically to decoding the payload on the workflow side (or maybe re-encoding it again for transfer to a parent workflow?). But I’ll admit that my fundamental understanding of where/how codecs fit into the internals of the system is a bit hazy.

Sorry, that was a lot; I’m super appreciative for your help and any thoughts you might have :slight_smile:

I would advise against the design that blindly uses a codec to offload large payloads to Redis. The problem is that loading such large payloads into the workflow is unnecessary in most situations and makes replay extremely slow.

I recommend implementing a solution that allows passing references to these payloads seamlessly from one activity to another.
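
A minimal sketch of that shape — the names and the in-memory store are stand-ins (in practice the store would be Redis/S3 and these functions would be registered as activities); only the small reference ever crosses workflow history:

```typescript
// Stand-in blob store; in practice this would be Redis or S3.
const blobStore = new Map<string, unknown>();

interface BlobRef {
  key: string;
}

// "Activity" that produces a large result and returns only a reference.
async function produceLargeResult(): Promise<BlobRef> {
  const large = Array.from({ length: 1000 }, (_, i) => i); // pretend this is 100 MB
  const key = `blob-${blobStore.size}`;
  blobStore.set(key, large);
  return { key };
}

// "Activity" that consumes by reference; the workflow never sees the data.
async function sumLargeResult(ref: BlobRef): Promise<number> {
  const data = blobStore.get(ref.key) as number[];
  return data.reduce((a, b) => a + b, 0);
}

// Workflow-shaped orchestration: only BlobRef is serialized, never the payload.
async function myWorkflow(): Promise<number> {
  const ref = await produceLargeResult();
  return await sumLargeResult(ref);
}
```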

Serializing/deserializing JSON documents of 100 MB certainly takes more CPU time than simple 1 KB documents. Not much to worry about if you only have one such document, but it is possible that several workflow tasks doing that concurrently cause slowdowns, so that you eventually hit the 5000 ms limit. Have you looked at your CPU usage metrics?
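
As a rough illustration (timings are machine-dependent; the point is that the work is synchronous and blocks the thread for its full duration):

```typescript
// JSON.stringify/JSON.parse are synchronous: while they run, the thread can
// do nothing else -- including whatever the deadlock detector is waiting on.
const big = Array.from({ length: 100_000 }, (_, i) => ({ id: i, value: `item-${i}` }));

const t0 = Date.now();
const text = JSON.stringify(big);
const roundTripped = JSON.parse(text);
const elapsedMs = Date.now() - t0;
// elapsedMs grows roughly linearly with payload size; at tens of MB it can
// plausibly approach or exceed a 5000 ms budget on a busy worker.
```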

By the way, from what you are describing, you don’t actually need the payload (i.e. the decoded and converted payload) in your workflow… Can’t you simply pass a pointer to the payload, without actually “loading” it in your workflow worker?

To the second point: we have an architecture that looks something like:

ParentWorkflow
|- MyWorkflow
|- ChildWorkflowTypeB
|- ChildWorkflowTypeC

So there are a few different types of children that this parent runs; sometimes it’ll run one type, other times multiple instances of one type, or even multiple instances of multiple types, concurrently. Some of these child types are more complicated than others, or do more things in the workflow versus in activities, but all the child types return the same schema of data. The MyWorkflow in my above comment is one of these child types.

So the parent fans out, waits for all the children to finish and return their T[], concats each child’s T[] into one big T[], then does Math And Such on that, within the context of ParentWorkflow. CPU utilization during that whole process certainly gets high; and it would not surprise me if the deadlock is being triggered by a JSON de/serialize.

It’s a situation where we could have these children return redis/s3/etc. references to the data they generate, have the parent workflow collect all the references, then pass those references into an activity to do the Math And Such. My instinct is to feel a little sad at this, because while proxying large payloads into Redis means transferring lots of data, parsing all that JSON chews up CPU, and replay is slow, what all that angles toward, I think, is just “it’s slow”. We’re totally cool with this process being slow, even very slow, for the 0.01% of our payloads that hit this size. The vast majority of the payloads this process generates are measured in kilobytes and don’t even need to use the codec. Keeping the Math And Such within the workflow gets us some cool benefits, like being able to see the data in the Temporal dashboard (for smaller payloads at least), replay, etc.
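
For concreteness, the refactor I’m describing would look something like this sketch — everything here is hypothetical (the ref type, the in-memory store standing in for Redis/S3, and the placeholder computation); the produce/aggregate functions would be activities and the fan-out would live in ParentWorkflow:

```typescript
// Stand-in store; in practice Redis/S3 keyed by something durable.
const store = new Map<string, number[]>();

type Ref = { key: string };

// Child-side "activity": write the child's T[] to the store, return a ref.
async function childProduce(key: string, data: number[]): Promise<Ref> {
  store.set(key, data);
  return { key };
}

// Parent-side "activity": load all refs, concat, do the Math And Such.
async function mathAndSuch(refs: Ref[]): Promise<number> {
  const all = refs.flatMap((r) => store.get(r.key) ?? []);
  return Math.max(...all); // placeholder for the real computation
}

// Parent "workflow": fan out, collect small refs, aggregate in one activity.
async function parentWorkflow(): Promise<number> {
  const refs = await Promise.all([
    childProduce('child-a', [1, 5, 3]),
    childProduce('child-b', [9, 2]),
  ]);
  return await mathAndSuch(refs);
}
```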

Assuming it’s the JSON de/serialize: is there no way to adjust the 5000 ms workflow timeout? Or, and this is a bigger think, the fact that a large JSON de/serialize can trip the 5 s limit at all feels weird. The workflow/worker isn’t dead; the CPU is actively doing stuff; it’d be cool to be able to interrupt that work to heartbeat or do whatever the worker needs to keep the workflow going (but that’s probably weird/difficult, at least in Node/V8; I’m just brainstorming).