Does the dotnet SDK have the Session API for complex file processing

Hello, we are trying to manage some complex file processing using the dotnet SDK and are hitting the activity payload limits in our ingestion workflow, where we parse large numbers of file locations out of zip files.

I’ve seen from other questions that you recommend looking at the file processing examples which are in Java and Go. These both use the Session API so data can be cached on the worker in memory and used there. From what I can tell the dotnet SDK does not have this functionality yet.

Is there something similar we can use that’s in the SDK?
Is there a timeline for when the Session API will arrive in the .NET SDK?
If there isn’t anything similar in the SDK that we can use, I presume you would recommend an external cache like Redis. Is that the correct way to go about this?

Thanks for your time

We recommend using “worker-specific task queues” for this purpose across the SDKs; see this sample. Sessions are an older API only present in the earlier SDKs, and they can have confusing semantics around errors and retries. Note that we have an issue open to demonstrate retrying the whole worker-specific-task-queue activity set in this .NET sample, which we’ll get to shortly, but basically it’s just a loop around the work on failure (you can see it in the equivalent Go sample).
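In the .NET SDK the pattern looks roughly like the sketch below (a condensed, hypothetical version of what the linked sample does; names like FileActivities and GetUniqueTaskQueue are made up, and each worker process polls both the shared queue and its own uniquely named queue with the same activities instance registered on both):

```csharp
using System;
using System.Threading.Tasks;
using Temporalio.Activities;
using Temporalio.Workflows;

public class FileActivities
{
    private readonly string uniqueTaskQueue;

    public FileActivities(string uniqueTaskQueue) => this.uniqueTaskQueue = uniqueTaskQueue;

    // Runs on the shared queue and simply reports which worker-specific queue
    // the rest of the work should be routed to.
    [Activity]
    public string GetUniqueTaskQueue() => uniqueTaskQueue;

    [Activity]
    public Task DownloadZipAsync(string url) =>
        Task.CompletedTask; // placeholder: download/unzip onto this worker's disk or into memory

    [Activity]
    public Task ProcessEntriesAsync() =>
        Task.CompletedTask; // placeholder: work against the data cached on this worker
}

[Workflow]
public class IngestionWorkflow
{
    [WorkflowRun]
    public async Task RunAsync(string zipUrl)
    {
        // 1. Ask whichever worker picks this up (via the shared queue) for its unique queue name.
        var workerQueue = await Workflow.ExecuteActivityAsync(
            (FileActivities a) => a.GetUniqueTaskQueue(),
            new ActivityOptions { StartToCloseTimeout = TimeSpan.FromSeconds(10) });

        // 2. Pin the remaining activities to that worker by targeting its task queue.
        var pinned = new ActivityOptions
        {
            TaskQueue = workerQueue,
            StartToCloseTimeout = TimeSpan.FromMinutes(10),
        };
        await Workflow.ExecuteActivityAsync((FileActivities a) => a.DownloadZipAsync(zipUrl), pinned);
        await Workflow.ExecuteActivityAsync((FileActivities a) => a.ProcessEntriesAsync(), pinned);

        // To retry the whole pinned set on failure, wrap steps 1-2 in a loop that picks a
        // fresh worker queue when something throws (that's what the equivalent Go sample does).
    }
}
```

The key piece is that ActivityOptions.TaskQueue lets the workflow route individual activities to a queue that only one worker is polling.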

Cool, thank you for that! I was getting quite confused. Just to confirm and make sure I’m clear: the reason we would want worker-specific task queues for file processing is so we can have in-memory caches of large payloads that we can’t pass through activities? For instance, I could cache 1000s of storage locations (URLs to files) in memory inside the worker-specific activities class, and an activity would be able to read from that?
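For example, I’m imagining something roughly like this (names made up, just to check my understanding):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using Temporalio.Activities;

public class IngestionActivities
{
    // In-memory cache, keyed by batch id, of the blob-storage locations parsed from a zip.
    // It lives for the lifetime of this worker process, so only activities routed to this
    // worker (via its worker-specific task queue) would see it.
    private readonly ConcurrentDictionary<string, IReadOnlyList<string>> locations = new();

    [Activity]
    public string ParseZip(string batchId)
    {
        // Placeholder: download + unzip, collect the file locations, cache them in memory.
        locations[batchId] = new List<string>();
        return batchId;
    }

    [Activity]
    public int CountLocations(string batchId) =>
        // A later activity on the same worker reads straight from the cache.
        locations.TryGetValue(batchId, out var urls) ? urls.Count : 0;
}
```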

File processing is a good use case for worker-specific task queues, specifically when you must have multiple activities run on the same physical machine. But many users do not use local disk for large payloads; they use a blob store of some sort, and in those cases there may be no benefit to forcing multiple activities to run on a single machine.

So if you’re operating on a 4GB file and you have a good reason to split the work into separate activities (e.g. they need to be independently failable/retryable/time-out-able units of work), worker-specific task queues may work. But if you’re operating on a 4GB file and you don’t have a good reason for separate activities, you might as well do all the steps you need in a single activity (heartbeating while you do so). And if you’re operating on a 4MB set of data, a blob store is probably more reasonable than worker-specific task queues, which limit how activities can be distributed.
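For the single-activity option, a rough sketch of what I mean (the download/transform/upload helpers here are placeholders for your own code):

```csharp
using System.Threading.Tasks;
using Temporalio.Activities;

public class SingleActivityProcessing
{
    [Activity]
    public async Task ProcessFileAsync(string sourceUrl)
    {
        // All the steps in one activity: download, transform, upload. Heartbeat between
        // steps so the server can notice a dead worker and retry the whole thing elsewhere
        // (give the activity a HeartbeatTimeout in its ActivityOptions for that to matter).
        var localPath = await DownloadAsync(sourceUrl);
        ActivityExecutionContext.Current.Heartbeat();

        var resultPath = await TransformAsync(localPath);
        ActivityExecutionContext.Current.Heartbeat();

        await UploadToBlobStoreAsync(resultPath);
    }

    // Placeholders standing in for your real download/transform/upload code.
    private Task<string> DownloadAsync(string url) => Task.FromResult("/tmp/input");
    private Task<string> TransformAsync(string path) => Task.FromResult("/tmp/output");
    private Task UploadToBlobStoreAsync(string path) => Task.CompletedTask;
}
```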

Hmmm, thanks for the informed response. I will have to discuss with the team, I think. The work is mostly downloading the data from a server, putting it in blob storage, and then figuring out a few things before we can start individual workflows for each piece of data. The data we are keeping around is simply the locations in blob storage; in a lot of cases I don’t think it will even reach 100 MB.

That process does involve multiple stages of downloads and connections to blob storage because zip files are involved, so it would be nice if we could get individual retries on the activities, but I see your point about the single-activity way of doing things.

Are you saying that if we are working with a smaller amount of data, it would be easier to just load it in and out of blob storage in the activities that need it?

Thanks for all your help! It’s been really great.

This is up to you. You’ll have to measure the cost of repeatedly downloading data in each activity vs. forcing activities to run on the same worker (via worker-specific task queues), which may already have that data on disk from a previous activity.

If the cost of downloading is negligible (in both performance and literal cost), I would encourage downloading as needed in each activity, keeping activities self-contained, distributed, and not concerned with where they run.
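Concretely, that just means passing a small blob-store reference into each activity and letting the activity fetch what it needs itself, e.g. (again a sketch with placeholder helpers):

```csharp
using System;
using System.Threading.Tasks;
using Temporalio.Activities;

public class SelfContainedActivities
{
    [Activity]
    public async Task ProcessOneLocationAsync(string blobKey)
    {
        // Only the small key crosses the activity payload boundary; the data itself is
        // fetched from blob storage inside the activity, so it can run on any worker.
        var bytes = await DownloadFromBlobStoreAsync(blobKey);
        await DoWorkAsync(bytes);
    }

    // Placeholders standing in for your real blob-store client and processing code.
    private Task<byte[]> DownloadFromBlobStoreAsync(string key) => Task.FromResult(Array.Empty<byte>());
    private Task DoWorkAsync(byte[] data) => Task.CompletedTask;
}
```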