Large parameter into many activities

temporal_user · September 6, 2023, 8:08pm

I am planning a workflow that looks like this:

Activity 1: get a large list of products
Activity 2-N: given the large list of products, do something.

However, I realize this will bloat the event history storage.

Since activities 2-N all operate on the same large list, is there another (Temporal-native) way for them to get the list without passing it in through the params?

maxim · September 6, 2023, 8:33pm

Two options:

1. Pass this list through some external system like Redis or S3.
2. Cache the list at a host (or in process memory) and route activities to that host. If the host goes down, retry the whole sequence from the beginning. See the fileprocessing sample for details.

temporal_user · September 6, 2023, 8:48pm

Thanks Maxim,

Is there an option 3 - query the event history in subsequent activities myself?

temporal_user · September 6, 2023, 8:53pm

If I go with option 1, what would be the recommended way to handle the workflow being replayed after the list has been cleared from the external system? In this scenario, Activity 1, which initially gets the list and sets it in the external system, has succeeded, so it won't be re-executed (if my understanding is correct).

Would it be:

Have each subsequent activity get and set the list if it is not present in the external system?
Have a “get if not set” activity which is always replayed?
Some way to specify that the workflow needs to be replayed from the beginning in this case?

maxim · September 6, 2023, 10:48pm

For (3), you will hit the 2 MB limit on a single activity result. There is also a gRPC request size limit.
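
A rough way to check whether a list would fit under that limit, sketched under assumptions: the 2 MB figure is the one quoted above, the payload is assumed to be JSON-encoded, and `FitsInResult` and `makeList` are hypothetical helpers, not SDK functions. The effective limit on a given cluster may differ depending on its configuration and data converter.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// assumedResultLimit is the 2 MB figure quoted in the thread; treat it as an
// assumption rather than a guaranteed value for every deployment.
const assumedResultLimit = 2 * 1024 * 1024

// FitsInResult reports whether the JSON encoding of products stays within
// assumedResultLimit, and returns the encoded size in bytes.
func FitsInResult(products []string) (bool, int, error) {
	encoded, err := json.Marshal(products)
	if err != nil {
		return false, 0, err
	}
	return len(encoded) <= assumedResultLimit, len(encoded), nil
}

// makeList builds n placeholder items of itemLen bytes each for the demo.
func makeList(n, itemLen int) []string {
	items := make([]string, n)
	for i := range items {
		items[i] = strings.Repeat("x", itemLen)
	}
	return items
}

func main() {
	ok, size, _ := FitsInResult([]string{"a", "b"})
	fmt.Println(ok, size) // prints "true 9"

	ok, _, _ = FitsInResult(makeList(3, 1024*1024))
	fmt.Println(ok) // prints "false"
}
```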

For (1), you can have an activity at the end of your workflow that cleans up the data from the external system. I’m not sure how this relates to workflow replay, which doesn’t re-execute activities.

temporal_user · September 6, 2023, 11:06pm

Thanks for the response.

I think I did not convey my concern well. It was not about cleanup but about how best to handle something going wrong with the external state, such as the value not being found in the Redis cache in a subsequent activity.

maxim · September 7, 2023, 1:09am

Either retry the whole sequence from the beginning or use reliable storage like S3 instead of Redis.

I personally think option (2) is more efficient.

temporal_user · September 7, 2023, 2:35am

Sounds good.

One more question as I consider all the options -

If I can guarantee the list does not exceed the activity result limit or the gRPC request limit, would pulling the list out of the event history with the Go SDK's GetWorkflowHistory call and caching it in process memory be a sound approach?

The advantages being

Does not inflate the event history by passing the list in to subsequent activity params.
No need for S3 or Redis. Not that setting them up is hard, but it is nice to use a single transactional system for the state.
No need to route only to a single worker. Each worker can pull the list out of the event history one time, then keep it in process.

Do I misunderstand the benefits/are there gotchas?

Thanks for the help

maxim · September 7, 2023, 3:07am

If you cache it in the process memory, you must route activities to that worker anyway.
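
A small illustration of why routing still matters with a per-process cache, using a hypothetical `Worker` type in plain Go (no Temporal SDK): each worker process has its own memory, so an activity that lands on a worker that never loaded the list finds nothing there.

```go
package main

import "fmt"

// Worker is a hypothetical model of one worker process with its own
// in-memory cache; caches are not shared between workers.
type Worker struct {
	name  string
	cache map[string][]string
}

func NewWorker(name string) *Worker {
	return &Worker{name: name, cache: map[string][]string{}}
}

// Load caches the list in this worker's memory only.
func (w *Worker) Load(key string, products []string) {
	w.cache[key] = products
}

// Process succeeds only if this particular worker holds the list.
func (w *Worker) Process(key string) (int, error) {
	products, ok := w.cache[key]
	if !ok {
		return 0, fmt.Errorf("worker %s: no cached list for %q", w.name, key)
	}
	return len(products), nil
}

func main() {
	a, b := NewWorker("a"), NewWorker("b")
	a.Load("products/wf-123", []string{"apple", "banana"})

	n, err := a.Process("products/wf-123")
	fmt.Println(n, err) // prints "2 <nil>"

	// The same activity on worker b misses, because b never loaded the list.
	_, err = b.Process("products/wf-123")
	fmt.Println(err != nil) // prints "true"
}
```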

BTW, here is the feature request to support your use case without history growth overhead.
