I am planning a workflow that looks like this:
Activity 1: get a large list of products
Activity 2-N: given the large list of products, do something.
However, I realize this will bloat the event history, because the list is recorded again for every activity it is passed to.
Since activities 2-N operate on the same large list, is there another (Temporal-native) way for them to get the list without passing it in through the params?
Is there an option 3 - query the event history in subsequent activities myself?
If I go with option 1, what would be the recommended way to handle the workflow being replayed after the list has been cleared from the external system? In this scenario, Activity 1, which initially gets the list and sets it in the external system, has succeeded, so it won't be re-executed (if my understanding is correct).
Would it be:
- to have each subsequent activity get and set the list if not present in the external system?
- Can I have a “get if not set” activity which is always replayed?
- A way to specify that the workflow needs to be replayed from the beginning for this case?
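The first option in the list above ("get and set the list if not present") could look roughly like the following. This is a minimal sketch using only the Go standard library, not Temporal code: `Store`, `memStore`, and `getOrLoad` are hypothetical names, and the in-memory map stands in for Redis or any other external system. Each subsequent activity would call something like `getOrLoad` instead of assuming the list is there.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical stand-in for the external system (Redis, S3, ...).
type Store interface {
	Get(key string) ([]string, bool)
	Set(key string, value []string)
}

// memStore is an in-memory Store used only to make the sketch runnable.
type memStore struct {
	mu   sync.Mutex
	data map[string][]string
}

func newMemStore() *memStore { return &memStore{data: make(map[string][]string)} }

func (s *memStore) Get(key string) ([]string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}

func (s *memStore) Set(key string, value []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// Delete simulates the external system losing the entry.
func (s *memStore) Delete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.data, key)
}

// getOrLoad returns the list stored under key. If the entry is missing,
// it rebuilds the list with load and writes it back before returning.
func getOrLoad(s Store, key string, load func() ([]string, error)) ([]string, error) {
	if v, ok := s.Get(key); ok {
		return v, nil
	}
	v, err := load()
	if err != nil {
		return nil, fmt.Errorf("reload %q: %w", key, err)
	}
	s.Set(key, v)
	return v, nil
}

func main() {
	store := newMemStore()
	load := func() ([]string, error) { return []string{"p-1", "p-2"}, nil }

	if _, err := getOrLoad(store, "products:wf-123", load); err != nil {
		panic(err)
	}
	store.Delete("products:wf-123") // simulate the cache being cleared

	list, err := getOrLoad(store, "products:wf-123", load) // reloads transparently
	if err != nil {
		panic(err)
	}
	fmt.Println("items after reload:", len(list))
}
```

One caveat with this approach: if the source of the products changes between loads, a reloaded list may differ from the one the earlier activities saw.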
For (3), you will hit the single activity result limit of 2 MB. There is also a gRPC request size limit.
For (1), you can have an activity at the end of your workflow that cleans up the data from the external system. I’m not sure how this relates to workflow replay, which doesn’t re-execute activities.
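To get a feel for whether a list would stay under such a limit, one rough check is to serialize it and compare the size. A minimal sketch, assuming JSON encoding and taking the 2 MB figure above at face value; `Product`, `maxPayloadBytes`, and `fitsInPayload` are hypothetical names, and the effective limit on a given cluster may be configured differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Product is a hypothetical item type used only for illustration.
type Product struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// maxPayloadBytes mirrors the 2 MB figure quoted in the thread (an assumption).
const maxPayloadBytes = 2 * 1024 * 1024

// fitsInPayload returns the JSON-encoded size of the list and whether that
// size stays below maxPayloadBytes.
func fitsInPayload(products []Product) (int, bool, error) {
	b, err := json.Marshal(products)
	if err != nil {
		return 0, false, err
	}
	return len(b), len(b) < maxPayloadBytes, nil
}

func main() {
	products := make([]Product, 1000)
	for i := range products {
		products[i] = Product{ID: fmt.Sprintf("p-%d", i), Name: "example"}
	}
	size, ok, err := fitsInPayload(products)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encoded size: %d bytes, fits: %v\n", size, ok)
}
```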
Thanks for the response.
I think I did not convey my concern well. It was not about cleanup, but about how best to handle something going wrong with the external state, such as the value not being found in the Redis cache in a subsequent activity.
Either retry the whole sequence from the beginning or use reliable storage like S3 instead of Redis.
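The "retry the whole sequence from the beginning" idea can be sketched as a loop that reruns the fetch step whenever a later step reports that the list is gone. This is plain Go with no Temporal APIs, and the names (`ErrListMissing`, `runSequence`, `runOnce`, `maxRestarts`) are hypothetical; in a real workflow the fetch and the steps would be activity calls and the loop would live in workflow code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrListMissing is a hypothetical sentinel a step returns when the list
// is no longer available in the external store.
var ErrListMissing = errors.New("product list missing from external store")

// runOnce runs fetch and then every step in order, stopping at the first error.
func runOnce(fetch func() error, steps []func() error) error {
	if err := fetch(); err != nil {
		return err
	}
	for _, step := range steps {
		if err := step(); err != nil {
			return err
		}
	}
	return nil
}

// runSequence restarts the whole sequence from fetch when a step reports
// ErrListMissing, allowing at most maxRestarts restarts after the first run.
func runSequence(fetch func() error, steps []func() error, maxRestarts int) error {
	for attempt := 0; ; attempt++ {
		err := runOnce(fetch, steps)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrListMissing) || attempt >= maxRestarts {
			return err
		}
	}
}

func main() {
	stored := false
	lostOnce := false

	fetch := func() error { stored = true; return nil }
	step := func() error {
		if !lostOnce {
			// Simulate the external store dropping the list one time.
			lostOnce, stored = true, false
		}
		if !stored {
			return ErrListMissing
		}
		return nil
	}

	err := runSequence(fetch, []func() error{step}, 2)
	fmt.Println("finished with error:", err)
}
```

Note that restarting from the beginning means any steps that already completed will run again, so they would need to be safe to repeat.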
I personally think (2) is more efficient.
One more question as I consider all the options -
If I can guarantee the list does not exceed the activity result limit or the gRPC request limit, would pulling the list out of the event history with the Go SDK's GetWorkflowHistory call and caching it in process memory be a sound approach?
The advantages would be:
- It does not inflate the event history by passing the list in to subsequent activity params
- No need for S3 or Redis. Not that setting them up is hard, but it is nice to use a single transactional system for the state
- No need to route to a single worker only; each worker can pull the list out of the event history once and then keep it in process
Do I misunderstand the benefits, or are there gotchas?
Thanks for the help
If you cache it in process memory, you must route activities to that worker anyway.
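To illustrate why routing matters, here is a small sketch of a per-process cache in plain Go. `listCache` and its `get` method are hypothetical, and `load` stands in for whatever fetches the list (for example, reading it from the event history). Each cache instance represents one worker process: a hit is only possible on the worker that already loaded the list, so an activity landing on a different worker has to load it again.

```go
package main

import (
	"fmt"
	"sync"
)

// listCache is a hypothetical per-process cache keyed by workflow ID.
type listCache struct {
	mu    sync.Mutex
	lists map[string][]string
}

func newListCache() *listCache {
	return &listCache{lists: make(map[string][]string)}
}

// get returns the cached list for workflowID, calling load only when this
// process has not cached it yet. The lock is held during load, so
// concurrent callers in the same process wait instead of loading twice.
func (c *listCache) get(workflowID string, load func() ([]string, error)) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.lists[workflowID]; ok {
		return v, nil
	}
	v, err := load()
	if err != nil {
		return nil, err
	}
	c.lists[workflowID] = v
	return v, nil
}

func main() {
	loads := 0
	load := func() ([]string, error) {
		loads++
		return []string{"p-1", "p-2"}, nil
	}

	// Two caches model two separate worker processes.
	workerA, workerB := newListCache(), newListCache()

	workerA.get("wf-123", load) // miss on A: loads
	workerA.get("wf-123", load) // hit on A
	workerB.get("wf-123", load) // miss on B: loads again

	fmt.Println("loads:", loads) // prints "loads: 2"
}
```

The sketch never evicts entries; a long-running worker would also need some way to drop lists for workflows that have finished.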
BTW, here is the feature request to support your use case without history growth overhead.