I haven’t played around with resets much, but at a glance it seemed to be a good fit, until I read posts like these which say it should normally be avoided. I thought I’d post here to seek guidance.
Eg. the workflow is as follows:
activity-0 -> Produces Y
activity-1 -> Takes Y, fetches X from somewhere, then emits that X as output
activity-2 -> Takes X, does a bunch of stuff, passes X along (maybe with other data, but not relevant here)
activity-3 -> activity-6 -> Do something similar
activity-7 -> Takes X and tries to do a bunch of stuff <<<<< Point of failure
activity-8 .....
So this is a fairly straightforward (linear) workflow, just to demo the general use case. When activity-7 fails, say we need to wait for user intervention. The user decides that they want to redo the part which fetched X, since they have now rectified something elsewhere and X has been modified. So I want to give them the ability in my app to say something along the lines of “Forget all that happened since activity-1 (inclusive), just redo everything from there”. That way we start by fetching (the rectified) X again (everything before activity-1 was good, so we want to retain all that state), everything’s fine this time, and the workflow completes successfully.
So reset from some point <xyz> in the workflow seems like a feature that would allow me to do this. Of course there will be many other kinds of failures which may need resetting from different points (and in some cases it might make sense to just terminate and restart the entire workflow afresh), but that is up to the user.
Is this something Reset is intended for (ie. making it the main part of our application which is running temporal behind the scenes)? If no, then what would you recommend in this case.
Are there any drawbacks/caveats that I should be aware of while advertising such resets as an accessible feature to the users of my app?
If the users say they want to redo the workflow from activity-1, how should I invoke Reset to do that? I’m coding in Python, if it matters. From what I see, the Temporal UI only presents the various workflow-task-completed events to choose from, but not activity names. I’m assuming it would be similar for the CLI and the Python client. So if I only have the name of the activity I want to restart the workflow from (assuming all names are unique), how do I map it to the workflow-task-completed event that led to the scheduling of this activity? There can be branches etc., i.e. activity-1 could have siblings that got scheduled in parallel after activity-0 completed.
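One way to do the name-to-event mapping asked about above is to scan the event history for the activity-task-scheduled event with the wanted activity name and read the workflow-task-completed event ID recorded in its attributes. The sketch below assumes the history is available as a list of dicts in Temporal's protobuf JSON form (field names such as `activityTaskScheduledEventAttributes` and `workflowTaskCompletedEventId`); the helper names, the trimmed sample history, and the sibling `activity-1b` are made up for illustration.

```python
from typing import Any, Dict, Iterable, List, Optional

Event = Dict[str, Any]


def find_reset_event_id(events: Iterable[Event], activity_name: str) -> Optional[int]:
    """Return the ID of the workflow-task-completed event that scheduled
    the first activity with the given name, or None if there is none."""
    for event in events:
        attrs = event.get("activityTaskScheduledEventAttributes")
        if attrs is None:
            continue  # not an activity-task-scheduled event
        if attrs.get("activityType", {}).get("name") == activity_name:
            # Protobuf JSON encodes 64-bit IDs as strings, so convert.
            return int(attrs["workflowTaskCompletedEventId"])
    return None


def activities_scheduled_by(events: Iterable[Event], task_completed_id: int) -> List[str]:
    """List every activity scheduled by the same workflow task, so that
    parallel siblings affected by a reset to that point are visible."""
    names = []
    for event in events:
        attrs = event.get("activityTaskScheduledEventAttributes")
        if attrs and int(attrs["workflowTaskCompletedEventId"]) == task_completed_id:
            names.append(attrs["activityType"]["name"])
    return names


# Hypothetical, heavily trimmed history: activity-1 and a sibling are
# scheduled in parallel by the workflow task that completed at event 10.
sample_history = [
    {"eventId": "4", "eventType": "EVENT_TYPE_WORKFLOW_TASK_COMPLETED"},
    {"eventId": "5", "activityTaskScheduledEventAttributes": {
        "activityType": {"name": "activity-0"}, "workflowTaskCompletedEventId": "4"}},
    {"eventId": "10", "eventType": "EVENT_TYPE_WORKFLOW_TASK_COMPLETED"},
    {"eventId": "11", "activityTaskScheduledEventAttributes": {
        "activityType": {"name": "activity-1"}, "workflowTaskCompletedEventId": "10"}},
    {"eventId": "12", "activityTaskScheduledEventAttributes": {
        "activityType": {"name": "activity-1b"}, "workflowTaskCompletedEventId": "10"}},
]

reset_id = find_reset_event_id(sample_history, "activity-1")
siblings = activities_scheduled_by(sample_history, reset_id)
```

Note that a reset to that event re-executes the whole workflow task, so as far as I understand any siblings scheduled by the same task (here `activity-1b`) would be redone as well.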
Is this something Reset is intended for (ie. making it the main part of our application which is running temporal behind the scenes)?
IMHO no, I would not make the reset feature a big part of your business use case. It is more like a “get out of jail free card” feature, which currently still has some limitations (which are being worked on, from what I understand).
If no, then what would you recommend in this case.
Typically you would use exception handling to handle the activity failure and decide what to do depending on the failure type. For example, wait for an “ack” that the user sends via a signal to say what to do next, then continue-as-new, passing the continued execution the step to resume from and any state from the execution where the activity failed. So basically, use business logic instead (you could consider interceptors if your use case can be implemented more generically).
Sorry, I didn’t fully comprehend. To summarise my problem again: there is a prior activity, activity-1, which gets some external data (say over HTTP). The downstream activities then work on this data. However, at activity-7 (i.e. much later), things in the application fail and the activity throws an exception. So it’ll be retried indefinitely as usual (assuming no upper limit on retries etc.). At this point the user decides the only correct way forward is to change the external data which the Temporal workflow had fetched in activity-1. Once that is changed, all the work done by activity-1 through activity-7 is invalidated (this is how the business logic works), so we must ask Temporal to reset back to activity-1, which will then fetch the updated data from the external source afresh. Its output will change, and all the downstream activities will now work off this new data.
However, there might have been a lot of work done prior to activity-1 which eventually resulted in the input to activity-1. This should not be changed. Reset seems to do exactly that. If I set the reset point to the workflow-task-completed event just before the activity-scheduled event for activity-1, Temporal starts a new workflow run, with all the prior steps being a no-op and producing the exact same outputs up until activity-1. From that point onwards, the workflow progresses as if this were the first time it was executing all those activities.
This is the rewind-and-replay strategy that I want to implement.
Why is Reset bad for this?
How would continue-as-new help in this case (if you can let me know how it can rewind-and-replay from a previous arbitrary step) ?
Reset in itself is a very helpful feature, agreed, but IMO it should still be used when unexpected failures happen, rather than as part of the core business logic flow. As mentioned, it also currently has a number of limitations (all being worked on) which make it difficult to use in a number of scenarios at this time. To use reset you also need to understand the event history to know where to reset the execution to.
For 2) regarding
So it’ll be retried indefinitely as usual (assuming no retry upper limit etc).
At this point the user decides the only correct way is to change the external data which temporal workflow had fetched in activity-1
How do you intend to monitor activity retries and alert users about the ones they need to look at and then decide upon? This is going to be pretty difficult to do on a large scale, as it requires someone to watch for it proactively.
I would rather either restrict activity retries, or raise a non-retryable failure from the activity based on the attempt count or a certain failure type you get from your downstream service, handle this failure in workflow code, then notify your user (via another activity) to provide their decision (by waiting on their signal, for example).
Depending on the user’s decision you can then either fail the execution,
or use continue-as-new, passing in data (as inputs) to the new execution to tell it where to continue from and, if needed, the inputs already calculated for activity-1 so that they do not have to be re-calculated.
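To make the “tell it where to continue from” idea a bit more concrete, here is a minimal plain-Python model of it, not Temporal code. It assumes each step's output is stored by step name, that the user's decision names the step to rewind to, and that outputs of earlier steps are carried over unchanged. In a real workflow the decision would presumably arrive as a signal, and `start` and `kept` would be the inputs passed to continue-as-new. All function and variable names are invented for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Outputs = Dict[str, str]
Step = Tuple[str, Callable[[Outputs], str]]


def run_from(steps: List[Step], start: int, outputs: Outputs) -> Optional[int]:
    """Run steps[start:] in order, recording each output by step name.
    Return the index of the first failing step, or None on success."""
    for index in range(start, len(steps)):
        name, func = steps[index]
        try:
            outputs[name] = func(outputs)
        except Exception:
            return index  # stop here and wait for a user decision
    return None


def rewind(steps: List[Step], outputs: Outputs, to_step: str) -> Tuple[int, Outputs]:
    """Keep only the outputs produced before `to_step`; return the index
    to resume from together with that retained state."""
    names = [name for name, _ in steps]
    start = names.index(to_step)
    kept = {name: outputs[name] for name in names[:start] if name in outputs}
    return start, kept


external = {"X": "bad"}          # stands in for the external data source
calls = {"activity-0": 0}        # counts how often activity-0 really runs


def activity_0(outputs: Outputs) -> str:
    calls["activity-0"] += 1
    return "Y"


def activity_1(outputs: Outputs) -> str:
    return external["X"]         # fetch X from the external source


def activity_7(outputs: Outputs) -> str:
    if outputs["activity-1"] == "bad":
        raise ValueError("X is invalid")
    return "done with " + outputs["activity-1"]


steps = [("activity-0", activity_0), ("activity-1", activity_1), ("activity-7", activity_7)]

outputs: Outputs = {}
failed_at = run_from(steps, 0, outputs)            # activity-7 fails
external["X"] = "good"                             # user rectifies X
start, outputs = rewind(steps, outputs, "activity-1")
failed_again = run_from(steps, start, outputs)     # resumes at activity-1
```

The point of the design is that everything before the rewind step is reused rather than re-run, which is the same effect reset gives, but expressed in workflow logic.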
It also currently as mentioned has a number of limitations […]
Do you happen to know some off the top of your head? I went ahead and implemented it in my application/project and it does everything I was hoping for. I’ll look into your “continue-as-new” suggestion too a bit later and see how the approaches compare.