I’m reading the documentation about synchronous updates, and it’s a little confusing. Some questions:
What is the threading model? Is it the same as for signals and the main workflow function, meaning no concurrent code execution is possible? I assume yes, but the release notes mention a concurrency config value:
history.maxInFlightUpdates controls the number of Updates that can be “in-flight” (that is, concurrently executing, not having completed) for a given Workflow Execution. The default is 10.
The release notes also mention max update events:
history.maxTotalUpdates controls the total number of Updates that a single Workflow Execution can support. The default is 2000
why are update events treated differently from other events?
Do updates execute in the order they are received, like signals?
In this quote from the docs:
You can think of an Update as a synchronous, blocking call that could replace both a Signal and a Query. An update is:
* A Signal that can return a value, and has lower overhead and latency
* A Query that can mutate workflow state
In what way does an update have lower overhead and latency compared to a signal?
How does one make sure all update handlers have completed before calling Continue-As-New? The docs say:
To avoid losing Updates when using Continue-As-New, ensure that all Update handlers have completed before calling Continue-As-New.
If anyone knowledgeable can share some deeper insights into the mechanics of updates, that would be highly appreciated.
To answer your first question, workflows are single-threaded. Conceptually, a workflow will have a list of “tasks” to do, i.e. executing a next step in the main function, responding to signals, and responding to updates. The workflow will process these one at a time.
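To make the single-threaded model concrete, here is a minimal Java SDK sketch (the interface and names are illustrative, not from the docs). The signal handler, the update handler, and the main method all mutate the same field with no locking, because the workflow dispatches them one task at a time:

```java
import io.temporal.workflow.SignalMethod;
import io.temporal.workflow.UpdateMethod;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

@WorkflowInterface
public interface CounterWorkflow {
  @WorkflowMethod
  int run();

  @SignalMethod
  void add(int n);          // fire-and-forget, no return value

  @UpdateMethod
  int addAndGet(int n);     // returns a value to the caller
}

class CounterWorkflowImpl implements CounterWorkflow {
  private int total;        // no synchronization needed: handlers never race

  @Override
  public int run() {
    // The main method blocks here; handlers interleave with it one task
    // at a time, never concurrently.
    Workflow.await(() -> total >= 100);
    return total;
  }

  @Override
  public void add(int n) { total += n; }

  @Override
  public int addAndGet(int n) {
    total += n;
    return total;
  }
}
```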
“In-flight” updates refer to updates that have been requested of the workflow but haven’t been processed by the workflow yet.
Note that because workflows solely handle the logic of an application, they will spend most of their time waiting for something to happen, and then they should very quickly execute in response to an event, and then go back to waiting. So in normal operation you shouldn’t see much of a backlog anyway.
If you do see a backlog, that would mean either:
you’re not running enough workflow workers (check to see if the workers are overloaded)
your workflow is doing CPU intensive operations or IO (it shouldn’t, use activities instead)
you’re sending too many events to the workflow too quickly (I’ve heard “a few per second” and “up to ten per second” as rough guidelines for how quickly a workflow can typically be expected to respond to events).
Thanks @awwx.
Yes, I’m aware of all that, I was just pointing out that the documentation about updates could be clearer in that respect.
@maxim maybe you can shed some light here? Can you explain the mechanics of updates and how they differ from signals (apart from returning a value to the caller)?
Specifically, if you can explain how come updates have lower overhead and latency than signals, that would be very helpful.
Scenario
A request is received
A local activity that handles the request is executed
A timer is scheduled with the interval based on the result of local activity
Using Signal
A signal is sent to the workflow, recorded into the workflow history, and a new workflow task is scheduled. This takes a single state transition (db update).
The workflow task is started. 1 state transition
The worker executes the signal handler and the local activity and schedules a timer. The workflow task completion is recorded. 1 state transition
Total is three state transitions.
Using Update
An update is sent to the workflow and a new speculative workflow task is scheduled. This doesn’t require any state transitions.
The speculative workflow task is started. 0 state transitions.
The worker executes the update handler and the local activity and schedules a timer. The workflow task completion is recorded. 1 state transition
Total is one state transition.
Comparison
So, in scenarios when an update handler can be completed within a single workflow task, the cost of the signal is three times higher than the cost of the update.
In addition, an update can be rejected (using a validator function). A rejected update is not recorded in the workflow history at all. It is not possible to reject a signal, so there are edge cases where too many signals can cause a workflow to exceed its history size limit.
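As a sketch of the validator mechanism in the Java SDK (method and update names here are illustrative): the validator runs before the handler, and throwing from it rejects the update without leaving any trace in history:

```java
import io.temporal.workflow.UpdateMethod;
import io.temporal.workflow.UpdateValidatorMethod;

// Inside a workflow implementation class:
public class AccountWorkflowImpl /* implements AccountWorkflow */ {
  private int balance;

  @UpdateMethod
  public int deposit(int amount) {
    balance += amount;
    return balance;
  }

  // Runs before the handler; throwing here rejects the update, and a
  // rejected update is not written to the workflow history at all.
  @UpdateValidatorMethod(updateName = "deposit")
  public void validateDeposit(int amount) {
    if (amount <= 0) {
      throw new IllegalArgumentException("amount must be positive");
    }
  }
}
```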
Thanks @maxim, that helps! So if I get it right, updates are only recorded to the event history after (and if) they are executed, unlike signals, which are recorded beforehand. Is that correct?
A few follow up questions:
In what scenarios would one prefer a signal over an update (assuming the update can be dispatched asynchronously, i.e. without waiting for the result)? I’m assuming the answer is that a signal gives you a strong guarantee that it is recorded and will be executed, while with an update you have to wait for its result (or rejection) to be confident it wasn’t lost. Right?
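For comparison, a hedged client-side sketch (method names follow recent Java SDK versions; check your SDK release, as the update client API has evolved): a signal returns once it is recorded, while an update can be started asynchronously and waited on only until the ACCEPTED stage, which already confirms the server durably recorded it and the validator didn’t reject it:

```java
import io.temporal.client.WorkflowStub;
import io.temporal.client.WorkflowUpdateHandle;
import io.temporal.client.WorkflowUpdateStage;

// stub is an untyped WorkflowStub for a running workflow (obtained elsewhere).
// Signal: fire-and-forget; durable once this call returns, but no result
// and no validation feedback.
stub.signal("add", 5);

// Update, dispatched without waiting for the handler to finish: waiting
// only for the ACCEPTED stage confirms the update was validated and
// durably recorded.
WorkflowUpdateHandle<Integer> handle =
    stub.startUpdate("addAndGet", WorkflowUpdateStage.ACCEPTED, Integer.class, 5);

// Later, if the result is needed:
Integer total = handle.getResult();
```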
The documentation mentions draining all pending updates before continue-as-new to prevent them from being lost. How do I do that with the java SDK?
If there are pending updates when continue-as-new is called, will the caller be unblocked immediately with an error, or will it hang until timing out?
Many thanks for helping me wrap my head around it!
@maxim any chance to get your feedback on that?
Perhaps the best way for me to understand updates is to see a working example of a real use case. Is there such a thing somewhere?
I would use Signal when the latency of processing is not as important, and there is no need to get any result back, which includes failures to validate the input.
If you buffer updates in some data structure inside the workflow, make sure that this structure is empty when completing the workflow (which includes continue-as-new).
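A sketch of what that can look like in the Java SDK (assuming a recent SDK version; `Workflow.isEveryHandlerFinished()` is the helper I believe exists for this check, and `pendingUpdates` is a hypothetical buffer field):

```java
// Before completing or calling continue-as-new, block until the internal
// buffer is drained and no update/signal handler is still running.
Workflow.await(() -> pendingUpdates.isEmpty() && Workflow.isEveryHandlerFinished());
Workflow.continueAsNew(nextRunInput);
```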
Yes, the caller will be unblocked with an exception. (This might not be released yet).