Our evaluation of Temporal hinges on an important point that we need to figure out. We have an IoT setup where several workers don't have a constant connection to the server.
Is it possible for a worker to receive a workflow execution command from the server, but then lose connectivity to the server for part of the workflow execution (we're talking about several minutes)? Let's say we have a robotic submarine with a PerformExploration workflow (composed of local activities). It receives a few waypoints as parameters that the submarine should visit, but unfortunately we're not able to maintain connectivity during the mission. When the submarine returns to 'base', the connection is reestablished and it'd be able to sync results with the server and poll for the next task.
I have read through the documentation, and the things described in the Heartbeat section seem promising. Assuming we set a long enough heartbeat timeout on the server for these 'disconnected' workflows, the server would be OK. The question is whether the disconnected worker can cope with such a scenario.
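To make the question concrete, this is roughly how I imagined configuring the activity timeouts (a Go SDK sketch; the durations, the activity name PerformLegActivity, and the result type are my assumptions, not tested values):

```go
// Give the activity a whole-mission budget and a heartbeat timeout longer
// than the expected offline window, so the server doesn't give up on it.
ao := workflow.ActivityOptions{
	StartToCloseTimeout: 2 * time.Hour,
	HeartbeatTimeout:    90 * time.Minute,
}
ctx = workflow.WithActivityOptions(ctx, ao)

var r ExplorationResult
err := workflow.ExecuteActivity(ctx, PerformLegActivity, waypoint).Get(ctx, &r)
```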
Thanks in advance
Andras
P.S.: Of course we don't expect miracles; we understand that cancellation, signaling, queries, etc. are not possible while the device is off the radar.
> PerformExploration workflow (composed of local activities), it’s receiving a few waypoints as parameters which the submarine should visit, but unfortunately we’re not able to maintain connectivity during the mission.
> and things described in the Heartbeat section seems to be promising
Local activities cannot heartbeat; they are executed on the worker that's executing the workflow task in which they are invoked. Are all your activities local, or do you have a mix of local and regular activities?
If the submarine worker loses connectivity, the service would time out the workflow task (or the activity, in the case of regular activities) and issue a retry, which would be dispatched to your sub worker when it comes back online so it can resume.
One thing to consider is having your activities check the current position of the sub. In the case of a workflow task timeout, your local activities would be re-executed when the worker comes back online. Similarly, regular activities would time out and the service would dispatch a retry activity task, but the activity code your worker is processing would keep running until it heartbeats or tries to report activity failure/completion, at which point that call would time out.
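The position check described above amounts to making the activity idempotent, so a retry after a timeout becomes a no-op. A minimal self-contained sketch (the types, tolerance, and sensor stub are all illustrative, not Temporal APIs):

```go
package main

import (
	"fmt"
	"math"
)

// Waypoint is a hypothetical coordinate type for illustration.
type Waypoint struct{ Lat, Lon float64 }

// closeTo reports whether two waypoints are within a small tolerance.
func (w Waypoint) closeTo(o Waypoint) bool {
	return math.Abs(w.Lat-o.Lat) < 1e-4 && math.Abs(w.Lon-o.Lon) < 1e-4
}

// currentPosition stands in for an on-board sensor read.
var currentPosition = Waypoint{Lat: 47.4979, Lon: 19.0402}

// NavigateToWaypoint is idempotent: if a timed-out activity is retried
// after the sub has already reached the waypoint, it does nothing.
func NavigateToWaypoint(wp Waypoint) (string, error) {
	if currentPosition.closeTo(wp) {
		return "already-there", nil
	}
	// ...drive the sub here...
	currentPosition = wp
	return "moved", nil
}

func main() {
	first, _ := NavigateToWaypoint(Waypoint{Lat: 47.5, Lon: 19.05})
	retry, _ := NavigateToWaypoint(Waypoint{Lat: 47.5, Lon: 19.05}) // simulated re-execution
	fmt.Println(first, retry) // moved already-there
}
```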
You might also consider changing the default SDK gRPC retry policy, as the default (retry for up to 1 minute with a backoff) might not be optimal for your use case.
My idea is that the PerformExploration workflow is set up to execute on a worker running on the submarine itself, and to only use local activities while the device is disconnected. Since I know that the device battery enables missions no longer than n minutes, I'd set the workflow timeout a bit higher.
Assuming the orchestrator is not timing out the workflow while the device is on a mission, will it be able to post the results and successful completion back to the server when it comes back online?
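Concretely, this is how I'd start the workflow with the padded timeout (a Go SDK sketch; the workflow ID, task queue, and extra margin are assumptions on my side):

```go
// Start PerformExploration with an execution timeout "a bit bigger" than
// the battery-limited mission budget.
we, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
	ID:                       "perform-exploration-sub-7",
	TaskQueue:                "submarine-7",
	WorkflowExecutionTimeout: missionBudget + 10*time.Minute,
}, PerformExploration, waypoints)
```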
Also thanks on the suggestion to tweak the backoff policy of RPC calls! I’ll definitely consider adjusting that.
I don’t think local activities are going to help here. A workflow should not run on the submarine unless it has its own Temporal server running. The standard approach with IoT is to run workflows in the cloud and route activities to devices if necessary. Devices use signals and updates to communicate with workflows.
So in your case the submarine can accumulate data while disconnected and then send it to its corresponding workflow as a signal when it reconnects.
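A sketch of that reconnect step with the Go SDK (the workflow ID, signal name, and payload are illustrative, not part of the thread):

```go
// On reconnect, push the accumulated mission data to the cloud-side
// workflow as a signal.
err := c.SignalWorkflow(context.Background(),
	"mission-planner-sub-7", // workflow ID
	"",                      // run ID; empty targets the latest run
	"mission-results",       // signal name
	accumulatedResults,
)
```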
Thanks for the insights! The reason I was pushing for the submarine to have its own workflows is that it's a pretty autonomous device doing a lot of complex processing. In other words, it could benefit from Durable Execution capabilities.
What I had in mind is a combination of an entity workflow running in the cloud (let's call it MissionPlanner) that receives ExplorationRequests (waypoints to visit). It would break them down into convenient-sized missions and schedule them on the submarines by spawning a child workflow (let's call it ExecuteMission) and waiting for completion (success, timeout, failure, etc.). With these results, it can compose the ExplorationResult that the end user will browse.
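The parent/child part of this plan would look roughly like this in the Go SDK (the task queue name and the MissionResult type are my assumptions):

```go
// Inside the MissionPlanner workflow: spawn ExecuteMission as a child
// workflow routed to a specific sub's task queue, and wait for its result.
cwo := workflow.ChildWorkflowOptions{
	TaskQueue: "submarine-7",
}
cctx := workflow.WithChildOptions(ctx, cwo)

var result MissionResult
err := workflow.ExecuteChildWorkflow(cctx, ExecuteMission, mission).Get(cctx, &result)
```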
The ExecuteMission workflow can run on the submarine, and some time after starting it, the submarine will lose connectivity to the server. Based on the documentation about local activities (Local Activity | Temporal Platform Documentation), I had the impression that if the ExecuteMission workflow only spawns local activities while disconnected, they would be scheduled on the worker's task queue and no server connection would be needed. And before ExecuteMission finishes, the connection will be available again so results can be communicated back.
Based on your comment, I have the feeling that my line of thinking has a weak spot. Could you please help me understand why exactly this approach would run into issues?
You mentioned another interesting alternative: the submarine could run its own Temporal instance, so disconnection would not be an issue. Is this a use case for Temporal Nexus to integrate submarine logic and cloud logic?
Local activities weren’t designed for your use case. They also don’t provide any durability if there is no connection to the backend Temporal service.
Yes, I would use Nexus to integrate multiple Temporal clusters. If the IoT device is powerful enough to run its own Temporal service (I assume the single-binary one), then go for it.
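For reference, the single-binary option can be started with the Temporal CLI; something like this (the flags and file path are illustrative, assuming a recent CLI version):

```shell
# Run the self-contained dev server on the device, without the web UI,
# persisting state to a local SQLite file so it survives restarts.
temporal server start-dev \
  --headless \
  --db-filename /var/lib/temporal/sub.db
```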