Hi, thanks for such a good list of questions!
Q1.1. Should a workflow be created per shopping cart or should the workflow be created only once checkout has started?
Having an always-running workflow per shopping cart allows implementing all sorts of features, like notifying a customer that the price of an item in their cart has changed. The checkout flow looks like a separate workflow to me. By the way, have you seen this series of blog posts about shopping cart workflows?
Q1.2. Should the cart contents be persisted by the web-layer to a database and the identity of the cart be published to the workflow? Or should the cart contents be submitted to the workflow which is then responsible to send the information to the different activities (through signals)? Changes to the cart via signals?
You can use both approaches. I believe that storing all the information in the workflow leads to a much simpler implementation. See this blog post.
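To illustrate the second approach, here is a minimal sketch (in the Go SDK; signal, query, and type names are my own, not anything from the thread) of a cart workflow that keeps the cart contents in workflow state, mutates them through signals, and exposes them through a query:

```go
package cart

import (
	"go.temporal.io/sdk/workflow"
)

// CartItem is an illustrative payload for the "addItem" signal.
type CartItem struct {
	SKU      string
	Quantity int
}

// CartWorkflow keeps the cart contents as local workflow state.
// Signals update it; a query handler reads it.
func CartWorkflow(ctx workflow.Context) error {
	items := map[string]int{} // SKU -> quantity

	addCh := workflow.GetSignalChannel(ctx, "addItem")
	checkoutCh := workflow.GetSignalChannel(ctx, "checkout")

	// Expose the current contents to the web layer through a query.
	if err := workflow.SetQueryHandler(ctx, "getItems", func() (map[string]int, error) {
		return items, nil
	}); err != nil {
		return err
	}

	for {
		checkedOut := false
		selector := workflow.NewSelector(ctx)
		selector.AddReceive(addCh, func(c workflow.ReceiveChannel, more bool) {
			var item CartItem
			c.Receive(ctx, &item)
			items[item.SKU] += item.Quantity
		})
		selector.AddReceive(checkoutCh, func(c workflow.ReceiveChannel, more bool) {
			c.Receive(ctx, nil)
			checkedOut = true
		})
		selector.Select(ctx)
		if checkedOut {
			break
		}
	}
	// At this point the checkout could be started, e.g. as a child workflow.
	return nil
}
```

The web layer then only needs the cart's workflow id to signal or query it; no separate database table for cart contents is required.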
Q1.3. Where a web application server is integrating with the backend layer, should the appserver trigger workflows directly? (In J.EE it would be in some controller, or it may be a rest service that starts a process, or some UI application).
I’m not an expert on web application servers, but I believe the app server should trigger/signal/query workflows directly.
Q1.4. How would a workflow return a result to the originating web server, which might push the result back to a client (HTTP/2, WebSocket, etc.)? The final step in a workflow would call an activity which is responsible for triggering a callback to a server, which in turn triggers the HTTP/2 push/WebSocket push/Diameter push/etc. The actual callback destination would need to be passed into the workflow either by configuration or as a parameter. This is almost obvious.
Yes, an activity can be used to push the result back. Temporal supports routing activity tasks to specific hosts, so the result activity can execute on the same host that received the original request.
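A hypothetical sketch of that routing, assuming the Go SDK (the queue and activity names are illustrative, not Temporal built-ins): the web host runs a worker polling a task queue named after itself and passes that queue name to the workflow as input.

```go
package push

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// RequestWorkflow receives the originating host's task queue name
// (e.g. "push-results-web-42") as an input parameter.
func RequestWorkflow(ctx workflow.Context, hostTaskQueue string) error {
	// ... main business logic runs on the regular, shared task queue ...

	// Route the final activity to the originating host's queue, so the same
	// process that holds the HTTP/2 or WebSocket connection performs the push.
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		TaskQueue:           hostTaskQueue,
		StartToCloseTimeout: time.Minute,
	})
	return workflow.ExecuteActivity(ctx, "PushResultToClient", "result").Get(ctx, nil)
}
```

The trade-off is that a host-specific queue is only served by that one host, so the activity's retry options should account for the host going away.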
Q1.4.1. Just confirming that there is no stateful client identifier that can be used for bidirectional communication back to that specific initiating server.
Not yet. But it can be implemented using task routing. We also plan to add support for synchronous updates directly through the Temporal service API. You can think of them as queries that are allowed to update workflow state.
Q2.1. Should a workflow be created per user for which bank collections are taking place, or should the workflow define and manage the bulk steps only? In the scenario there are 500,000 accounts for which bank collections must take place. That would mean there may be 500,000 workflows publishing information to some sort of collector, which publishes a batch file to a bank and then receives a response for processing.
I would create a workflow execution per account in this case. We have tested the system with hundreds of millions of open workflows, so half a million is not that much.
Q2.2. If there are a number of accounts to be processed (e.g. 543,212), a workflow is initiated to trigger activities that pick up the 543,212 accounts and split them into 500 groups of 100, each group then having a workflow that manages 1000 accounts. Control is effectively top down if the workflows are managed as “child workflows”. Is this a viable pattern?
It depends. In the majority of cases, such batch jobs are better implemented by not starting all the workflows simultaneously. I would use the iterator workflow pattern: the workflow loads a page of account ids through an activity, executes child workflows for all ids in the page, and then calls continue-as-new with the last page token. The next run of the workflow does the same with the next page. This limits parallelism.
In the rare case when all the workflows do have to be started at once, for example because each of them can take a long time, a tree of child workflows is the way to go.
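The iterator pattern above can be sketched roughly like this in the Go SDK (the `LoadAccountPage` and `ProcessAccount` names and the `Page` type are illustrative assumptions):

```go
package batch

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// Page is an illustrative result of the paging activity.
type Page struct {
	AccountIDs    []string
	NextPageToken string
}

// AccountBatchWorkflow processes one page of accounts per run, then
// calls continue-as-new with the next page token. Parallelism is
// bounded by the page size, and history stays bounded per run.
func AccountBatchWorkflow(ctx workflow.Context, pageToken string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	// Load one page of account ids through an activity.
	var page Page
	if err := workflow.ExecuteActivity(ctx, "LoadAccountPage", pageToken).Get(ctx, &page); err != nil {
		return err
	}

	// Execute a child workflow per account id in the page, in parallel.
	futures := make([]workflow.ChildWorkflowFuture, 0, len(page.AccountIDs))
	for _, id := range page.AccountIDs {
		futures = append(futures, workflow.ExecuteChildWorkflow(ctx, "ProcessAccount", id))
	}
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}

	if page.NextPageToken == "" {
		return nil // all pages processed
	}
	// Restart the workflow with the next page token.
	return workflow.NewContinueAsNewError(ctx, AccountBatchWorkflow, page.NextPageToken)
}
```

This also directly answers Q2.3: the "spawn a workflow once all dependents have completed" step is just whatever code runs after the last page finishes.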
Q2.2.2. In the scenario where information is published into the workflow, and a “data” fix needs to be applied by an operations team due to an unforeseen issue, how should this be achieved? (I see this is effectively the downside of managing the information within the flow.)
It depends on the use case. In some cases, a signal can be used to update the “data”. In others, the reset feature, which rolls back a workflow’s state to some point of its execution, is more appropriate.
Q2.3. If there are a number of accounts to be processed (e.g. 543,212), each has its account code on a message queue. An activity will process the queue and publish one workflow per message. I want to spawn a workflow once all of the dependent workflows have been successfully processed. How do I achieve this?
In this case, you don’t want independent workflows; use either the iterator pattern or a tree of child workflows instead.
Q2.4. In Q2.3 we advocate one workflow per account. This might include activities like communicating with customers, triggering workflows to update bank details, having different workflows per bank type, etc. Is the above an anti-pattern where we are using the wrong tool for the job, or is this an acceptable use case?
This use case looks like a good fit for Temporal.
Q3.1. When 100,000 workflows submitted by customers are failing because a dependent system is failing, I want to ensure that, after recovery, all newly created workflows submitted by customers get processed before the workflows that are in a failed state. How can this be achieved without delays on retries? I want to fail fast, but then resume eagerly with lower priority than newly created workflows.
I don’t think you should have any workflows in a failed state when some downstream system is down. They should all keep retrying the failed activity until the dependency is fixed.
Temporal doesn’t yet support such dynamic priorities. For static ones, you can use different task queues for different priorities.
Q3.2. When 100,000 workflows are submitted and I want to find a subset of these to be influenced/patched: I understand that we need Elasticsearch to be able to query a fine-grained subset, or we have the state of the workflow published to an aggregate table. What would be the best way to modify/patch the dataset? From my understanding this is not possible; we would have to release a new version of the code and then have that code either accept a signal or introduce an additional activity that introduces the new information.
I’m not sure what you mean by modifying/patching the dataset. Usually a signal is used to notify workflows about changes in the external world. You can use the batch signal feature to notify multiple workflows with a single command.
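A hedged example of the batch signal feature using tctl v1 (the workflow type, signal name, payload, and visibility query are all illustrative assumptions about your setup):

```shell
# Signal every running AccountWorkflow in one batch operation.
# Requires advanced visibility (Elasticsearch) for the --query filter.
tctl batch start \
  --query "WorkflowType='AccountWorkflow' AND ExecutionStatus='Running'" \
  --batch_type signal \
  --signal_name dataFix \
  --input '{"patched": true}' \
  --reason "apply data fix to affected accounts"
```

Each matched workflow then handles the `dataFix` signal in its own code, which is where the actual "patch" logic lives.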
Q3.3. When introducing new versions, my understanding is that workflows are “replayed” based on the stored state. Say I had the following.
…
If the workflow failed on Beta, the reprocessing does not actually “remove” the second Alpha step; it has already happened. In the new version, how does the playback of the “removed” alpha a2 impact the result of alpha a7? Will a7 be the same as a2?
You have to keep both versions of the code as long as workflows that used the original version are still present in the system. See this documentation page, which explains how versioning works.
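Applied to your Alpha/Beta example, a sketch using the Go SDK's GetVersion API (the activity names and change id are illustrative): old executions deterministically replay the removed step, while new executions skip it.

```go
package versioning

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

func OrderWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	if err := workflow.ExecuteActivity(ctx, "Alpha").Get(ctx, nil); err != nil {
		return err
	}

	// GetVersion records which branch was taken in the workflow history,
	// so executions started before the change keep replaying the second
	// Alpha call, and new executions never run it.
	v := workflow.GetVersion(ctx, "remove-second-alpha", workflow.DefaultVersion, 1)
	if v == workflow.DefaultVersion {
		if err := workflow.ExecuteActivity(ctx, "Alpha").Get(ctx, nil); err != nil {
			return err
		}
	}

	return workflow.ExecuteActivity(ctx, "Beta").Get(ctx, nil)
}
```

So for an old execution, replay of a2 produces exactly the recorded result; a2 is not re-executed, and later activity results like a7 are likewise taken from history, not recomputed.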
Q3.4. When a process fails on the Beta step below, and the activity Alpha needs to be rerun because an underlying data fix has happened: how can we restart the workflow to process from the Alpha step below? My understanding is that when Beta throws an exception, the workflow would resume on the Beta step, since the results of the Alpha step have already been processed.
Use workflow reset (tctl v1.17 command reference | Temporal Documentation) to rerun part of the workflow.
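A hypothetical invocation with tctl v1 (the workflow id, run id, and event id are placeholders; you would find the event id of the point just before the Alpha activity in the workflow's history):

```shell
# Roll the execution back to a specific history event; everything after
# that event, including the Alpha activity, re-executes.
tctl workflow reset \
  --workflow_id account-543212 \
  --run_id <run-id> \
  --event_id 12 \
  --reason "data fix applied, rerun from Alpha"
```

After the reset, a new run is created that replays history up to the chosen event and then executes Alpha (and Beta) again against the fixed data.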