We use a self-hosted Temporal server and Python SDK.
For some business processes, we have to start a workflow inside a database transaction to ensure data consistency. For example:
```
begin transaction;
  write to DB;
  write to DB again;
  start workflow (RPC call to the Temporal server);
  write to DB again;
commit transaction;
```
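For concreteness, the problematic shape looks something like this — a minimal sketch with sqlite3 standing in for our real database, and a hypothetical `start_workflow_rpc` placeholder (not the real SDK call) simulating a Temporal outage:

```python
import sqlite3

def start_workflow_rpc(workflow_id: str) -> None:
    """Placeholder for the RPC call that starts a Temporal workflow."""
    raise ConnectionError("Temporal server unreachable")  # simulated outage

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("INSERT INTO orders VALUES ('order-1', 'new')")
        conn.execute("UPDATE orders SET status = 'processing' WHERE id = 'order-1'")
        start_workflow_rpc("order-1")  # the RPC call sits inside the transaction
        conn.execute("UPDATE orders SET status = 'submitted' WHERE id = 'order-1'")
except ConnectionError:
    pass  # transaction rolled back: the DB writes and the workflow start fail together

row_count = conn.execute("SELECT count(*) FROM orders").fetchone()[0]
```

If the RPC fails, the whole transaction rolls back, so this direction is safe; the trouble is the reverse failure (the commit fails after the RPC already succeeded), which leaves an orphaned workflow.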
We considered a few options:
1. Leave the RPC call inside the transaction, but keep it as close to the end as possible. This achieves atomic execution in almost every case, but it doesn't work when more than one workflow has to be started. It also raises the number of "idle in transaction" DB connections, which can be a major problem.
2. Move the RPC call out of the transaction. This can harm consistency (the data in the DB has changed, but the workflow hasn't started).
3. Transactional outbox pattern: save a message in the database in the same transaction, and set up a dedicated process that reads from the DB and calls the Temporal server. I didn't find any existing connector implementation for this; it will definitely work, but it takes time to implement, test, adopt, and maintain.
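For reference, a minimal sketch of that outbox shape, with sqlite3 standing in for the real database and a stubbed idempotent `start_workflow` in place of the Temporal client (all names here are illustrative, not an existing connector):

```python
import sqlite3

started_workflows: set = set()

def start_workflow(workflow_id: str) -> None:
    """Stub for an idempotent workflow start: a duplicate start is a no-op."""
    started_workflows.add(workflow_id)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (workflow_id TEXT PRIMARY KEY,
                         processed INTEGER NOT NULL DEFAULT 0);
""")

# Business transaction: the data change and the "start this workflow"
# message commit (or roll back) together.
with conn:
    conn.execute("INSERT INTO orders VALUES ('order-1', 'new')")
    conn.execute("INSERT INTO outbox (workflow_id) VALUES ('order-1')")

# Dedicated outbox processor: read pending messages, start workflows, mark done.
pending = conn.execute("SELECT workflow_id FROM outbox WHERE processed = 0").fetchall()
for (workflow_id,) in pending:
    start_workflow(workflow_id)  # safe to repeat if we crash before the update below
    with conn:
        conn.execute("UPDATE outbox SET processed = 1 WHERE workflow_id = ?",
                     (workflow_id,))
```

Because the start is idempotent, crashing between `start_workflow` and the `processed = 1` update is harmless: the processor simply retries the message on the next run.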
Is there any recommended way to ensure the atomic workflow start-up process?
PS: We're concerned about `start_workflow` failing not just out of pure curiosity, but because we have a considerable number of long-running requests.
You could consider triggering the workflow first and having it perform the DB writes via activities. While your workflow execution is running, Temporal guarantees its uniqueness (it's not possible to have more than one execution running with the same workflow id in the same namespace at the same time).
I think you may also need to consider client crashes. For example, suppose your client crashes between starting the workflow and committing the transaction. Now you've started the workflow, but that isn't recorded in the database.
With the transactional outbox pattern, you can make the RPC call to start the workflow idempotent (nothing happens if the workflow was already started). Now if the outbox processor crashes between starting the workflow and clearing the outbox message, it's not a problem: the client can retry, and a second call to start the workflow does no harm.
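In the Python SDK, reusing a workflow id for a still-running execution raises (if I recall the name correctly) `WorkflowAlreadyStartedError` from `temporalio.client`. A server-free sketch of the idempotent-start shape, with a fake client standing in for the real one:

```python
class WorkflowAlreadyStartedError(Exception):
    """Stand-in for temporalio.client.WorkflowAlreadyStartedError."""

class FakeTemporal:
    """Stand-in for a Temporal client: at most one running execution per id."""
    def __init__(self) -> None:
        self.running: set = set()

    def start_workflow(self, workflow_id: str) -> None:
        if workflow_id in self.running:
            raise WorkflowAlreadyStartedError(workflow_id)
        self.running.add(workflow_id)

def ensure_workflow_started(client: FakeTemporal, workflow_id: str) -> None:
    """Idempotent start: a duplicate start attempt is a harmless no-op."""
    try:
        client.start_workflow(workflow_id)
    except WorkflowAlreadyStartedError:
        pass  # already running: exactly the state we wanted

client = FakeTemporal()
ensure_workflow_started(client, "order-1")
ensure_workflow_started(client, "order-1")  # e.g. a retry after a crash
```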
With @tihomir's suggestion the workflow could update the database to say "I'm now running". That can also be idempotent (two writes to the database saying "I'm running" leave the database in the same state). With durable workflow execution, the activity that writes "I'm running" to the database can be automatically retried until it succeeds.
With two separate systems like this you can't cause both of them to be updated simultaneously (to start a workflow and record that in a transaction at the same time). What you can do is ensure that state is communicated reliably between systems. With the outbox pattern, you record in your transaction your intent to start the workflow, and then with the outbox processor the workflow will be started eventually even in the presence of crashes. Likewise, if you have the workflow record its state in the database, that database update will be written to the database eventually even if there are crashes.
From the point of view of the database, you can reliably, consistently and atomically record events: “we have made the decision to start the workflow” and “the workflow has started”. In between there will be some period of time where the workflow hasn’t started yet or the workflow has started but that fact hasn’t been recorded in the database yet. What you do know is that after the “we will be starting the workflow” event the workflow will be started, and after the “the workflow has started” event the workflow has started.
Also note that keeping the RPC call inside the transaction will start the workflow even if the database needs to abort and retry the transaction, leaving you in an inconsistent state where the workflow has started but the database doesn't know that happened.
There is, however, another pattern which can be easier than a full-blown transactional outbox.
Consider that your database client may crash. The database will detect that the database connection has closed abruptly and abort the transaction. Since your client might crash, if you want to ensure that your business process runs, you need to retry running your database client if it crashes.
As is generally true for distributed systems, your client might crash after the database successfully applies the transaction but before the client records that the transaction was successful. This means that when you rerun the client, it will need to run the same transaction again if you want to ensure that your business process always runs. The solution is to make your database transaction idempotent: running it once, twice, or many times should leave the database in the same state; that is, running the transaction a second time should have no effect. If your client uses data returned by the transaction, running the same transaction again should return the same data.
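As a sketch of what an idempotent transaction can look like (sqlite3 standing in for the real database; the trick is keying every write on a stable business id and using conflict-free writes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.commit()

def process_order(order_id: str) -> str:
    """Idempotent transaction: keyed on the business id, safe to rerun."""
    with conn:
        conn.execute(
            "INSERT INTO orders VALUES (?, 'processing') "
            "ON CONFLICT (id) DO NOTHING",  # a rerun changes nothing
            (order_id,),
        )
        (status,) = conn.execute(
            "SELECT status FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
    return status  # a rerun also returns the same data

first = process_order("order-1")
second = process_order("order-1")  # rerun after a crash: same state, same result
```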
Once you have this in place, which is a requirement if you want to ensure that your business process always executes in the presence of crashes in a distributed system, you don’t need the outbox pattern to consistently start a workflow. You can, if you want to, but you don’t have to. The client can take the results of the database transaction, after the transaction has completed successfully, and use that to start the workflow. Assuming that you’re using a workflow id tied to the business process, starting a workflow can be an idempotent operation: starting the same workflow when it’s already running can be a no-op.
Now you have a fully reliable, consistent, and durable execution strategy. Even if the client completes the database transaction, starts the workflow, and then crashes, it can rerun, get the same results from the idempotent database transaction, and then ensure the workflow is started (which in this case will do nothing because the workflow has already started).
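Putting the two pieces together, the whole client becomes safe to rerun from the top after any crash — again a sketch, with sqlite3 and an in-memory set standing in for the database and the Temporal client:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.commit()
running_workflows: set = set()  # stand-in for "executions known to Temporal"

def run_client(order_id: str) -> None:
    """Idempotent transaction first, then idempotent workflow start."""
    with conn:
        conn.execute(
            "INSERT INTO orders VALUES (?, 'processing') "
            "ON CONFLICT (id) DO NOTHING",
            (order_id,),
        )
    # Idempotent start keyed on the business id: a duplicate is a no-op.
    running_workflows.add(order_id)

run_client("order-1")  # first attempt: transaction commits, workflow starts
run_client("order-1")  # rerun after a simulated crash: both steps are no-ops
```

No matter where a crash lands between the two steps, rerunning `run_client` converges on the same end state: one committed row and one running workflow.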
Also note that a Temporal workflow can be a convenient way to automatically rerun a process such as our database client. You don't have to use a Temporal workflow to do the database transaction, but you could, and it could save you the effort of building a mechanism for reliably rerunning the client if you don't already have one. The coordination workflow would call an activity that performs the idempotent database transaction, and then start the business-process workflow, again idempotently.
There's still some complexity left: ensuring that your database transaction is idempotent. But you'd need that anyway, even if you weren't using Temporal, to ensure that your business process always runs.