We use a self-hosted Temporal server and Python SDK.
For some business processes, we have to start a workflow inside a database transaction to ensure data consistency. For example:
start a transaction
write in DB;
write in DB again;
start workflow (RPC call to Temporal server);
write in DB again;
commit the transaction.
We considered a few options:
Leave the RPC call in the transaction, but keep it as close to the end as possible. This approach achieves atomic execution in almost every case, but it doesn’t work if more than one workflow has to be started.
Also, it increases the number of “idle in transaction” DB connections, which can be a major problem.
Move the RPC call out of a transaction. It can harm consistency (the data in the DB has been changed, but the workflow hasn’t started).
Transactional outbox pattern: save the message in the database in the same transaction and set up a dedicated process that reads from the DB and writes to the Temporal server. I didn’t find any existing connector implementation for that, so although it will definitely work, it would take time to implement, test, adopt, and maintain.
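To make the write side of the outbox option concrete, here is a minimal sketch using SQLite from the standard library. The table names (orders, workflow_outbox), the ProcessOrder workflow type, and the place_order helper are all hypothetical and chosen only for illustration. The point is that the business write and the "intent to start a workflow" row are committed or rolled back together, with no RPC inside the transaction.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables: one business table and one outbox table.
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE workflow_outbox ("
        " workflow_id TEXT PRIMARY KEY,"
        " workflow_type TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " dispatched INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn: sqlite3.Connection, order_id: str) -> str:
    # A deterministic workflow ID lets a later start call be deduplicated.
    workflow_id = f"order-{order_id}"
    # "with conn" commits if the block succeeds and rolls back on any exception,
    # so both rows are written atomically.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, ?)", (order_id, "new")
        )
        conn.execute(
            "INSERT INTO workflow_outbox (workflow_id, workflow_type, payload)"
            " VALUES (?, ?, ?)",
            (workflow_id, "ProcessOrder", json.dumps({"order_id": order_id})),
        )
    return workflow_id
```

Because the outbox row carries its own workflow ID, several workflows can be requested from one transaction by inserting several rows.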
Is there a recommended way to make starting a workflow atomic with the database writes?
PS: We’re so concerned about “start_workflow” failing not just out of pure curiosity, but because we have a considerable amount of long-running requests.
You could consider triggering the workflow first and having it perform the DB writes via activities. While your workflow execution is running, Temporal guarantees its uniqueness (it is not possible to have more than one execution running with the same workflow ID in the same namespace at the same time).
I think you may also need to consider client crashes. For example, suppose your client crashes between starting the workflow and committing the transaction. Now you’ve started the workflow, but that isn’t recorded in the database.
With the transactional outbox pattern, you can make the RPC call to start the workflow idempotent (nothing happens if the workflow was already started). Now if the outbox processor client crashes between starting the workflow and clearing the outbox message, it’s not a problem: the client can retry, and a second call to start the workflow doesn’t do any harm.
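A minimal sketch of such an outbox processor, assuming a hypothetical workflow_outbox table and a start_workflow callable that stands in for the real RPC. AlreadyStarted is a made-up exception for illustration; how a duplicate start is actually reported depends on the SDK and on the workflow ID reuse policy, so check the SDK documentation before relying on this shape.

```python
import sqlite3
from typing import Callable

# Hypothetical outbox table; names are for illustration only.
OUTBOX_SCHEMA = (
    "CREATE TABLE workflow_outbox ("
    " workflow_id TEXT PRIMARY KEY,"
    " workflow_type TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " dispatched INTEGER NOT NULL DEFAULT 0)"
)


class AlreadyStarted(Exception):
    """Hypothetical stand-in for a 'workflow already started' error."""


def drain_outbox(
    conn: sqlite3.Connection,
    start_workflow: Callable[[str, str, str], None],
) -> int:
    """Start a workflow for every pending outbox row; return how many rows were processed."""
    pending = conn.execute(
        "SELECT workflow_id, workflow_type, payload FROM workflow_outbox"
        " WHERE dispatched = 0 ORDER BY rowid"
    ).fetchall()
    processed = 0
    for workflow_id, workflow_type, payload in pending:
        try:
            # The fixed workflow ID is what makes a repeated call harmless.
            start_workflow(workflow_id, workflow_type, payload)
        except AlreadyStarted:
            # An earlier attempt started it and crashed before marking the row.
            pass
        # If the process dies before this commit, the row stays pending and
        # the next run simply retries the start call.
        with conn:
            conn.execute(
                "UPDATE workflow_outbox SET dispatched = 1 WHERE workflow_id = ?",
                (workflow_id,),
            )
        processed += 1
    return processed
```

The row is marked only after the start call returns or reports a duplicate, so a crash at any point leads to a retry rather than a lost workflow.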
With @tihomir’s suggestion, the workflow could update the database to say “I’m now running”. That can also be idempotent (two writes to the database saying “I’m running” leave the database in the same state). With durable workflow execution, the activity that writes “I’m running” to the database can be automatically retried until it succeeds.
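As an illustration of what the body of such an activity might do, here is a small sketch of an idempotent status write using SQLite from the standard library. The workflow_status table and the mark_workflow_running function are hypothetical, and the Temporal activity wiring is left out on purpose; only the database side is shown.

```python
import sqlite3

# Hypothetical status table keyed by workflow ID.
STATUS_SCHEMA = (
    "CREATE TABLE workflow_status ("
    " workflow_id TEXT PRIMARY KEY,"
    " status TEXT NOT NULL)"
)


def mark_workflow_running(conn: sqlite3.Connection, workflow_id: str) -> None:
    # INSERT OR REPLACE keyed on workflow_id: running this once or many times
    # leaves exactly one row with status 'running', so retries are safe.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO workflow_status (workflow_id, status)"
            " VALUES (?, 'running')",
            (workflow_id,),
        )
```

If later states such as "completed" are stored in the same row, a late retry of this write would overwrite them, so a real schema would probably need a guard against moving backwards.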
With two separate systems like this, you can’t cause both of them to be updated simultaneously (to start a workflow and record that in a transaction at the same time). What you can do is ensure that state is communicated reliably between the systems. With the outbox pattern, you record in your transaction your intent to start the workflow, and the outbox processor then ensures the workflow will eventually be started, even in the presence of crashes. Likewise, if you have the workflow record its state in the database, that update will eventually be written to the database even if there are crashes.
From the point of view of the database, you can reliably, consistently and atomically record events: “we have made the decision to start the workflow” and “the workflow has started”. In between there will be some period of time where the workflow hasn’t started yet or the workflow has started but that fact hasn’t been recorded in the database yet. What you do know is that after the “we will be starting the workflow” event the workflow will be started, and after the “the workflow has started” event the workflow has started.