Our workflow has a sub-process that calls an activity (check delivery status), checks the result, and, if the delivery has not succeeded, waits (sleeps) for 3 hours and repeats. The loop continues until 2 days have passed.
The sleep is implemented using workflow.Sleep(ctx, subProcessDurationLeft). In the Temporal Web UI, the sub-process looks like image 2.
In our production environment, a Cassandra server migration took place, and it affected some running workflow transactions. Some of them encountered Workflow Task Timed Out (image 3).
We gathered the following error messages from the Temporal Server log:
- Corrupted non-contiguous event batch
- Operation failed with internal error.
- encounter data loss event
- unavailable error, data loss
Once the Cassandra server was available (running) again, the transactions ran normally (image 4).
However, when the next sleep timer fired, a non-deterministic error occurred (image 5).
We suspect that the sleep timers conflicted. While the Cassandra server was down, the pending timer could not fire; it only fired after the Cassandra server became available again, by which point it was already time for the next sleep timer to begin.
We tried to reproduce this issue in an NFT scenario by shutting down the Cassandra server and the Temporal Server. However, no non-deterministic error occurred, and the transactions ran normally once the servers were available again.
Is there any way to find the root cause of the non-deterministic error (NDE) in production? And is there any way to prevent this from happening again?
Go module versions:
go.temporal.io/api v1.6.1-0.20211110205628-60c98e9cbfe2
go.temporal.io/sdk v1.13.1