Causes and solutions for DATA_LOSS errors

Version: v1.13.0

After rolling back the database in the production environment, the following two types of errors began to occur frequently on the client side.

  1. io.grpc.StatusRuntimeException: DATA_LOSS: Incomplete history: expected events [1-4] but got events [1-3] of length 3: isFirstPage=true,isLastPage=true,pageSize=256 at io.grpc.stub.ClientCalls.toStatusRuntimeException

  2. io.grpc.StatusRuntimeException: DATA_LOSS: corrupted history event batch, eventID is not contiguous at io.grpc.stub.ClientCalls.toStatusRuntimeException

Why are these errors occurring, and what steps should be taken to resolve them?

I attempted to reproduce the issue in the development environment but was unable to do so.

At the same time, the CPU usage of the database has increased, causing further issues.

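Temporal relies on the database being fully consistent. It looks like the rollback left the database in an inconsistent state: both errors say that the stored event history for a workflow no longer matches what the server expects. At this point, the best option is to recreate the database. If that is not possible, you can try deleting the specific workflows that are broken.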
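To make the two error messages more concrete, here is a minimal sketch of the kind of check they describe: event IDs in a history must increase by exactly 1, and the returned range must cover the expected first and last event. This is not Temporal's actual implementation; the class `HistoryCheck` and its methods are hypothetical names used only for illustration.

```java
import java.util.List;

// Illustrative only: hypothetical helper, not part of the Temporal server or SDK.
final class HistoryCheck {

    // Returns the index of the first event ID that does not follow its
    // predecessor by exactly 1, or -1 if the IDs are contiguous.
    static int firstGap(List<Long> eventIds) {
        for (int i = 1; i < eventIds.size(); i++) {
            if (eventIds.get(i) != eventIds.get(i - 1) + 1) {
                return i;
            }
        }
        return -1;
    }

    // Returns true only if the IDs are contiguous and span exactly
    // [expectedFirst, expectedLast].
    static boolean isComplete(List<Long> eventIds, long expectedFirst, long expectedLast) {
        if (eventIds.isEmpty()) {
            return false;
        }
        return eventIds.get(0) == expectedFirst
                && eventIds.get(eventIds.size() - 1) == expectedLast
                && firstGap(eventIds) == -1;
    }

    public static void main(String[] args) {
        // Mirrors error 1: events [1-4] expected, only [1-3] stored.
        System.out.println(isComplete(List.of(1L, 2L, 3L), 1L, 4L)); // false

        // Mirrors error 2: a batch whose event IDs skip a value.
        System.out.println(firstGap(List.of(1L, 2L, 4L))); // 2
    }
}
```

A rollback that restores some history rows to an older point than the rest of the workflow state would fail a check of this kind, which is why recreating the database or deleting the affected workflows is suggested rather than a client-side fix.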
