Persistent store operation failure

We are seeing the following error in the logs of our history pods:

Persistent store operation failure, failed to lock shard. Previous range ID: 1234123;

Questions

  1. What does this error mean?
  2. We are experiencing slowness in our environment. Could this be the cause?
  3. How do we fix it?

FYI: This seems to have started after we enabled archival. Also, in the temporal-system namespace we see thousands of temporal-archival workflows running.

I disabled archival and Temporal is stable now.

Questions

  • Why did archival create thousands of workflows in the temporal-system namespace?

Persistent store operation failure, failed to lock shard. Previous range ID: 1234123;

This is usually a configuration error. The likely causes are multiple clusters trying to use the same DB, or the same bootstrap hosts (ringpop). And yes, it can be a possible cause of slowness imo.
I would check your server config.
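
As a rough illustration of the "multiple clusters using the same DB" check, here is a minimal Python sketch. It assumes you have already collected each cluster's database hosts and database/keyspace name into a plain dictionary. The keys `db_hosts` and `db_name`, the cluster names, and the host values are all made up for this example and are not the real Temporal server config schema.

```python
from collections import defaultdict

# Hypothetical, simplified view of each cluster's persistence settings.
# These keys and values are illustrative only; they are NOT the real
# Temporal server config schema.
clusters = {
    "cluster-a": {"db_hosts": ["db1:9042"], "db_name": "temporal"},
    "cluster-b": {"db_hosts": ["db1:9042"], "db_name": "temporal"},
    "cluster-c": {"db_hosts": ["db2:9042"], "db_name": "temporal"},
}


def find_shared_stores(clusters):
    """Return {(host, db_name): [cluster names]} for stores used by 2+ clusters."""
    users = defaultdict(set)
    for name, cfg in clusters.items():
        for host in cfg["db_hosts"]:
            # Normalise host case so "DB1:9042" and "db1:9042" compare equal.
            users[(host.lower(), cfg["db_name"])].add(name)
    return {store: sorted(names) for store, names in users.items() if len(names) > 1}


for (host, db), names in find_shared_stores(clusters).items():
    print(f"{host}/{db} is shared by: {', '.join(names)}")
```

Any entry this reports would be worth a closer look in the real server config, since two clusters pointed at the same store is one of the likely causes mentioned above.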

I am not sure about the archival questions. Is there any way you can help me reproduce this?

On further investigation, I found that multiple archival workflows were created to archive a single workflow,
and all of them were running simultaneously.

To reproduce:

  1. Set up S3 archival.
  2. Create some workflows.
  3. Check the archival workflows in the temporal-system namespace.
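
For step 3, once the executions in the temporal-system namespace have been listed by whatever means you prefer, the count could be tallied with something like the sketch below. The record shape (`workflow_id`, `status`), the sample data, and the `temporal-archival` ID prefix are assumptions made for illustration, so adjust them to match what your listing actually returns.

```python
from collections import Counter


def count_archival_by_status(executions, id_prefix="temporal-archival"):
    """Count executions whose workflow ID starts with id_prefix, grouped by status."""
    counts = Counter()
    for ex in executions:
        if ex["workflow_id"].startswith(id_prefix):
            counts[ex["status"]] += 1
    return counts


# Hypothetical sample data standing in for a real listing.
sample = [
    {"workflow_id": "temporal-archival-0", "status": "Running"},
    {"workflow_id": "temporal-archival-1", "status": "Running"},
    {"workflow_id": "temporal-archival-2", "status": "Completed"},
    {"workflow_id": "some-other-system-workflow", "status": "Running"},
]

counts = count_archival_by_status(sample)
print(f"running archival workflows: {counts['Running']}")
```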

I would like to know how many archival workflows are created in the temporal-system namespace when you try.

Also, if you don’t find any anomaly, try setting wrong credentials for the S3 connection and run it again. I wonder whether that would create multiple workflows.

Tihomir,

Could you help me understand a little about the internal workings of archival?

Questions

  1. In an ideal case, how many temporal-archival workflows should be running in the temporal-system namespace?
  2. When should that number increase to more than one?
  3. Is it correct to say that the temporal-archival workflow receives a signal with the data to be archived and then archives it in an activity?
  4. What if the signal that is sent is not received? (I ask because I have seen errors stating that.) Would that result in the creation of a new temporal-archival workflow, assuming one doesn’t exist already?
  5. Is the Temporal archival feature still in the experimental stage?