We have configured 2048 history shards and run 3 replicas of the history service in Kubernetes. These pods consistently use between 2 GB and 3 GB of memory each. Our retention period is 3 days. I am looking for reasons why memory usage in the history service is so high.
I noticed there is a tctl command,

tctl admin db scan

which, when run, produces a lot of information about corrupted workflow executions. My questions are:
- If there are a lot of corrupted workflow executions, could that explain why the history service memory usage is so high, i.e. history hanging around that is not being cleaned up?
- What causes corrupted workflow executions, and is there a guide on how to avoid them with proper configuration?
- Would running the scan and then the db clean resolve any of these issues? What does the clean actually do?
We are using Cassandra as the DB engine.
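For reference, this is roughly how I have been invoking the scan (the output redirect is just my own way of capturing the report for review; I have not run the clean yet, and I am omitting any flags it may require):

```
tctl admin db scan > scan_output.txt
tctl admin db clean
```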
Thanks.