Errors we observed:
Writing large partition temporal_visibility/closed_executions.closed_by_type:SingleItemIngestWorkflow (101.425MiB) to sstable /var/lib/cassandra/data/temporal_visibility/closed_executions-c384497053b711ecb839273e9d7d301f/.closed_by_type/me-167-big-Data.db
Observations as part of this load:
Recently we have observed a lot of performance issues with Cassandra. Our Cassandra cluster is currently configured with 5 nodes and a replication factor of 3. Temporal connects to Cassandra with 3 or 4 nodes of the data center listed as contact points (“XX.XXX.XX.5, XX.XXX.XX.6, XX.XXX.XX.8, XX.XXX.XX.9”).
We are seeing a huge number of prepared statement executions on only two nodes: node 6 at roughly 800 million and node 8 at roughly 1,250 million, while the remaining nodes add up to close to 200 million combined, so the load is not being distributed across all nodes. Node 8 is running out of CPU most of the time.
Why are we seeing so many prepared statement executions (we ran only 2.5 million workflows with an average of 6 actions each)? Why are the prepared statement executions not distributed across all nodes? As per my understanding, data distribution is based on the partition key. In our case the data comes mostly from only 2 namespaces in Temporal. Is this causing the load to be sent to and received by only certain nodes?
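For reference, this is how we understand the partition-key routing. If the client is token-aware, every statement for a given partition key is routed to the replicas that own that key's token, so a couple of hot partitions would keep hitting the same replication-factor-sized subset of nodes. Below is a minimal sketch using the DataStax Python driver just to illustrate the key-to-replica mapping; the contact points, data center name, and key value are placeholders, not our real setup, and it assumes a single-column text partition key:

```python
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Placeholder contact points standing in for the masked node IPs above.
cluster = Cluster(
    contact_points=["10.0.0.5", "10.0.0.6", "10.0.0.8", "10.0.0.9"],
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="datacenter1")  # placeholder DC name
    ),
)
cluster.connect()  # populates cluster.metadata (token map, schema)

# For a single-column text partition key, the routing key is just its UTF-8
# bytes. get_replicas() maps that key's token to the hosts that own it, i.e.
# the subset of nodes every read/write for this key will hit when the client
# routes requests token-aware.
routing_key = "my-busy-namespace".encode("utf-8")  # hypothetical key value
replicas = cluster.metadata.get_replicas("temporal_visibility", routing_key)
for host in replicas:
    print(host.address)

cluster.shutdown()
```

If the partitions belonging to our two busy namespaces hash to tokens owned by nodes 6 and 8, that alone would seem to explain why those two nodes see most of the traffic.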
Also, as of now we are using only Cassandra for the temporal and temporal_visibility keyspaces. If we add Elasticsearch, will the visibility data go to Elasticsearch instead of Cassandra?