Cassandra history_node table keeps growing

hi,

The Cassandra history_node table keeps growing. We have set the namespace retention to 7 days.
We observe this on both server versions 1.9.2 and 1.20.1.

We did check this related thread.

Since we have configured Cassandra as the persistence store, we should not have encountered this issue, right? Or are we missing something?

Appreciate any help.

Output of tctl commands

/etc/temporal $ tctl adm cl d | jq .persistenceStore
"cassandra"
/etc/temporal $ tctl --ns samples-namespace n desc
Name: samples-namespace
Id: 5a369311-26c0-4533-9b1d-8f54d23298b8
Description:
OwnerEmail:
NamespaceData: map[string]string(nil)
State: Registered
Retention: 168h0m0s
ActiveClusterName: active
Clusters: active
HistoryArchivalState: Disabled
IsGlobalNamespace: false
FailoverVersion: 0
FailoverHistory:
VisibilityArchivalState: Disabled
Bad binaries to reset:
+-----------------+----------+------------+--------+
| BINARY CHECKSUM | OPERATOR | START TIME | REASON |
+-----------------+----------+------------+--------+
+-----------------+----------+------------+--------+

@tihomir @maxim any suggestions here?

Did it stop growing after 7 days? Is Cassandra’s compaction configured correctly?

Did it stop growing after 7 days?

No. It has been growing for months now. However, workflows older than 7 days are not visible in the UI.

Is Cassandra’s compaction configured correctly?

We used the schema creation script from the GitHub repo to create the tables.

CREATE TABLE temporal.history_node (
    tree_id uuid,
    branch_id uuid,
    node_id bigint,
    txn_id bigint,
    data blob,
    data_encoding text,
    prev_txn_id bigint,
    PRIMARY KEY (tree_id, branch_id, node_id, txn_id)
) WITH CLUSTERING ORDER BY (branch_id ASC, node_id ASC, txn_id DESC)
    AND additional_write_policy = '99p'
    AND bloom_filter_fp_chance = 0.1
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';

@maxim Is the tree_id the workflow id? How can we ensure that completed workflows that have crossed the configured retention period are no longer in the history_node table? Ideally, completed workflows should be removed from history_node once retention expires, shouldn't they?

Additional info:


Could you check the compaction configuration? See Compaction | Apache Cassandra Documentation for more information.
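For example, the settings that are actually in effect can be verified with a query along these lines (a sketch only, assuming cqlsh access and the temporal keyspace shown in the schema above):

SELECT compaction, gc_grace_seconds, default_time_to_live
FROM system_schema.tables
WHERE keyspace_name = 'temporal' AND table_name = 'history_node';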

Also, you can try forcing compaction manually with nodetool compact.
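For example (assuming nodetool access on a Cassandra node; a major compaction of a table this size can run for a long time and temporarily needs extra disk space):

$ nodetool compact temporal history_node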

From the issue description, it sounds like autocompaction is not enabled or the gc_grace_seconds period is too large.
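If gc_grace_seconds turns out to be the culprit, it can be lowered with something like the statement below. This is only a sketch: the 1-day value is an example, and setting gc_grace_seconds shorter than your repair interval risks resurrecting deleted data, so it should be reviewed by your DBAs first.

ALTER TABLE temporal.history_node WITH gc_grace_seconds = 86400; -- example: 1 day instead of the default 10 days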

Btw, compaction requires ~2x the disk size, so in your case it might require ~6 TB of free disk space; otherwise compaction might fail.

Compaction is configured and running correctly. There is 3x disk space, so no reason for compaction to fail.

How often is the delete statement executed against the history_node table? temporal/history_store.go at 52b03657479941f60592163c0f2284a742d0fc84 · temporalio/temporal · GitHub

Is this part of the scavenger workflow? Is there a known schedule for scavenger workflows? Can it be configured?

What’s your dynamic config value for "worker.historyScannerEnabled"?

We haven’t set this; it is at the default value.
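For reference, a sketch of how this flag could be set explicitly in the server's dynamic config file (assuming the standard dynamic config YAML format):

worker.historyScannerEnabled:
  - value: true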

hi @maxim ,

We found records in the history_node table that do not have corresponding records in the executions view.
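A minimal sketch of the kind of lookup involved (the tree_id below is a hypothetical placeholder; tree_id is the partition key, so the query does not need ALLOW FILTERING, and writetime shows when the rows were written):

SELECT tree_id, branch_id, node_id, writetime(data) AS written_at
FROM temporal.history_node
WHERE tree_id = 00000000-0000-0000-0000-000000000000
LIMIT 10;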


Queries:

  1. Since we cannot find the execution history for this run_id, shouldn't this record be deleted from the history_node table by the scavenger workflow?
  2. None of the other tables has exceeded 50 GB in size, but history_node is ~7 TB (a way to check per-table sizes is sketched right after this list).
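For context, one way to compare per-table on-disk sizes (assuming nodetool access on a Cassandra node):

$ nodetool tablestats temporal | grep -E 'Table:|Space used \(live\)'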

@maxim please help and suggest a course of action.

Did you use the reset command?

No, we did not.

@maxim is this an issue in Cassandra persistence of Temporal?

I"m not aware of such an issue, as we run many clusters that use Cassandra for persistence.

Could this be the reason?

Leveled compaction strategy does a better job of trying to keep data for a partition in a limited range of sstables, but if you wrote data some time ago and it has aged into higher tiers, and then you come along later and do the delete, it can take a good deal of time for the delete to make its way up into the higher leveled tiers. This is the inherent problem with issuing deletes after the fact and expecting to free up disk space. Issuing the initial write and any subsequent update with a TTL reduces the issue, as the tombstone after the TTL has elapsed is the original record as well, so you avoid that nasty issue of having to wait for sstable 1 to be able to compact with sstable 100.

Our DBAs are suggesting we alter the compaction subproperties, especially lowering tombstone_threshold from 0.2 to 0.05 (Compaction subproperties).
@maxim please let us know your thoughts on this.
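For what it's worth, a hedged sketch of both the check and the proposed change. The sstable path is a placeholder (actual file names depend on the Cassandra version and data directory layout), and 0.05 is simply the value suggested above:

$ sstablemetadata /var/lib/cassandra/data/temporal/history_node-*/*-Data.db | grep -i 'droppable tombstones'

ALTER TABLE temporal.history_node
WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'tombstone_threshold': '0.05'
};
-- note: this replaces the entire compaction options map, so re-specify any
-- non-default subproperties you rely on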

Unfortunately, I’m not an expert on Cassandra’s compaction strategies.