History_node keeps growing

I have been playing around with Temporal. I created several workflow executions on a namespace with a 3-day retention period.
Workflow executions seem to be purged from the UI after that period, but history_node keeps growing; I don’t see it being purged. The workflow executions have local and non-local activities and complete in less than 1 minute (no long-running executions or unusual timers).

The activities’ arguments and return values are complex objects (I know this is not recommended and I should change that), but the table size is not stabilizing as expected after 4 days of running with stable traffic.

I tried to build a query over this table to understand whether the purge is happening or not, but I failed.
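
For reference, this is roughly the kind of check I had in mind (a sketch only, assuming a PostgreSQL persistence store and the default Temporal schema, where history events live in the history_node table):

-- Total on-disk size of history_node, including indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('history_node')) AS total_size;

-- Live vs. dead tuples plus last (auto)vacuum times, to see whether rows
-- are being deleted and whether VACUUM is reclaiming the space
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'history_node';

Sampling those numbers over a few days should show whether rows are actually removed after retention or the table only ever grows.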

Is there any metric / log that I can check to see if history_node is being purged correctly?

Thanks in advance!

What persistence do you have configured?
tctl adm cl d | jq .persistenceStore

What is your namespace retention set to?
tctl --ns <namespace_name> n desc

The Temporal worker service does have a system workflow that removes execution info for closed workflows that have reached namespace retention; however, it’s currently only enabled when using Cassandra persistence. Will keep checking for SQL and report back.

We have a SQL datastore configured; namespace retention was 3 days, but I changed it to 1 day yesterday.

The Temporal worker service does have a system workflow that removes execution info for closed workflows that have reached namespace retention; however, it’s currently only enabled when using Cassandra persistence.

So on a SQL datastore, Workflow Execution data remains forever? That’s a huge problem.

Working with server team to clarify this and will get back to you.


The version of Temporal we are using is 1.16.2.

I did a quick search of the Temporal code and found temporal/scavenger.go at master · temporalio/temporal · GitHub, which looks like a cleanup task for history events. I’m not a good Go dev, but the comment here sheds some light: temporal/scavenger.go at master · temporalio/temporal · GitHub

I was able to get more info on this.
Yes, “garbage history” cleanup does not currently happen for SQL like it does for Cassandra. This is something that will be addressed in future releases.
I was also told, however, that the footprint of these “garbage” records that are not removed should be very small (they should contain only the first event of the execution history), so it should not cause significant issues in almost all cases.

Is there any way I can track those records manually?
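
For example, if node_id = 1 corresponds to the first event batch of a branch (that is my assumption, not something I have confirmed), would a rough count of branches that contain nothing beyond that first node look something like this sketch?

-- Rough count of history branches that only contain the first node
-- (assumption: node_id = 1 is the first event batch of a branch)
SELECT COUNT(*) AS single_node_branches
FROM (
    SELECT shard_id, tree_id, branch_id
    FROM history_node
    GROUP BY shard_id, tree_id, branch_id
    HAVING MAX(node_id) = 1
) b;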

I’m interested in any updates to this conversation. We use MySQL for persistence and have a history_node table that’s 1.6 TB and growing steadily.

Thanks for reporting, which server version are you running?

This cluster is on 1.8. The size of the history_node table is one reason we haven’t upgraded past that version.

We made some changes to the workflow to avoid storing data not related to the workflow itself, which reduced the event data by about 25% on average. By doing so, I also got rid of the blob-size warning in the logs.

On the other hand, I drastically reduced the namespace retention period.

Finally, after applying the above changes, we ended up with 50% dead tuples on that table, so we changed some configuration (I don’t remember exactly which right now) around the VACUUM strategy for that table and its associated TOAST table, roughly along the lines of the sketch below.
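
These are per-table autovacuum settings with placeholder values, not the exact configuration we run:

-- Vacuum history_node more aggressively than the global defaults
ALTER TABLE history_node SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.02
);

-- The same knob can be set for the associated TOAST table via the toast.* prefix
ALTER TABLE history_node SET (
    toast.autovacuum_vacuum_scale_factor = 0.02
);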

The table does not seem to be growing so far, VACUUM seems to be triggering normally, and the percentage of dead tuples seems under control, but I am monitoring it closely.

The team opened an issue here for this to be addressed for SQL persistence.

Which version should I upgrade to if I need this scavenger feature? Can I use tctl adm db clean to clean up the DB without upgrading the Temporal version (we are at 1.17.0 with the MySQL engine)?

@tihomir Or is there any manual cleanup SQL I could use to clean up the history_node table? Then I could upgrade the Temporal cluster later.