History service memory usage

ravikiran · August 3, 2022, 5:07pm

Hi,
We are seeing a gradual increase in memory usage on the History nodes over the last few days (from 15% to 55%).

The following errors appear continuously in the logs:

{"level":"error","ts":"2022-07-26T07:08:01.073Z","msg":"Unable to get namespace","service":"history","shard-id":544,"address":"11.16.168.138:7234","shard-item":"0xc00229d380","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","error":"namespace ID: b6957493-3e3f-4c8b-9e0d-bba9d46c5173 not found","logging-call-at":"taskProcessor.go:347","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).getNamespaceTagByID\n\t/temporal/service/history/taskProcessor.go:347\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:259\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:211\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
{"level":"warn","ts":"2022-07-26T07:08:01.547Z","msg":"Cannot find namespace, default to process task.","service":"history","shard-id":180,"address":"11.16.168.138:7234","shard-item":"0xc001538c80","wf-namespace-id":"b6957493-3e3f-4c8b-9e0d-bba9d46c5173","value":"&TimerTaskInfo{NamespaceId:b6957493-3e3f-4c8b-9e0d-bba9d46c5173,WorkflowId:667D5919A2CC4821ACA42129F57A0921@AXkBAwA-ZXY220630A01-4-CC-28,RunId:c80b26a8-5adf-4182-b3a3-178c19b3f3d8,TaskType:DeleteHistoryEvent,TimeoutType:Unspecified,WorkflowBackoffType:Unspecified,Version:0,ScheduleAttempt:1,EventId:0,TaskId:23069335,VisibilityTime:2022-07-26 07:08:01.541051507 +0000 UTC,}","logging-call-at":"taskAllocator.go:80"}

We are concerned about the memory usage. Please let us know if there is any action to be taken or any suggestions.

ravikiran · August 4, 2022, 6:23am

temporal server 1.9.2
@maxim @tihomir please suggest.

tihomir · August 8, 2022, 6:31pm

Hi, sorry for the late reply on this.

Have you made any changes to the config for the history cache sizes? Specifically these settings:

history.cacheInitialSize (default 128)
history.cacheMaxSize (default 512)
history.eventsCacheInitialSize (default 128)
history.eventsCacheMaxSize (default 512)

These are per-shard settings. How many history shards have you defined for the cluster?

Cached items should be cleared once the max size limit is reached. There is no periodic cache cleaning.

Regarding the error, are you getting any resource exhausted errors? If you have server metrics enabled, you could check with:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)

tihomir · August 8, 2022, 6:35pm

temporal server 1.9.2

I would suggest upgrading the server version; the latest is 1.17.2.

The upgrade might not fix the "Unable to get namespace" error, though. Could you share any other related logs you find? Is there anything in the matching or frontend logs?

ravikiran · August 9, 2022, 3:50am

@tihomir will a restart of the history nodes clear the cache?
I will get the configured shard settings.

ravikiran · August 10, 2022, 2:06pm

Number of shards: 4096
history nodes: 6
frontend nodes: 3
matching nodes: 3
admintool nodes: 1
web nodes: 2
worker nodes: 2

No, we have not changed the cache settings.

All of the configs below are at their defaults:
history.cacheInitialSize
history.cacheMaxSize
history.eventsCacheInitialSize
history.eventsCacheMaxSize

We have enabled server metrics, but there are no values for the metric "service_errors_resource_exhausted".

We restarted the entire cluster 2 days ago because it went beyond 75% memory usage. Since the restart, memory usage has again been climbing steadily and is now at 20%.

tihomir · August 10, 2022, 5:12pm

because it went beyond 75% memory usage. Since the restart, memory usage has again been climbing steadily and is now at 20%.

Can you share the actual memory usage (not as a percentage)? A percentage is not the best indicator of possible issues. Thanks.

ravikiran · August 10, 2022, 6:33pm

History nodes are configured with 12 GB of memory.
Currently consuming: 2.1 GB (~20%)
Before restart: 8.1 GB (~68%). Apologies for exaggerating this number in my previous reply.
[Screenshot: Grafana memory usage for the last 7 days; the break shows the restart]

tihomir · August 11, 2022, 7:59pm

Have you had a chance to take a heap profile of your history service? If so, can you share it?

For example:
go tool pprof -pdf http://localhost:7936/debug/pprof/heap
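
If the Go toolchain is not available on the host, one possible workaround is to download the raw profile over HTTP and analyze it on another machine. The sketch below assumes the pprof endpoint from the command above (port 7936) is enabled and reachable; the function name and output file name are illustrative only.

```python
import urllib.request
from pathlib import Path


def save_heap_profile(base_url: str, out_path: str, timeout: float = 30.0) -> int:
    """Download the raw heap profile and write it to out_path.

    base_url is assumed to be the pprof address used in the thread,
    e.g. "http://localhost:7936". Returns the number of bytes written.
    """
    url = base_url.rstrip("/") + "/debug/pprof/heap"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    Path(out_path).write_bytes(data)
    return len(data)
```

Calling save_heap_profile("http://localhost:7936", "heap.pb.gz") on the history host should produce a file that can be copied to a machine with Go installed and opened there with go tool pprof -pdf heap.pb.gz.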

ravikiran · August 19, 2022, 10:29am

We are unable to take a heap profile because we don’t have the required libraries installed, and we cannot install them due to environment restrictions.

Is there any other way to get the heap profile?

Are there any suspects/reasons/scenarios in which such memory increase is seen?
Current usage has grown to 6.5 GB out of 12 GB.
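
As a rough back-of-the-envelope check on the figures reported in this thread, the growth rate can be estimated from the two readings after the restart (2.1 GB and 6.5 GB). The sketch assumes linear growth and treats the post timestamps as the measurement times, neither of which is confirmed; the helper names are illustrative.

```python
from datetime import datetime


def growth_rate_gb_per_day(t0, gb0, t1, gb1):
    """Average growth in GB per day between two readings."""
    days = (t1 - t0).total_seconds() / 86400
    return (gb1 - gb0) / days


def days_until(limit_gb, current_gb, rate_gb_per_day):
    """Days until limit_gb is reached at a constant growth rate."""
    return (limit_gb - current_gb) / rate_gb_per_day


# Readings taken from the posts above (assumed measurement times).
t0 = datetime(2022, 8, 10, 18, 33)   # 2.1 GB
t1 = datetime(2022, 8, 19, 10, 29)   # 6.5 GB
rate = growth_rate_gb_per_day(t0, 2.1, t1, 6.5)

print(f"growth: {rate:.2f} GB/day")
print(f"days until 12 GB: {days_until(12.0, 6.5, rate):.1f}")
```

Under these assumptions the growth is roughly 0.5 GB per day, which would reach the 12 GB limit in about 11 more days if nothing levels off.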

ravikiran · August 23, 2022, 4:54pm

@tihomir Please suggest any course of action. We had to restart the cluster again as it reached 65% memory usage.

tihomir · August 23, 2022, 5:38pm

I think a heap profile would help us look at the specifics, if that is possible.
In the interim, do you set history.cacheMaxSize in your dynamic config? If so, to what value? Lowering it could reduce the memory used by the history service, but note that it could affect the performance of your workflow executions, since fewer items would be cached and more would need to be looked up from the database.

ravikiran · August 24, 2022, 5:46pm

history.cacheMaxSize is at its default value of 512:

{"level":"info","ts":"2022-08-22T13:33:21.440Z","msg":"Get dynamic config","name":"history.cacheMaxSize","value":512,"default-value":512,"logging-call-at":"config.go:79"}

tihomir · August 24, 2022, 6:54pm

I would try lowering it. The default can indeed be too high, depending on how much memory you allocate to a single history pod. Try half (256) and see if it makes a difference, but watch your service latencies.

ravikiran · September 1, 2022, 4:10am

Since reducing history.cacheMaxSize might affect service latency, do you think increasing (doubling) the pod memory is a good idea?

Current
pod mem: 12 GB
history.cacheMaxSize: 512

Proposal
pod mem: 24 GB
history.cacheMaxSize: 512

ravikiran · September 7, 2022, 9:34am

@tihomir Can you please share your thoughts on the above comments?

tihomir · September 7, 2022, 1:10pm

I think this will depend on your application load (e.g. number of workflow executions over time) and the number of history hosts you set up. I would try the mentioned config (24 GB of pod memory per history node) and load test to make sure the consumed memory stays within an acceptable range.

JerryHu · October 26, 2022, 9:45pm

Hi @tihomir,

So by default, each shard will occupy 1.5 GB of memory for caching? If the total number of shards is large, like 10k, could a history service pod hit OOM before the shard cache is cleared?

Thanks,
Jerry

tihomir · February 22, 2023, 5:05am

The per-shard cache dynamic configs:

history.cacheInitialSize (default 128)
history.cacheMaxSize (default 512)
history.eventsCacheInitialSize (default 128)
history.eventsCacheMaxSize (default 512)

specify the number of items / events cached, not a memory size, so the memory used can differ depending on how many executions a shard is hosting. If you have 10K shards, you would typically want at least 10 history hosts (so about 1K shards per host).
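
To make the per-shard arithmetic concrete, here is a minimal sketch of the upper bound on cached items per history host. It assumes shards are spread evenly across hosts and counts items only; it says nothing about bytes, since the size of each cached item depends on the workload. The function name is illustrative.

```python
import math


def max_cached_items_per_host(total_shards, history_hosts, cache_max_size=512):
    """Return (shards per host, upper bound on cached items per host).

    Assumes an even shard distribution and that cache_max_size applies
    per shard, as described above. The bound applies to one cache; the
    events cache has its own separate limit.
    """
    shards_per_host = math.ceil(total_shards / history_hosts)
    return shards_per_host, shards_per_host * cache_max_size


# 10K shards on 10 hosts, as in the reply above.
print(max_cached_items_per_host(10_000, 10))  # (1000, 512000)

# The cluster described earlier in this thread: 4096 shards on 6 hosts.
print(max_cached_items_per_host(4096, 6))     # (683, 349696)
```

Halving history.cacheMaxSize to 256 halves this bound, which is why lowering it can reduce memory use at the cost of more database lookups.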
