History service memory usage

Hi,
We are seeing a gradual increase in memory usage of the History nodes over the last few days (from 15% to 55%).

The following errors are seen continuously in the logs:

{"level":"error","ts":"2022-07-26T07:08:01.073Z","msg":"Unable to get namespace","service":"history","shard-id":544,"address":"11.16.168.138:7234","shard-item":"0xc00229d380","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","error":"namespace ID: b6957493-3e3f-4c8b-9e0d-bba9d46c5173 not found","logging-call-at":"taskProcessor.go:347","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).getNamespaceTagByID\n\t/temporal/service/history/taskProcessor.go:347\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:259\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:211\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
{"level":"warn","ts":"2022-07-26T07:08:01.547Z","msg":"Cannot find namespace, default to process task.","service":"history","shard-id":180,"address":"11.16.168.138:7234","shard-item":"0xc001538c80","wf-namespace-id":"b6957493-3e3f-4c8b-9e0d-bba9d46c5173","value":"&TimerTaskInfo{NamespaceId:b6957493-3e3f-4c8b-9e0d-bba9d46c5173,WorkflowId:667D5919A2CC4821ACA42129F57A0921@AXkBAwA-ZXY220630A01-4-CC-28,RunId:c80b26a8-5adf-4182-b3a3-178c19b3f3d8,TaskType:DeleteHistoryEvent,TimeoutType:Unspecified,WorkflowBackoffType:Unspecified,Version:0,ScheduleAttempt:1,EventId:0,TaskId:23069335,VisibilityTime:2022-07-26 07:08:01.541051507 +0000 UTC,}","logging-call-at":"taskAllocator.go:80"}

We are concerned about the memory usage. Please let us know if any action needs to be taken, or if you have any suggestions.

temporal server 1.9.2
@maxim @tihomir please suggest.

Hi, sorry for the late reply on this.

Did you make any changes in your config regarding the history cache sizes? Specifically these settings:

history.cacheInitialSize (default 128)
history.cacheMaxSize (default 512)
history.eventsCacheInitialSize (default 128)
history.eventsCacheMaxSize (default 512)

These are per-shard configurations; how many history shards did you define for the cluster?

Cache items should be evicted when the max size limit is reached. There is no periodic cache cleaning.
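One way to confirm the values the history service actually picked up (a sketch, assuming a Kubernetes deployment; the pod name is a placeholder) is to grep its logs for the "Get dynamic config" entries:

# Hypothetical pod name; adjust the selector/namespace to your deployment.
kubectl logs <history-pod> | grep 'Get dynamic config' | grep -i 'cache'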

Regarding the error, are you getting resource-exhausted errors? If you have server metrics enabled, maybe check:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)

temporal server 1.9.2

Would suggest upgrading the server version; the latest is 1.17.2.

The upgrade might not fix the "Unable to get namespace" error though. Could you share any other related logs that you might find, anything in matching/frontend?
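For example, grepping the other services' logs for the same namespace ID might surface related errors (a sketch, assuming Kubernetes; the pod names are placeholders):

# The namespace ID below is the one reported in your history logs.
kubectl logs <frontend-pod> | grep 'b6957493-3e3f-4c8b-9e0d-bba9d46c5173'
kubectl logs <matching-pod> | grep 'b6957493-3e3f-4c8b-9e0d-bba9d46c5173'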

@tihomir will a restart of the history nodes clear the cache?
We will get the shard configs and share them.

Number of shards: 4096
history nodes: 6
frontend nodes: 3
matching nodes: 3
admintool nodes: 1
web nodes: 2
worker nodes: 2

No

All of the below configs are at their default values:
history.cacheInitialSize
history.cacheMaxSize
history.eventsCacheInitialSize
history.eventsCacheMaxSize

We have enabled server metrics, but there are no values for this metric: "service_errors_resource_exhausted".

We restarted the entire cluster 2 days ago as it went beyond 75% mem usage. Since the restart, the memory has again been climbing steadily and is now at 20%.

as it went beyond 75% mem usage. Since the restart, the memory has again been climbing steadily and is now at 20%.

Can you share the actual mem use (not in %)? Memory usage by % is not the best indication of possible issues. Thanks.

History nodes are configured with 12GB memory.
Currently consuming 2.1 GB (~20%)
Before the restart: 8.1 GB (~68%). Apologies for exaggerating this number in the previous reply :sweat_smile:
Screenshot of Grafana for the last 7 days (the break shows the restart).

Have you had the chance to do a heap profile of your history service? If so, can you share it?

For example:
go tool pprof -pdf http://localhost:7936/debug/pprof/heap
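If the Go toolchain can't be installed on the history hosts themselves, another option (a sketch, assuming the pprof endpoint on port 7936 is reachable from a machine that does have Go, plus graphviz for the PDF output) is to download the raw profile and analyze it offline:

# Grab the raw heap profile from a history host.
curl -s -o history-heap.pb.gz http://<history-host>:7936/debug/pprof/heap
# Then, on any machine with the Go toolchain installed:
go tool pprof -top history-heap.pb.gz
go tool pprof -pdf history-heap.pb.gz > history-heap.pdf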

Unable to do a heap profile as we don’t have the required libs installed. We cannot get them installed due to environment restrictions.

Is there any other way to get the heap profile?

Are there any suspects/reasons/scenarios in which such a memory increase is seen?
Current usage has grown to 6.5 GB out of 12 GB.

@tihomir Please suggest a course of action. We had to restart the cluster again as it reached 65% memory usage.

I think a heap profile would help here, to be able to look at specifics, if possible.
In the interim, do you set the history.cacheMaxSize config in your dynamic config? If so, to what value? Maybe lowering it could reduce the memory used by the history service, but note it could have an effect on the performance of your wf executions (as you would have a smaller number of cached items, so more lookups would have to go to the db).

history.cacheMaxSize is at the default value of 512:

{"level":"info","ts":"2022-08-22T13:33:21.440Z","msg":"Get dynamic config","name":"history.cacheMaxSize","value":512,"default-value":512,"logging-call-at":"config.go:79"}

Would try lowering it. The default can indeed be too high depending on how much mem you allocate to a single history pod. Try half (256) and see if it makes a difference, but watch your service latencies.
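For reference, a file-based dynamic config override for this might look roughly like the following (a sketch; the file path is a placeholder and depends on how your cluster loads dynamic config):

# Append an override (or edit the key in place if it already exists) to the
# dynamic config file your cluster reads; the path below is hypothetical.
cat >> /etc/temporal/dynamicconfig/production.yaml <<'EOF'
history.cacheMaxSize:
  - value: 256
    constraints: {}
EOF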

Since reducing the history.cacheMaxSize might affect the service latency, do you think increasing (doubling) the pod memory is a good idea?

Current pod mem: 12 GB
history.cacheMaxSize: 512

Proposal:
pod mem: 24 GB
history.cacheMaxSize: 512

@tihomir Can you please share your thoughts on the above comments?

I think this will depend on your application load (e.g. the number of workflow executions per unit of time) and the number of history hosts you set up. Would try the mentioned config (24 GB pod mem per history node) and load test to make sure the consumed mem is within acceptable ranges.

Hi @tihomir,

So by default, each shard will occupy 1.5G of memory for cache purposes? If the total shard number is big, like 10k, then the history service pod could hit OOM before the shard cache is cleared?

Thanks,
Jerry

Per shard cache dynamic configs:

history.cacheInitialSize (default 128)
history.cacheMaxSize (default 512)
history.eventsCacheInitialSize (default 128)
history.eventsCacheMaxSize (default 512)

are the number of items/events cached, not a size in bytes, and the actual number cached can differ depending on how many executions a shard is hosting. If you have 10K shards, typically you would want at least 10 history hosts (so about 1K shards per host).