History service memory usage

Hi,
We are seeing a gradual increase in the memory usage of our History nodes over the last few days (from 15% to 55%).

The following errors appear continuously in the logs:

{"level":"error","ts":"2022-07-26T07:08:01.073Z","msg":"Unable to get namespace","service":"history","shard-id":544,"address":"11.16.168.138:7234","shard-item":"0xc00229d380","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","error":"namespace ID: b6957493-3e3f-4c8b-9e0d-bba9d46c5173 not found","logging-call-at":"taskProcessor.go:347","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).getNamespaceTagByID\n\t/temporal/service/history/taskProcessor.go:347\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:259\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:211\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
{"level":"warn","ts":"2022-07-26T07:08:01.547Z","msg":"Cannot find namespace, default to process task.","service":"history","shard-id":180,"address":"11.16.168.138:7234","shard-item":"0xc001538c80","wf-namespace-id":"b6957493-3e3f-4c8b-9e0d-bba9d46c5173","value":"&TimerTaskInfo{NamespaceId:b6957493-3e3f-4c8b-9e0d-bba9d46c5173,WorkflowId:667D5919A2CC4821ACA42129F57A0921@AXkBAwA-ZXY220630A01-4-CC-28,RunId:c80b26a8-5adf-4182-b3a3-178c19b3f3d8,TaskType:DeleteHistoryEvent,TimeoutType:Unspecified,WorkflowBackoffType:Unspecified,Version:0,ScheduleAttempt:1,EventId:0,TaskId:23069335,VisibilityTime:2022-07-26 07:08:01.541051507 +0000 UTC,}","logging-call-at":"taskAllocator.go:80"}

We are concerned about the memory usage. Please let us know if any action needs to be taken, or if you have any suggestions.

Temporal server version: 1.9.2
@maxim @tihomir please advise.

Hi, sorry for the late reply on this.

Have you made any changes in config regarding the history cache sizes? Specifically these settings:

history.cacheInitialSize (default 128)
history.cacheMaxSize (default 512)
history.eventsCacheInitialSize (default 128)
history.eventsCacheMaxSize (default 512)
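
If you are not sure, something like this can quickly show whether any of them are overridden in your dynamic config (the file path here is an assumption; adjust it to wherever your deployment keeps the dynamic config file):

grep -A2 -E 'history\.(cache|eventsCache)(Initial|Max)Size' /etc/temporal/config/dynamicconfig.yaml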

These are per-shard configurations. How many history shards do you have defined for the cluster?

Cache items should be evicted once the configured max size is reached; there is no periodic cache cleanup.

Regarding the error: are you getting resource-exhausted errors? If you have server metrics enabled, check for example:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)

Temporal server version: 1.9.2

I would suggest upgrading the server version; the latest release is 1.17.2.

The upgrade might not fix the "Unable to get namespace" error though. Could you share any other related logs you find, anything in matching/frontend?

@tihomir will a restart of the history nodes clear the cache?
We will check the configured shard and cache values and share them.

Number of shards: 4096
History nodes: 6
Frontend nodes: 3
Matching nodes: 3
Admin tool nodes: 1
Web nodes: 2
Worker nodes: 2

No.

All of the configs below are at their defaults:
history.cacheInitialSize
history.cacheMaxSize
history.eventsCacheInitialSize
history.eventsCacheMaxSize

We have server metrics enabled, but there are no values for the metric "service_errors_resource_exhausted".

We restarted the entire cluster 2 days ago as memory usage went beyond 75%. Since the restart, memory usage has again been climbing steadily and is now at 20%.

as memory usage went beyond 75%. Since the restart, memory usage has again been climbing steadily and is now at 20%.

Can you share the actual memory use (not as a percentage)? Memory usage by percentage is not the best indicator of possible issues. Thanks.
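
For example, on one of the history hosts something like this shows the resident set size of the server process (the process name temporal-server is an assumption; adjust it for your deployment, or use kubectl top pod if you run on Kubernetes):

# RSS column is reported in kilobytes
ps -C temporal-server -o pid,rss,comm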

History nodes are configured with 12 GB of memory.
Currently consuming 2.1 GB (~20%).
Before the restart it was 8.1 GB (~68%). Apologies for exaggerating this number in the previous reply :sweat_smile:
Screenshot of Grafana for the last 7 days (the break shows the restart):

Have you had a chance to take a heap profile of your history service? If so, can you share it?

For example:
go tool pprof -pdf http://localhost:7936/debug/pprof/heap
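
If installing Go/pprof on the history hosts themselves is not possible, one option (a sketch, assuming the pprof HTTP endpoint on port 7936 is reachable and curl is available) is to download the raw profile and analyze it on another machine:

curl -s http://localhost:7936/debug/pprof/heap -o history_heap.pprof
go tool pprof -top history_heap.pprof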

We are unable to take a heap profile as we don't have the required tooling installed, and we cannot get it installed due to environment restrictions.

Is there any other way to get the heap profile?

Are there any known suspects/reasons/scenarios in which such a memory increase is seen?
Current usage has grown to 6.5 GB out of 12 GB.