Hi,
We are seeing a gradual increase in the memory usage of our History nodes over the last few days (from 15% to 55%). The following errors appear continuously in the logs:
{"level":"error","ts":"2022-07-26T07:08:01.073Z","msg":"Unable to get namespace","service":"history","shard-id":544,"address":"11.16.168.138:7234","shard-item":"0xc00229d380","component":"timer-queue-processor","cluster-name":"active","component":"timer-queue-processor","error":"namespace ID: b6957493-3e3f-4c8b-9e0d-bba9d46c5173 not found","logging-call-at":"taskProcessor.go:347","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:136\ngo.temporal.io/server/service/history.(*taskProcessor).getNamespaceTagByID\n\t/temporal/service/history/taskProcessor.go:347\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskOnce\n\t/temporal/service/history/taskProcessor.go:259\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck.func1\n\t/temporal/service/history/taskProcessor.go:211\ngo.temporal.io/server/common/backoff.Retry\n\t/temporal/common/backoff/retry.go:103\ngo.temporal.io/server/service/history.(*taskProcessor).processTaskAndAck\n\t/temporal/service/history/taskProcessor.go:238\ngo.temporal.io/server/service/history.(*taskProcessor).taskWorker\n\t/temporal/service/history/taskProcessor.go:161"}
{"level":"warn","ts":"2022-07-26T07:08:01.547Z","msg":"Cannot find namespace, default to process task.","service":"history","shard-id":180,"address":"11.16.168.138:7234","shard-item":"0xc001538c80","wf-namespace-id":"b6957493-3e3f-4c8b-9e0d-bba9d46c5173","value":"&TimerTaskInfo{NamespaceId:b6957493-3e3f-4c8b-9e0d-bba9d46c5173,WorkflowId:667D5919A2CC4821ACA42129F57A0921@AXkBAwA-ZXY220630A01-4-CC-28,RunId:c80b26a8-5adf-4182-b3a3-178c19b3f3d8,TaskType:DeleteHistoryEvent,TimeoutType:Unspecified,WorkflowBackoffType:Unspecified,Version:0,ScheduleAttempt:1,EventId:0,TaskId:23069335,VisibilityTime:2022-07-26 07:08:01.541051507 +0000 UTC,}","logging-call-at":"taskAllocator.go:80"}
We are concerned about this memory growth. Please let us know if there is any action we should take, or if you have any suggestions.
Number of shards: 4096
History nodes: 6
Frontend nodes: 3
Matching nodes: 3
Admintool nodes: 1
Web nodes: 2
Worker nodes: 2
No, all of the configs below are at their defaults:
history.cacheInitialSize
history.cacheMaxSize
history.eventsCacheInitialSize
history.eventsCacheMaxSize
We have enabled server metrics, but there are no values for the metric "service_errors_resource_exhausted".
We restarted the entire cluster 2 days ago because memory usage had gone beyond 75%. Since the restart, memory usage has again been climbing steadily and is now at 20%.
History nodes are configured with 12 GB of memory.
Currently consuming 2.1 GB (~20%).
Before the restart: 8.1 GB (~68%). Apologies for exaggerating this number in my previous reply.
Screenshot of Grafana for the last 7 days (the break shows the restart):
I think a heap profile would help here to look at specifics, if you are able to capture one.
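In case it is useful, here is a minimal sketch of capturing one, assuming you expose the server's built-in pprof endpoint via the static server config (the port below is just an example):

```yaml
# Static server config (not dynamic config): expose the pprof HTTP endpoint.
global:
  pprof:
    port: 7936  # example port; pick any free port on the history pods
```

With that in place, `go tool pprof http://<history-pod>:7936/debug/pprof/heap` (or a `curl` of the same URL to save a snapshot) should pull a heap profile from a history pod.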
In the interim, do you set the history.cacheMaxSize config in your dynamic config? If so, to what value? Lowering it could reduce the memory used by the history service, but note it could affect the performance of your workflow executions (with fewer cached items, more lookups have to go to the db).
I would try lowering it. The default can indeed be too high depending on how much memory you allocate to a single history pod. Try half and see if it makes a difference, but watch your service latencies.
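For reference, a minimal dynamic-config sketch of what "half" could look like; the value 256 assumes a default of 512 items per shard, which is worth verifying for your server version:

```yaml
# Dynamic config file: halve the per-shard history cache size.
history.cacheMaxSize:
  - value: 256       # assumes the default is 512 items; verify for your version
    constraints: {}  # no constraints, so this applies cluster-wide
```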
I think this will depend on your application load (e.g., the number of workflow executions per unit of time) and the number of history hosts you set up. I would try the mentioned config (24 GB pod memory per history node) and load test to make sure the consumed memory stays within acceptable ranges.
So by default, each shard will occupy 1.5 GB of memory for cache purposes? If the total shard number is big, like 10K, could a history service pod hit OOM before the shard cache is cleared?
The cache size is a number of cached items / events, not bytes, so the actual memory used can differ depending on how many executions a shard is hosting. If you have 10K shards, you would typically want at least 10 history hosts (so about 1K shards per host).
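As a rough worked example with the numbers from this thread (and assuming a default history.cacheMaxSize of 512 items per shard, which should be verified for your version): 4096 shards over 6 history hosts is roughly 683 shards per host, so the worst case is about 683 × 512 ≈ 350,000 cached entries per host; how many bytes that translates to depends entirely on the size of each workflow's cached state.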