Recommendation for K8S Cluster; currently using default values

sebastian_violet · October 24, 2023, 9:52pm

Hello Team,

Our team has deployed Temporal using the Helm charts provided here: GitHub - temporalio/helm-charts: Temporal Helm charts

Recently, we got the following error when trying to create a new schedule:

StatusRuntimeException: UNAVAILABLE: Not enough hosts to serve the request

After doing a bit of research, it turns out that the temporaltest-history pods were being OOM-killed (exit code 137) and restarting over and over again.

The deployment is on AWS EKS, and the pods have the minimum amount of resources for memory and CPU: 0.25 units of CPU and 0.5 GiB of memory.

As we are just starting to use Temporal, our avg/peak WPS is < 10 WPS.

Another topic has the following recommendations:

Frontend: …Start with 3 instances of 4 cores and 4GB memory.
History: …Start with 5 history instances with 8 cores and 8 GB memory.
Matching: …Start with 3 matching instances each with 4 cores and 4 GB memory.
Worker: …You can just start with 2 instances each with 4 cores and 4 GB memory.

However, these values seem a bit high for our current operations scale.

Can anyone give recommendations based on throughput?

sebastian_violet · October 24, 2023, 9:53pm

@samar for input.
Thank You

samar · October 24, 2023, 11:26pm

@sebastian_violet can you share the following information for your cluster?

Temporal Server Version
Number of Shards
Config: Max History Cache Size (“history.cacheMaxSize”)
Config: Events Cache Size (“history.eventsCacheMaxSizeBytes”)

sebastian_violet · October 25, 2023, 4:31pm

Here are the values:

Temporal Server Version

server.image.repository=temporalio/server
server.image.tag=1.22.1

Number of Shards

server.config.numHistoryShards=512

Config: Max History Cache Size (“history.cacheMaxSize”)

I assume the default (128) based on this: History service memory usage - #4 by tihomir
I don’t see a place to set it in the Helm chart.
- I assume we can set it here: /etc/temporal/dynamic_config

Config: Events Cache Size (“history.eventsCacheMaxSizeBytes”)

I assume the default (512) based on this: History service memory usage - #4 by tihomir
I don’t see a place to set it in the Helm chart.
- I assume we can set it here: /etc/temporal/dynamic_config

samar · October 26, 2023, 4:10pm

These values might be too high for your setup. These cache sizes are per shard, so if your event payloads are large, the caches can quickly add up and take up all the memory on your history service. Based on the numbers you provided above:
MutableStateCacheSize = 512 (shards) X 512 (Default Max Items) = 256K Items
HistoryEvent Cache Size = 512 (shards) X 512 (Default Max Items) = 256K Items

So if you assume an average of 10 KB per item, these two caches could take up to 5 GB of memory. Can you share how many history service pods you have and what the memory limit is on those pods?

Can you try configuring the cache sizes to a smaller number and see if this helps with the OOM?

We have workstreams in progress to make this simpler by configuring the caches per history host and setting the limit in bytes rather than in cache items.
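The estimate above can be reproduced with a short calculation. This is a minimal Python sketch, not anything from the Temporal codebase: the function name is hypothetical, 10 KB is taken as 10 × 1024 bytes, and it assumes the worst case where every per-shard cache is completely full.

```python
def estimate_cache_bytes(shards: int, max_items_per_shard: int,
                         avg_item_bytes: int, num_caches: int = 2) -> int:
    """Worst-case memory if every per-shard cache is filled to its max size."""
    return shards * max_items_per_shard * avg_item_bytes * num_caches


KIB = 1024
GIB = 1024 ** 3

# 512 shards, 512 max items per cache, ~10 KB per item, two caches
# (mutable state cache and events cache).
total = estimate_cache_bytes(shards=512, max_items_per_shard=512,
                             avg_item_bytes=10 * KIB)
print(f"{total / GIB:.1f} GiB across the whole cluster")
```

The real per-item size depends on the workflows and payloads involved, so the result is only an upper-bound ballpark for the cluster as a whole.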

And yes, please use dynamic_config to override these limits.

sebastian_violet · October 26, 2023, 5:02pm

Hello @samar,

Currently, the history service has 3 pods each with 2 Gigs of memory.

I will try and adjust the caches to a lower number and see if that resolves it.

Thank You
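For a rough sense of what the per-shard formula means per pod, the same arithmetic can be split across the history pods. This is a hedged Python sketch: the function name is hypothetical, and it assumes shards are spread evenly over the 3 pods, that every cache is full, and that an item averages 10 KB, none of which is guaranteed in practice.

```python
import math


def per_pod_cache_bytes(shards: int, pods: int, max_items_per_shard: int,
                        avg_item_bytes: int, num_caches: int = 2) -> int:
    """Worst-case cache memory on one pod, assuming an even shard spread."""
    shards_per_pod = math.ceil(shards / pods)
    return shards_per_pod * max_items_per_shard * avg_item_bytes * num_caches


MIB = 1024 ** 2

# Compare a max size of 512 items per shard with a reduced value of 64.
for max_items in (512, 64):
    used = per_pod_cache_bytes(shards=512, pods=3,
                               max_items_per_shard=max_items,
                               avg_item_bytes=10 * 1024)
    print(f"max items {max_items}: ~{used / MIB:.0f} MiB per pod")
```

Under these assumptions, 512 items per shard comes out at roughly 1.7 GiB per pod, which is close to a 2 GB limit, while 64 items per shard comes out at roughly 0.2 GiB.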

sebastian_violet · October 26, 2023, 5:57pm

Here are the changes I made based on your formula above:

---
# Source: temporal/templates/server-dynamicconfigmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "temporaltest-dynamic-config"
  labels:
    app.kubernetes.io/name: temporal
    helm.sh/chart: temporal-0.29.0
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: temporaltest
    app.kubernetes.io/version: 1.22.1
    app.kubernetes.io/part-of: temporal
data:
  dynamic_config.yaml: |-
    history:
      cacheInitialSize:
      - value: 32
      cacheMaxSize:
      - value: 64
      eventsCacheInitialSize:
      - value: 32
      eventsCacheMaxSize:
      - value: 64

sebastian_violet · October 26, 2023, 6:04pm

Would an item in this case be a single line (an event) in the workflow, or the entire workflow (comprising multiple events)? I clicked download on a workflow, and the JSON file was 30K.

Also, is the cache here for the workers to be able to continue from where the last execution left off, or is it for the Web UI to display results quickly?

What are the ramifications of the following:

No cache
Small cache

We already know that a large cache takes up too much memory in the history pods.

Thank You

sebastian_violet · October 27, 2023, 12:43am

sebastian_violet:

    history:
      cacheInitialSize:
      - value: 32
      cacheMaxSize:
      - value: 64
      eventsCacheInitialSize:
      - value: 32
      eventsCacheMaxSize:
      - value: 64

Also, this doesn’t seem to work and throws the following error:

{"level":"error","ts":"2023-10-27T00:41:20.281Z","msg":"Unable to update dynamic config.","error":"unable to decode dynamic config: yaml: unmarshal errors:\n  line 2: cannot unmarshal !!map into []struct { Constraints map[string]interface {}; Value interface {} }","logging-call-at":"file_based_client.go:136","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/dynamicconfig.(*fileBasedClient).init.func1\n\t/home/builder/temporal/common/dynamicconfig/file_based_client.go:136"}
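The unmarshal error suggests that every top-level key in the dynamic config file has to map directly to a list of entries with a value field, so a nested history: map is rejected. The shape check can be sketched in Python using plain dicts in place of parsed YAML. The function name is hypothetical and this is not Temporal's actual validation code, only an illustration of the difference between the two layouts.

```python
def find_malformed_keys(config: dict) -> list[str]:
    """Return top-level keys whose value is not a list of {'value': ...} entries."""
    malformed = []
    for key, entries in config.items():
        is_entry_list = isinstance(entries, list) and all(
            isinstance(entry, dict) and "value" in entry for entry in entries
        )
        if not is_entry_list:
            malformed.append(key)
    return malformed


# Nested layout: "history" maps to another map, not to a list of entries.
nested = {"history": {"cacheMaxSize": [{"value": 64}]}}

# Flat layout: the dotted name is the key itself.
flat = {"history.cacheMaxSize": [{"value": 64}]}

print(find_malformed_keys(nested))
print(find_malformed_keys(flat))
```

The nested layout is flagged at the history key, while the flat layout passes the check.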

sebastian_violet · October 27, 2023, 12:49am

I see that the key itself has a . in it, so for Helm --set the dot needs to be escaped like so:

...
  --set "server.dynamicConfig.history\.cacheInitialSize[0].value"=32 \
  --set "server.dynamicConfig.history\.cacheMaxSize[0].value"=64 \
  --set "server.dynamicConfig.history\.eventsCacheInitialSize[0].value"=32 \
  --set "server.dynamicConfig.history\.eventsCacheMaxSize[0].value"=64 \
...

I.e. history\.<key>

This gives the following:

---
# Source: temporal/templates/server-dynamicconfigmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "temporaltest-dynamic-config"
  labels:
    app.kubernetes.io/name: temporal
    helm.sh/chart: temporal-0.29.0
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: temporaltest
    app.kubernetes.io/version: 1.22.1
    app.kubernetes.io/part-of: temporal
data:
  dynamic_config.yaml: |-
    history.cacheInitialSize:
    - value: 32
    history.cacheMaxSize:
    - value: 64
    history.eventsCacheInitialSize:
    - value: 32
    history.eventsCacheMaxSize:
    - value: 64

tihomir · October 28, 2023, 9:19pm

You should be able to edit values.dynamic_config.yaml directly. Does this not work for you?

sebastian_violet · October 30, 2023, 3:50pm

The setup from above works perfectly. I just had to escape the . when using --set during helm install. The . is part of the config name and not indicative of a sub property.

history.cacheInitialSize

vs

history:
  cacheInitialSize
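The effect of the escape can be illustrated with a simplified imitation of how a --set key path is split into segments. This Python sketch is not Helm's parser: the function name is hypothetical, and list indexes such as [0] are deliberately left out. It only shows that an unescaped dot starts a new nested level, while \. keeps the dot inside the key name.

```python
def split_set_key(key: str) -> list[str]:
    """Split a key path on unescaped dots; a backslash keeps the next char literal."""
    parts: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(key):
        ch = key[i]
        if ch == "\\" and i + 1 < len(key):
            # Escaped character: keep it as part of the current segment.
            current.append(key[i + 1])
            i += 2
        elif ch == ".":
            # Unescaped dot: close the current segment and start a new level.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


escaped = split_set_key(r"server.dynamicConfig.history\.cacheInitialSize")
unescaped = split_set_key("server.dynamicConfig.history.cacheInitialSize")
print(escaped)
print(unescaped)
```

With the escape, the last segment is the single key history.cacheInitialSize. Without it, history and cacheInitialSize become two nested levels, which is the layout the server rejected.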
