Recommendation for K8S Cluster; currently using default values

Hello Team,

Our team has deployed Temporal using the Helm charts provided here: https://github.com/temporalio/helm-charts

Recently, we got the following error when trying to create a new schedule:

StatusRuntimeException: UNAVAILABLE: Not enough hosts to serve the request

After doing a bit of research, it turns out that the temporaltest-history pods were being OOMKilled (exit code 137) and restarting over and over again.

The deployment is on AWS EKS, and the pods have the minimum amount of resources for memory and CPU: 0.25 CPU units and 0.5 GiB of memory.
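For reference, the resource block on the temporaltest-history pods currently looks roughly like this (an illustrative sketch of the Kubernetes requests/limits; I'm assuming the limits are set equal to the requests):

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 250m
    memory: 512Mi

Given the exit code 137, the 512Mi memory limit appears to be what the history pods keep hitting.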

As we are just starting to use Temporal, our average/peak throughput is under 10 workflows per second (WPS).

Another topic has the following recommendations:

  • Frontend: …Start with 3 instances of 4 cores and 4GB memory.
  • History: …Start with 5 history instances with 8 cores and 8 GB memory.
  • Matching: …Start with 3 matching instances each with 4 cores and 4 GB memory.
  • Worker: …You can just start with 2 instances each with 4 cores and 4 GB memory.

However, these values seem a bit high for our current operations scale.

Can anyone give recommendations based on throughput?

@samar for input.
Thank You

@sebastian_violet can you share the following information for your cluster?

  1. Temporal Server Version
  2. Number of Shards
  3. Config: Max History Cache Size (“history.cacheMaxSize”)
  4. Config: Events Cache Size (“history.eventsCacheMaxSizeBytes”)
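(If it helps to pull these from the deployed release, something like the following should dump the computed chart values; the release name temporaltest is an assumption based on the labels further down, and add --namespace if the release is not in your current namespace:)

helm get values temporaltest --all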

Here are the values:

  1. Temporal Server Version
server.image.repository=temporalio/server
server.image.tag=1.22.1
  2. Number of Shards
server.config.numHistoryShards=512
  3. Config: Max History Cache Size (“history.cacheMaxSize”): not set, using the default
  4. Config: Events Cache Size (“history.eventsCacheMaxSizeBytes”): not set, using the default

These values might be too high for your setup. These cache sizes are per shard, so if you have larger event payloads it can quickly add up and consume all the memory on your history service. Based on the numbers you provided above:
MutableState Cache Size = 512 (shards) × 512 (default max items) = 256K items
HistoryEvent Cache Size = 512 (shards) × 512 (default max items) = 256K items

So if you assume an average of 10 KB per item, these two caches could take up to about 5 GB of memory. Can you share how many history service pods you have and what the memory limit on those pods is?
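Spelling out the arithmetic behind that ~5 GB figure:

256K items × ~10 KB/item ≈ 2.6 GB per cache
2 caches × ~2.6 GB ≈ 5 GB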

Can you try configuring the cache sizes to a smaller number and see if this helps with the OOM?

We have workstreams in progress to make this simpler by configuring the limit per history host and expressing it in bytes rather than in cache items.

And yes, please use dynamic_config to override these limits.

Hello @samar,

Currently, the history service has 3 pods, each with 2 GB of memory.

I will try and adjust the caches to a lower number and see if that resolves it.

Thank You

Here are the changes I made based on your formula above:

---
# Source: temporal/templates/server-dynamicconfigmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "temporaltest-dynamic-config"
  labels:
    app.kubernetes.io/name: temporal
    helm.sh/chart: temporal-0.29.0
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: temporaltest
    app.kubernetes.io/version: 1.22.1
    app.kubernetes.io/part-of: temporal
data:
  dynamic_config.yaml: |-
    history:
      cacheInitialSize:
      - value: 32
      cacheMaxSize:
      - value: 64
      eventsCacheInitialSize:
      - value: 32
      eventsCacheMaxSize:
      - value: 64

Would an item in this case be a single line (an event) in the workflow, or the entire workflow (comprised of multiple events)? I clicked download on a workflow, and it showed about 30 KB for the JSON file.

Also, is the cache here for the workers to be able to continue from where the last execution left off, or is it for the Web UI to display results quickly?

What are the ramifications of the following:

  • No cache
  • Small cache

We already know a large cache takes up too much memory in the history pods.

Thank You

Also, this doesn’t seem to work and throws the following error:

{"level":"error","ts":"2023-10-27T00:41:20.281Z","msg":"Unable to update dynamic config.","error":"unable to decode dynamic config: yaml: unmarshal errors:\n  line 2: cannot unmarshal !!map into []struct { Constraints map[string]interface {}; Value i │
│ nterface {} }","logging-call-at":"file_based_client.go:136","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/dynamicconfig.(*fileBasedClient).init.f │
│ unc1\n\t/home/builder/temporal/common/dynamicconfig/file_based_client.go:136"}

I see that the key itself has a . in it. Thus, for Helm --set, the dot needs to be escaped like so:

...
  --set "server.dynamicConfig.history\.cacheInitialSize[0].value"=32 \
  --set "server.dynamicConfig.history\.cacheMaxSize[0].value"=64 \
  --set "server.dynamicConfig.history\.eventsCacheInitialSize[0].value"=32 \
  --set "server.dynamicConfig.history\.eventsCacheMaxSize[0].value"=64 \
...

I.e. history\.<key>
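To illustrate what the escape changes (a generic example, not our actual values):

--set "history.cacheMaxSize[0].value=64"     # dot treated as nesting: history: { cacheMaxSize: [ { value: 64 } ] }
--set "history\.cacheMaxSize[0].value=64"    # escaped dot keeps the literal key: "history.cacheMaxSize": [ { value: 64 } ]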

This gives the following:

---
# Source: temporal/templates/server-dynamicconfigmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "temporaltest-dynamic-config"
  labels:
    app.kubernetes.io/name: temporal
    helm.sh/chart: temporal-0.29.0
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/instance: temporaltest
    app.kubernetes.io/version: 1.22.1
    app.kubernetes.io/part-of: temporal
data:
  dynamic_config.yaml: |-
    history.cacheInitialSize:
    - value: 32
    history.cacheMaxSize:
    - value: 64
    history.eventsCacheInitialSize:
    - value: 32
    history.eventsCacheMaxSize:
    - value: 64

You should be able to edit values.dynamic_config.yaml directly; does this not work for you?

The setup from above works perfectly. I just had to escape the . when using --set during helm install. The . is part of the config name and not indicative of a sub-property.

history.cacheInitialSize

vs

history:
  cacheInitialSize
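For completeness, the same flat keys can also be supplied through a values file instead of --set; a sketch, assuming the chart's server.dynamicConfig map that the --set flags above target:

server:
  dynamicConfig:
    history.cacheInitialSize:
      - value: 32
    history.cacheMaxSize:
      - value: 64
    history.eventsCacheInitialSize:
      - value: 32
    history.eventsCacheMaxSize:
      - value: 64

Passing that with -f on helm install/upgrade should render the same dynamic_config.yaml ConfigMap shown above.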