Metrics about history node host level cache usage

Hi there,

I wonder whether there’s a metric that shows how many of the cache slots available on the history nodes are actually in use, and whether there’s a metric that reports evictions from the cache, so that I can tune the cache settings further.

The OSS Temporal Service metrics reference (Temporal Platform Documentation) does not seem to cover anything in that area, but I can imagine things are not always documented immediately.

Cheers, Frank

I assume you are asking about tuning the history cache and events cache dynamic configs:

history.cacheInitialSize default 128
history.cacheMaxSize default 512
history.eventsCacheInitialSize default 128
history.eventsCacheMaxSize default 512

Note that these are per-shard configs, so the number of shards and the number of history hosts play a role here, along with the resources (heap size) you allocate to your history hosts.
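To make the per-shard point concrete, here is a rough back-of-the-envelope sketch. It assumes shards are spread evenly across history hosts and that each shard can fill its cache up to the configured max; the function name and the example numbers (512 shards, 4 hosts) are made up for illustration and are not taken from any Temporal API.

```python
import math


def max_cache_entries_per_host(num_shards, num_history_hosts, cache_max_size_per_shard):
    """Rough upper bound on cached entries held by one history host.

    Assumption: shards are distributed evenly across hosts, so a host owns
    at most ceil(num_shards / num_history_hosts) shards.
    """
    if num_history_hosts <= 0:
        raise ValueError("num_history_hosts must be positive")
    shards_per_host = math.ceil(num_shards / num_history_hosts)
    return shards_per_host * cache_max_size_per_shard


# Hypothetical cluster: 512 shards, 4 history hosts, history.cacheMaxSize = 512
print(max_cache_entries_per_host(512, 4, 512))  # 128 shards * 512 entries = 65536
```

The same calculation applies to history.eventsCacheMaxSize. The result is only an entry count; the memory it translates to depends on how large your mutable state and events are.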

For overall cache size, the history service emits cache_size, which you can filter by cache_type (either “events” or “mutablestate”).

For usage, the history service emits the cache_usage metric, which you can again filter by cache_type.

You also have the cache_miss metric, which you can filter by operation and cache_type. Note that a cache miss is expected on a new workflow execution, and also when events happen after an execution completes, for example a user timer firing after your workflow has already completed (was not canceled).

As far as eviction goes, the shard cache starts evicting once it is full, not before (so when cache_usage hits its configured max). You can look at cache_entry_age_on_eviction to understand how long an entry was in the cache before it was evicted.
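If you pull the values of these metrics out of your monitoring system, a small check along these lines can tell you how close a cache is to evicting. This is only a sketch: the function names are hypothetical, the inputs are plain numbers you have already fetched for one cache_type, and it assumes (as described above) that eviction begins once usage reaches the configured max.

```python
def cache_utilization(cache_usage, cache_max_size):
    """Fraction of the configured max currently in use (0.0 to 1.0+)."""
    if cache_max_size <= 0:
        raise ValueError("cache_max_size must be positive")
    return cache_usage / cache_max_size


def is_likely_evicting(cache_usage, cache_max_size):
    """Assumption: the cache only evicts once it is full."""
    return cache_usage >= cache_max_size


# Hypothetical sample: cache_usage reading of 384 against a max of 512
usage, max_size = 384, 512
print(cache_utilization(usage, max_size))   # 0.75
print(is_likely_evicting(usage, max_size))  # False
```

If utilization sits at 1.0 and cache_entry_age_on_eviction is short, that would suggest the max is too small for your workload; if utilization stays well below 1.0, raising the max is unlikely to change much.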