I wonder if there’s a metric that shows how many of the available slots on the history nodes are actually used, and whether there’s a metric that reports evictions from the cache, so I can further tune the cache settings.
Note that these are per-shard configs, so the number of shards and the number of history hosts play a role here, along with the resources (heap size) you allocate to your history hosts.
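As a rough illustration of how those factors combine, here is a minimal sketch that estimates the upper bound on cached entries a single history host could hold. It assumes shards are spread evenly across history hosts and that the per-shard limit is an entry count; the function name, parameters, and numbers are hypothetical and not taken from any Temporal config.

```python
import math


def max_cache_entries_per_host(
    num_shards: int, num_history_hosts: int, per_shard_max_entries: int
) -> int:
    """Estimate the worst-case number of cache entries on one history host.

    Assumption: shards are distributed evenly, so the busiest host owns
    ceil(num_shards / num_history_hosts) shards.
    """
    if num_history_hosts <= 0:
        raise ValueError("num_history_hosts must be positive")
    shards_per_host = math.ceil(num_shards / num_history_hosts)
    return shards_per_host * per_shard_max_entries


# Hypothetical cluster: 512 shards, 4 history hosts, 512 entries per shard.
print(max_cache_entries_per_host(512, 4, 512))
```

Multiplying that estimate by your typical entry size gives a ballpark for the heap the cache can claim on each host.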
For overall cache size, the history service emits cache_size, which you can filter by cache_type (either “events” or “mutablestate”).
For usage, the history service emits the cache_usage metric, which you can again filter by cache_type.
There is also a cache_miss metric that you can filter by operation and cache_type. Note that a cache miss is expected on a new workflow execution, and also when events happen after an execution completes, for example a user timer firing after your workflow has already completed (was not canceled).
As far as eviction goes, the shard cache starts evicting once it is full, not before (so when cache_usage hits its configured max). You can look at cache_entry_age_on_eviction to understand how long an entry was in the cache before it was evicted.
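To turn those metrics into something you can tune against, here is a minimal sketch that computes utilization as cache_usage divided by cache_size for one cache_type and flags when the cache is full and therefore likely evicting. The CacheSample type, the helper names, and the sample values are hypothetical; how you fetch the values depends on your metrics backend.

```python
from dataclasses import dataclass


@dataclass
class CacheSample:
    """One reading of the history cache metrics for a given cache_type."""
    cache_type: str      # "events" or "mutablestate"
    cache_size: float    # value of the cache_size metric
    cache_usage: float   # value of the cache_usage metric


def utilization(sample: CacheSample) -> float:
    """Fraction of the cache in use; 0.0 if no size was reported."""
    if sample.cache_size <= 0:
        return 0.0
    return sample.cache_usage / sample.cache_size


def likely_evicting(sample: CacheSample, threshold: float = 1.0) -> bool:
    """Eviction only starts once the cache is full, so flag usage at the max."""
    return utilization(sample) >= threshold


# Hypothetical readings, not real cluster data.
partly_used = CacheSample("mutablestate", cache_size=512, cache_usage=384)
full = CacheSample("mutablestate", cache_size=512, cache_usage=512)

print(utilization(partly_used), likely_evicting(partly_used))
print(utilization(full), likely_evicting(full))
```

If utilization sits at the max and cache_entry_age_on_eviction is short, entries are being pushed out quickly and a larger cache may help; if utilization stays low, the cache is probably oversized for the current load.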