Hi, many of the metrics are pretty straightforward to understand, but I have a few questions about the ones below. The restart graph shows a value greater than 0, whereas k8s reports 0 restarts. Can you please explain what this graph represents and whether it is a cause for concern?
In our test with two task queues, each taking 400 new workflows/sec, errored workflows account for about 0.3% of the total over a period of 8 hours. However, this graph shows a large gap (780 - 550 = 230). What does this gap represent? The failed count is 0, and we are not seeing that many errors in our test.
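To make the mismatch concrete, here is a rough back-of-the-envelope calculation using only the figures above. It assumes the two graph values (780 and 550) are per-second rates comparable to the workflow start rate. That is my assumption, since the graph units are not confirmed, and the variable names are just illustrative.

```python
# Figures taken from the test description above.
task_queues = 2
starts_per_queue_per_sec = 400
error_fraction = 0.003  # 0.3% errored workflows over the 8-hour run

# Values read off the graph. Treating them as per-second rates is an assumption.
graph_upper = 780
graph_lower = 550

# Combined start rate across both task queues.
total_starts_per_sec = task_queues * starts_per_queue_per_sec

# Error rate we would expect to see if only 0.3% of workflows error out.
expected_errors_per_sec = total_starts_per_sec * error_fraction

# Gap shown on the graph, in absolute terms and relative to the upper series.
graph_gap = graph_upper - graph_lower
gap_share = graph_gap / graph_upper

print(f"total starts/sec: {total_starts_per_sec}")
print(f"expected errors/sec at 0.3%: {expected_errors_per_sec:.1f}")
print(f"graph gap: {graph_gap} ({gap_share:.1%} of the upper series)")
```

If that assumption holds, the gap on the graph is far larger than the error rate we observe, which is why I suspect it measures something other than errored workflows.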
What are these entity-not-found errors on the history and frontend servers, and how can we fix them? Again, these errors do not show up in our test, where the error rate is only 0.3% over 8 hours.