Timeouts on visibility-tasks but workflows still completing

Seeing a lot of errors when running around 10k workflows in short succession. DB load doesn't go above 60%, so I don't believe it's a performance issue there, at least. The errors in the history service logs look like:

visibility task 21~47186285 timedout waiting for ACK after 1m0s

Unsure if it's related, but the matching service is also throwing a lot of errors that look like the following:

"error":"Workflow task not found."

Again, everything eventually completes fine, but it's taking longer than I'd expect. Does anyone have suggestions for tracking down the root cause?

Visibility tasks push data to Elasticsearch. It looks like your ES cluster is underpowered.

Just to add: for visibility, you can measure visibility task latencies via server metrics, which can be a good indication of possible issues:

histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))

as well as a similar query for the task_latency_queue_bucket metric, which shows how long visibility tasks sit in the queue waiting to be processed.
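
For example, a sketch of that queue-latency query, assuming the same label names as the query above:

histogram_quantile(0.95, sum(rate(task_latency_queue_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))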

Regarding "error":"Workflow task not found." this can be a transient error (so not affect your workflow executions) but would check your matching service cpu utilization as well.

Thanks, that seemed to be it! I was using the count API to view my workflows, and that was going very slowly. I hadn't registered that it's powered by Elasticsearch, so Temporal itself was working fine!
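
For anyone landing here later: to confirm it's the visibility read path (the count API) that is slow rather than workflow processing itself, a frontend-side latency query along these lines might help. This assumes the service_latency histogram is emitted like task_latency above and that the operation label matches the gRPC method name, which I haven't verified on every version:

histogram_quantile(0.95, sum(rate(service_latency_bucket{service_name="frontend", operation="CountWorkflowExecutions"}[1m])) by (le))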