Error: Task queue range ID was X when it was should have been Y

Question:

What is the severity of this error log message and is the issue automatically reconciled?

    {
      "level": "error",
      "ts": "2022-03-22T00:23:20.525Z",
      "msg": "Persistent store operation failure",
      "service": "matching",
      "component": "matching-engine",
      "wf-task-queue-name": "/_sys/app-name/3",
      "wf-task-queue-type": "Activity",
      "wf-namespace": "default",
      "store-operation": "update-task-queue",
      "error": "Task queue range ID was 631 when it was should have been 630",
      "logging-call-at": "taskReader.go:187",
      "stacktrace": "go.temporal.io/server/common/log.(*zapLogger).Error\n\t/temporal/common/log/zap_logger.go:142\ngo.temporal.io/server/service/matching.(*taskReader).getTasksPump\n\t/temporal/service/matching/taskReader.go:187\ngo.temporal.io/server/internal/goro.(*Group).Go.func1\n\t/temporal/internal/goro/group.go:57"
    }

Story:

I am testing a database outage to understand how Temporal behaves in this scenario. Many of these errors were logged over the course of 8 hours after the event.

How to reproduce:

Remove the database credentials that the server components use to connect to the database for approximately 5 minutes. I am using MySQL. Then restore the credentials, or update the components with new credentials that do have access.
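
One way to simulate the missing credentials against MySQL is to temporarily change the Temporal server account's password and later change it back. The sketch below is only an illustration: the admin DSN, the account name, and both passwords are placeholders for your environment, and server components that already hold open connections may keep working until they reconnect.

    package main

    import (
        "database/sql"
        "log"
        "time"

        _ "github.com/go-sql-driver/mysql"
    )

    func main() {
        // Placeholder admin DSN; point this at the MySQL instance Temporal uses.
        admin, err := sql.Open("mysql", "root:rootpw@tcp(127.0.0.1:3306)/")
        if err != nil {
            log.Fatal(err)
        }
        defer admin.Close()

        // Break the credentials the server components use (placeholder account).
        if _, err := admin.Exec("ALTER USER 'temporal'@'%' IDENTIFIED BY 'broken-password'"); err != nil {
            log.Fatal(err)
        }

        // Hold the outage window, roughly the 5 minutes described above.
        time.Sleep(5 * time.Minute)

        // Restore the original credentials.
        if _, err := admin.Exec("ALTER USER 'temporal'@'%' IDENTIFIED BY 'original-password'"); err != nil {
            log.Fatal(err)
        }
    }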

Source:

Hi @buhay

What is the severity of this error log message and is the issue automatically reconciled?

If the error is transient, meaning it resolves itself, then it can be ignored.

By transient or resolved, I take you to mean that log lines identifying the mismatch in task queue range id eventually stop? Or should I be looking for another log line that identifies the mismatch as reconciled?

that log lines identifying the mismatch in task queue range id eventually stop

Yes. In your described outage test, it seems you are blocking and then re-enabling DB access, and these errors can be expected in that case. Just make sure they are transient (they eventually stop after access is restored, as you mention).

First of all, thanks for the awesome tooling. Running it has provided some pretty sweet early wins. I’m super excited to keep rolling out use cases.

We’ve experienced this issue a couple of times with varying degrees of impact. The most recent one resulted in “Scheduled Time to Start” going from ~90ms to 15s, which probably wouldn’t be a huge issue for offline processing, but we’re currently using Temporal mostly for customer-facing Saga support.

During this last issue we were running frontend, matching, history, and workers, all as two replicas in our k8s cluster. It looks like we lost one of our frontend pods and k8s added another one. That deploy sent matching into a “Task queue range ID was X when it was should have been X-1” tailspin.

After this most recent issue, we have horizontally scaled our replica count, and while we still see a few of these errors on deploy, they are much fewer. Are there additional steps we should be taking to mitigate this issue?

This error message itself is benign. It means that when the matching service tried to update the task queue (usually to advance the ack level or update the range ID), the range ID in the database had already changed. This can happen when ownership of the task queue moves from one matching service instance to another, usually because membership in the ring changed (a host was added or removed, or a k8s pod was terminated and replaced). It should stabilize pretty quickly, within a minute or so. However, if you keep seeing this error log continuously for a long time, it could indicate a real problem.
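
To make the mechanics concrete, here is a rough sketch of the optimistic-concurrency check behind this error. It is not Temporal's actual schema or persistence code, just the shape of the idea: the matching host conditions its update on the range ID it believes it owns, and if another host has taken ownership and bumped the range ID, the conditional update matches nothing and fails.

    package taskqueue

    import (
        "database/sql"
        "fmt"
    )

    // updateTaskQueue only succeeds if the stored range ID still matches the
    // value this matching host expects to own. Table and column names are
    // illustrative, not Temporal's real schema.
    func updateTaskQueue(db *sql.DB, name string, expectedRangeID, newAckLevel int64) error {
        res, err := db.Exec(
            "UPDATE task_queues SET ack_level = ? WHERE name = ? AND range_id = ?",
            newAckLevel, name, expectedRangeID,
        )
        if err != nil {
            return err
        }
        rows, err := res.RowsAffected()
        if err != nil {
            return err
        }
        if rows == 0 {
            // Another host took ownership and incremented the range ID; the
            // caller must re-acquire the lease (and a new range ID) before retrying.
            return fmt.Errorf("task queue range ID changed: expected %d", expectedRangeID)
        }
        return nil
    }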

Thanks for the response Yimin!

That makes total sense and is consistent with some of the testing that we’ve done. These errors can show up when membership changes, they usually stop fairly quickly, and they don’t tend to result in other issues.

I think I’m still confused about a few things:

  1. Does this usually impact ScheduledToStart latency by a few orders of magnitude?
  2. Is this common to self hosted clusters?
  3. Is this common to the cloud offering?

Thanks again for your help

This should not impact ScheduleToStart latency. Make sure you have enough pollers for your task queue. The default task queue partition count is 4, and you should have at least 10 pollers, or even more, to make sure each partition has some pollers (see the worker sketch after the config below). If you have very low traffic, you may want to reduce the task queue partition count via dynamic config:

    matching.numTaskqueueReadPartitions:
    - value: 1
      constraints: {}
    matching.numTaskqueueWritePartitions:
    - value: 1
      constraints: {}
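
On the worker side, poller counts are set through the SDK's worker options. Here is a minimal sketch with the Go SDK that raises both poller counts to 10; the task queue name and the numbers are placeholders, and the client options assume the default local server address and namespace.

    package main

    import (
        "log"

        "go.temporal.io/sdk/client"
        "go.temporal.io/sdk/worker"
    )

    func main() {
        // Assumes the default server address (localhost:7233) and namespace ("default").
        c, err := client.Dial(client.Options{})
        if err != nil {
            log.Fatal(err)
        }
        defer c.Close()

        // Raise poller counts so every task queue partition gets pollers.
        w := worker.New(c, "app-name", worker.Options{
            MaxConcurrentWorkflowTaskPollers: 10,
            MaxConcurrentActivityTaskPollers: 10,
        })

        // Register workflows and activities here, then run the worker.
        if err := w.Run(worker.InterruptCh()); err != nil {
            log.Fatal(err)
        }
    }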

The matching service losing ownership is common whenever k8s replaces your pods or does something similar. But otherwise, it should not just happen on its own.

AWESOME! Thank you

What counts as “very low traffic”?
A. 1 workflow per minute
B. 1 workflow per second
C. 10 workflows per second
D. 100 workflows per second

Just trying to get an order of magnitude here