History service doing all the work?

Not sure if this is normal. We have each service running in its own container. History seems to be the only one doing any real work. The one with SERVICES=worker set is basically idle, and has this in its logs:

{"level":"info","ts":"2023-01-27T12:01:00.701Z","msg":"Taskqueue scavenger stopped","service":"worker","logging-call-at":"scavenger.go:149"}

Does that mean it’s not working correctly? Our actual workflows are completing successfully; I’m just surprised the “worker” service isn’t doing anything :thinking:

The history service has SERVICES=history only, so afaik that would prevent it from acting as a worker? Is there a good way to validate each service is behaving as expected?

tctl admin cluster describe
{
  "supportedClients": {
    "temporal-cli": "\u003c2.0.0",
    "temporal-go": "\u003c2.0.0",
    "temporal-java": "\u003c2.0.0",
    "temporal-php": "\u003c2.0.0",
    "temporal-server": "\u003c2.0.0",
    "temporal-typescript": "\u003c2.0.0",
    "temporal-ui": "\u003c3.0.0"
  },
  "serverVersion": "1.19.0",
  "membershipInfo": {
    "currentHost": {
      "identity": "10.0.12.61:7233"
    },
    "reachableMembers": [
      "10.0.12.61:6933",
      "10.0.12.109:6934",
      "10.0.11.91:6935",
      "10.0.10.158:6939"
    ],
    "rings": [
      {
        "role": "frontend",
        "memberCount": 1,
        "members": [
          {
            "identity": "10.0.12.61:7233"
          }
        ]
      },
      {
        "role": "history",
        "memberCount": 1,
        "members": [
          {
            "identity": "10.0.12.109:7234"
          }
        ]
      },
      {
        "role": "matching",
        "memberCount": 1,
        "members": [
          {
            "identity": "10.0.11.91:7235"
          }
        ]
      },
      {
        "role": "worker",
        "memberCount": 1,
        "members": [
          {
            "identity": "10.0.10.158:7239"
          }
        ]
      }
    ]
  },
  "clusterId": "9245c85b-33e8-4862-901e-3c3d84c2a9cf",
  "clusterName": "active",
  "historyShardCount": 4096,
  "persistenceStore": "postgres",
  "visibilityStore": "postgres",
  "failoverVersionIncrement": "10",
  "initialFailoverVersion": "1"
}

Looks like each service is registered OK :thinking:

Interesting…

telnet 10.0.10.158 7239
telnet: can't connect to remote host (10.0.10.158): Connection refused

It’s like it’s not actually listening on the port it says it is running on. FWIW I can hit the other services via telnet.

I don’t have anything set for bindOnIP or broadcast addresses. Those are all valid IPs in our local AWS VPC.

Still not sure whether it’s expected that the history service would run the tasks if the worker isn’t available, or whether this is all normal.

On the worker service itself, it’s not even running a process that’s listening on that port :astonished:

netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:2999          0.0.0.0:*               LISTEN      -
tcp        0      0 10.0.10.158:6939        0.0.0.0:*               LISTEN      -

Seems very weird, considering it advertised to the “ring” that it’s ready on port 7239?

Taskqueue scavenger stopped

This is just an info message. The task queue scavenger is a background workflow (run by the worker service) that removes unused task queues and stale tasks. The message means one of its runs has completed.

Seems very weird, considering it advertised to the “ring” that it’s ready on port 7239?

7239 is the default gRPC port for the worker service, see here.
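
If you want to see or pin down those values rather than rely on defaults, the per-service ports and bind address live in the server config (development.yaml). A rough sketch of the relevant bits, assuming the standard config layout; the addresses and values here are illustrative only, not your actual deployment:

global:
  membership:
    # address advertised to the membership ring
    broadcastAddress: "10.0.10.158"

services:
  worker:
    rpc:
      grpcPort: 7239        # default gRPC port for the worker service
      membershipPort: 6939  # gossip/membership port (the one netstat showed listening)
      bindOnIP: "0.0.0.0"   # bind on all interfaces instead of only localhost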

Are you deploying the services on Docker or K8s?

Just as an example, if you wanted to bash into your worker service with Docker:

docker ps
(find the container id for your worker container)
docker exec -it <worker_container_id> bash

Another way to check that your worker service is “doing something” is via tctl, for example:

tctl -ns temporal-system wf l

Thanks!

Yeah, if I run tctl -ns temporal-system wf l I see tasks.

I guess my question is, when someone dispatches a “real” task via an SDK, should it be running on history? Or worker?

I would have thought “worker” is the service that does the real work, so kinda confused.

We are deploying on AWS ECS, which is basically docker.

I would have thought “worker” is the service that does the real work, so kinda confused.

Understood, and yes, the “worker service” name is confusing. Maybe think of it as the “system workflows service”. It runs background system workflows for things like the scavengers you noticed, but also batch operations, archival, namespace retention, etc.

I guess my question is, when someone dispatches a “real” task via an SDK, should it be running on history

From a high-level view, the frontend, history, and matching services are the ones directly involved with your workflow executions. You can think of the frontend service as a “dispatcher”: all client calls go through it and it forwards them on. The history service is the one that hosts your shards. Each shard hosts one or more of your workflow executions, meaning it is responsible for creating and maintaining the execution event history (and persisting it) and the mutable state (the server’s view of the execution), and it has transfer queues responsible for shipping events to other services such as matching (for workflow tasks, activity tasks, timers, etc.) as well as visibility (moving visibility tasks to your visibility store).
The matching service receives workflow/activity tasks from the history service and hosts the task queues that your workers poll for workflow and activity tasks (your workers actually poll the frontend service, but the frontend forwards those poll requests to matching).
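
To make the distinction concrete: the processes that run your “real” workflow and activity code are SDK workers that you run yourself, entirely separate from the server’s worker service. A minimal Go SDK sketch (the frontend address and task queue name are placeholders, and the registered workflow/activity are assumed to exist in your own code):

package main

import (
    "log"

    "go.temporal.io/sdk/client"
    "go.temporal.io/sdk/worker"
)

func main() {
    // Connect to the frontend service (the "dispatcher"); address is a placeholder.
    c, err := client.Dial(client.Options{HostPort: "10.0.12.61:7233"})
    if err != nil {
        log.Fatalln("unable to create Temporal client:", err)
    }
    defer c.Close()

    // This SDK worker polls the "example-task-queue" task queue (hosted by matching)
    // and executes your workflow and activity code in this process.
    w := worker.New(c, "example-task-queue", worker.Options{})
    // w.RegisterWorkflow(YourWorkflow)
    // w.RegisterActivity(YourActivity)

    if err := w.Run(worker.InterruptCh()); err != nil {
        log.Fatalln("worker exited:", err)
    }
}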

Hope this helps a little.

Very helpful! Thank you :slight_smile: