Activity scheduled but not started (need help)

Hi all,
We have recently been running some tests on our workflow, but we hit an intermittent problem: some of the workflows get stuck with the activity scheduled but not started.

Is there any advice on how to fix this problem?
Thank you in advance.

coroutine root [blocked on chan-2.Receive]:
go.temporal.io/sdk/internal.(*decodeFutureImpl).Get(0xc0001ae5b8, 0x1380db0, 0xc00039e270, 0x0, 0x0, 0x2, 0x2)
/build/vendor/go.temporal.io/sdk/internal/internal_workflow.go:1334 +0x52


This trace is from the Temporal Go SDK.

This is where it gets stuck:
err = workflow.ExecuteActivity(ctx, ProcessTrack, processTrackReq, activityArgsMap).Get(ctx, nil)

It appears you are blocked on .Get while it waits for the activity to complete. But if it’s only showing activity scheduled, it’s possible the activity is not getting picked up by the worker for whatever reason. Please confirm you have a running worker for the activity queue.

If you want to error when an activity isn't started within a certain amount of time, you can set the schedule-to-start timeout in the activity options (or, if you want a single timeout covering both scheduling and completion, set the schedule-to-close timeout).

Just to add, here is a video that explains activity timeouts.
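
For reference, a minimal sketch of how those options can be set in the Go SDK before the ExecuteActivity call. ProcessTrack and its arguments are taken from your snippet; the ProcessTrackRequest type and the timeout values are assumptions/placeholders, not recommendations:

import (
    "time"

    "go.temporal.io/sdk/workflow"
)

// Sketch only: ProcessTrackRequest and the timeout values are illustrative.
func ProcessTrackWorkflow(ctx workflow.Context, processTrackReq ProcessTrackRequest, activityArgsMap map[string]string) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout:    60 * time.Second, // max duration of a single activity attempt
        ScheduleToStartTimeout: 10 * time.Second, // error out if no worker picks the task up in time
        // ScheduleToCloseTimeout: 90 * time.Second, // alternative: cap scheduling plus all retry attempts
    }
    ctx = workflow.WithActivityOptions(ctx, ao)
    return workflow.ExecuteActivity(ctx, ProcessTrack, processTrackReq, activityArgsMap).Get(ctx, nil)
}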

Hi Chad,

Yes, we have a running worker for the activity task queue.


Currently we only set the start-to-close timeout to 60s.
Is it enough to only set the schedule-to-close timeout?

In this post, How to test Server outageous - #3 by anil_kumble,
Maxim stated that we don't really need to configure the schedule-to-start timeout.

Our activity is not a long-running activity, just a REST API call. Is there any suggestion on what value we should configure?
Thank you

Check your logs and/or metrics to potentially see why an activity may not get run on your worker. Assuming the activity task queue is the same as the workflow's, from the limited information provided I don't see any reason why the activity is not getting started.
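
One cheap thing you can do is log at the very top of the activity itself, so the worker log shows whether the task was ever picked up. A rough sketch, assuming a signature roughly like the one in your snippet (ProcessTrackRequest and the argument list are guesses):

import (
    "context"

    "go.temporal.io/sdk/activity"
)

// Sketch only: the request type and argument list are assumptions based on the thread.
func ProcessTrack(ctx context.Context, req ProcessTrackRequest, args map[string]string) error {
    info := activity.GetInfo(ctx)
    activity.GetLogger(ctx).Info("ProcessTrack picked up by worker",
        "workflowID", info.WorkflowExecution.ID,
        "activityID", info.ActivityID,
        "attempt", info.Attempt)

    // ... existing REST call goes here ...
    return nil
}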

Can you reproduce this easily? Can you reduce that reproduction down to a small set of runnable sample code we can help you debug?

Basically, a start-to-close timeout won’t include how long an activity takes to go from scheduled to started. You need either schedule-to-start or schedule-to-close if you want to timeout when an activity is slow to schedule. There is no specific suggestion, just give it a timeout that you expect it to start within (or give schedule-to-close the timeout you expect it to start and then complete within).

For example, if you expect an activity including all retries should only take 10 seconds, set the schedule-to-close to 10 seconds. But this will just return an error. This does not solve your problem of why your activity is not running in the first place.
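
For completeness, a rough sketch of what that looks like from the workflow side when the timeout fires (same names as your snippet, inside the workflow function, with errors and go.temporal.io/sdk/temporal imported):

// Sketch only: .Get returns an error once a schedule-to-close (or other) timeout fires.
err = workflow.ExecuteActivity(ctx, ProcessTrack, processTrackReq, activityArgsMap).Get(ctx, nil)
var timeoutErr *temporal.TimeoutError
if errors.As(err, &timeoutErr) {
    workflow.GetLogger(ctx).Error("ProcessTrack timed out", "timeoutType", timeoutErr.TimeoutType())
    return err
}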

Hi Chad, from the worker log perspective we saw nothing, because, as you mentioned earlier, the process is blocked in .Get().

The interesting thing here is that it only affects about one out of every ten transactions we tested. We did see that the workers are polling the task queue, since it is reflected in the Temporal Web UI. Please find the image below.

I read in some articles that the Web UI will only display the activity started event after it has completed/failed. So is it possible that the worker fetched the task but is not processing it? But if it was picked up by the worker, technically the start-to-close timeout would have already kicked in.

Any additional suggestions on how to identify the problem?

Edit: added task queue screenshot

Correct. It is possible the worker has reached its maximum concurrent activities, but one would expect the task to be picked up at some point as other activities complete, unless those other activities are not completing. That would result in the activity poller not polling anymore (which would show in the UI after some time, I believe).
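
If you want to check or raise those limits, the relevant knobs are on the worker options. A minimal sketch (the connection options, task queue name, and values here are placeholders, not a recommendation; client.Dial is from recent Go SDK versions):

import (
    "log"

    "go.temporal.io/sdk/client"
    "go.temporal.io/sdk/worker"
)

func main() {
    // Sketch only: connection options, task queue name, and limits are placeholders.
    c, err := client.Dial(client.Options{})
    if err != nil {
        log.Fatalln("unable to create Temporal client", err)
    }
    defer c.Close()

    w := worker.New(c, "process-track-task-queue", worker.Options{
        MaxConcurrentActivityExecutionSize: 100, // cap on activities executing at once on this worker
        MaxConcurrentActivityTaskPollers:   4,   // pollers fetching activity tasks from the queue
    })
    w.RegisterWorkflow(ProcessTrackWorkflow)
    w.RegisterActivity(ProcessTrack)

    if err := w.Run(worker.InterruptCh()); err != nil {
        log.Fatalln("worker exited", err)
    }
}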

The best way to solve any of these problems is attempt to replicate. Once you can reliably replicate, you can begin the debugging process. This debugging process can then include logging, literally debugging in your IDE with breakpoints, etc. If you believe this is an SDK bug and you can provide a reliable reproducer, we may be able to help debug as well. I am afraid I have no other ideas as to why your worker is not picking up your activities.

I see. But I don't think we reached the maximum activities, since we are still only testing.

Are you saying we should reproduce it, then try debugging in the IDE and adding log points into Temporal's Go SDK? Because the only place left to add a log point is in the SDK, if we are blocked in one of the SDK functions.

Not sure if this is related, but we never encountered this when we were testing in one DC/region. We only encountered this issue when we tried testing with 2 regions.

The high-level architecture in this test: we set up 2 regions/DCs, then deployed a Temporal worker and Temporal server into a Kubernetes cluster in each. We set up Cassandra and ELK as one cluster spanning the two regions. Technically it should work like 1 DC, though performance could be affected depending on the between-region network latency. Below is an illustration of the test architecture.

A couple of questions:

  1. In your test, is the namespace you are running the workflow on a DC-local or a global namespace?
  2. Can't fully tell from the picture: do you have a single strongly-consistent DB that both your Temporal clusters connect to?
  3. Do you have server metrics enabled and Grafana dashboards set up?

  1. We created a namespace and let the Temporal worker connect to it. Does it default to a DC-local namespace? Need clarification: if our setup is that the Temporal worker and Temporal server are in different DCs, and both the Cassandra DB and ELK form one cluster across the 2 DCs with active replication, is this still considered 2 separate clusters?
  2. Yes, we have strong active-active Cassandra DB replication between the DCs. Maybe there are additional Cassandra parameters we should look into, in case we missed something?
  3. Currently we are not configuring metrics or Grafana.

Can every instance of a Temporal Server pod in Region 1 talk to every instance of a Temporal Server pod in Region 2? If not, then such a setup is not going to work.

I see. I think in our setup they won't be able to communicate with each other, because each region is a separate Kubernetes cluster. Is this the ringpop protocol implementation? Does it only matter that the Temporal server and Temporal worker are stateless?

Based on this discussion, what is the possible approach if we want to run Temporal in 2 DCs?
So far I see 2 options:

  1. Make the Temporal servers able to communicate between the DCs
  2. Use a global namespace, which is still experimental as of now

Is it possible to run 2 DCs in an active-passive manner? So when 1 DC goes down, we start up the Temporal server and Temporal worker in the other DC?

An additional question out of curiosity: could you help explain why this type of topology won't work? Our initial thought was that workflow state is stored in the persistence layer such as Cassandra, so the Temporal server is nearly stateless, taking tasks from Cassandra and putting them into the task queue, such that separate Temporal clusters could each manage their own workers while still sharing state.

Edit: tidied up and added an additional question

Is this the ringpop protocol implementation? Does it only matter that the Temporal server and Temporal worker are stateless?

See this presentation that explains how nodes communicate. It is not just ringpop. Any host should be able to call any other host directly by its address.

Hi Maxim, thanks for the reference presentation. Agreed that it is not only ringpop; ringpop is one way for a Temporal node to find other nodes and construct a ring cluster.

Looking at Temporal's mechanism, the Temporal server itself is not stateless, so every component in the Temporal server (matching, history, frontend) needs to be able to communicate with the others.

From what I see, the potential problem with our architecture is that, because the Temporal components cannot communicate between DCs, Temporal will have difficulties managing the task queues, for example handing transfer tasks from the history service to the matching service. Please correct me if I'm wrong.

So to make things work, either you:

  1. Make the Temporal servers in the two DCs able to communicate with each other, or
  2. Only run Temporal in one DC

Quick question: is it possible to do #1? If yes, how does Temporal actually discover services that reside in the other DC? Or is #1 only possible if you are doing a multi-cluster setup (async replication), which is still marked as experimental?

I am still roughly guessing that Temporal nodes discover each other by broadcasting on the local network, hence I think it is not possible to discover services outside of their own local network.

Is it possible to do #1? If yes, how does Temporal actually discover services that reside in the other DC?

Each service instance registers itself in the DB on startup, so there is no need for a broadcast. The only requirement is that they can reach each other by their IP addresses. So if you set up some routing layer between your DCs, it would work.

Ah ok, thanks for the insight, Maxim!

Let us do further checking on this part. If it is not doable, maybe we will go with 1 DC only.