Child workflow only initiated but not getting executed

tsanydzul · March 13, 2024, 6:55am

Hello, I have an intermittent issue: sometimes when a parent workflow tries to initiate a child workflow, the child gets stuck for some reason.

What every stuck child workflow (CWF) has in common is that only the initiated event is recorded and there is no event for execution, while for a successful execution I can see the CWF being initiated and then started.

Stuck CWF

Success CWF

In the stack trace of the stuck CWF I found this:

Any suggestions on how I can trace the issue? Or how can I terminate a CWF that has not started executing yet? It is stuck, and I cannot reset it since the CWF is still running.

tsanydzul · March 15, 2024, 3:14am

When I tried to describe the workflow, the request failed with a shard error.

maxim · March 15, 2024, 6:42pm

If you are getting this consistently, then your cluster/database might be broken.

Ahmad_Ricky_Nazarrud · March 20, 2024, 4:40pm

Do you have any suggestions on what we should do to fix it?

maxim · March 21, 2024, 12:59am

You should probably reinstall the cluster and make sure that the DB is fully consistent. Do you run the DB with some type of async replication?

Ahmad_Ricky_Nazarrud · March 25, 2024, 6:50am

I don't think we have any specific type of async replication.

What do you mean by reinstalling the cluster? Do we need to drop or truncate certain tables, or something else?
Also, will reinstalling the cluster impact the existing workflows that are running?

maxim · March 25, 2024, 6:00pm

I’m not sure if reinstalling will help if your DB loses records. Usually, this happens when some sort of replication is used, and it is not fully consistent. As I know nothing about your specific setup it is hard to give any recommendation.

Ahmad_Ricky_Nazarrud · March 25, 2024, 6:45pm

Here is an illustration of how our Cassandra is set up right now:

We already tried the Temporal tctl command "tctl admin shard close_shard", but it doesn't seem to solve the problem.
We also tried running Cassandra commands such as nodetool removenode, cleanup, and repair, but the problem is still there.

We captured this log from the Temporal history service:
{"level":"error","ts":"2024-02-02T02:54:57.744Z","msg":"Persistent store operation failure","service":"history","shard-id":32,"address":"10.59.68.74:7234","shard-item":"0xc00277ea00","store-operation":"update-shard","error":"Failed to update shard. previous_range_id: 30415, columns: (range_id=30414)","shard-range-id":30416,"previous-shard-range-id":30415,"logging-call-at":"context_impl.go:732"}
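
For anyone trying to read the log line above, here is a minimal Python sketch that pulls out the shard ID and the range IDs. The function name `summarize_update_shard_failure` and the output keys are made up for illustration; only the log line itself comes from this thread. Reading the error as "the stored range_id (30414) is behind the one the history host expected (30415)" is an interpretation of the error text, not something confirmed in this thread.

```python
import json
import re

# Log line copied from the history service output posted above.
LOG_LINE = (
    '{"level":"error","ts":"2024-02-02T02:54:57.744Z",'
    '"msg":"Persistent store operation failure","service":"history",'
    '"shard-id":32,"address":"10.59.68.74:7234","shard-item":"0xc00277ea00",'
    '"store-operation":"update-shard",'
    '"error":"Failed to update shard. previous_range_id: 30415, columns: (range_id=30414)",'
    '"shard-range-id":30416,"previous-shard-range-id":30415,'
    '"logging-call-at":"context_impl.go:732"}'
)

# Pattern for the error text seen in this thread; other server versions may word it differently.
ERROR_PATTERN = re.compile(r"previous_range_id: (\d+), columns: \(range_id=(\d+)\)")


def summarize_update_shard_failure(line: str) -> dict:
    """Extract the shard ID and range IDs from one update-shard error log line."""
    entry = json.loads(line)
    match = ERROR_PATTERN.search(entry["error"])
    if match is None:
        raise ValueError("error text does not match the expected update-shard format")
    expected = int(match.group(1))  # range_id the history host expected to find
    stored = int(match.group(2))    # range_id reported back from the database row
    return {
        "shard_id": entry["shard-id"],
        "expected_range_id": expected,
        "stored_range_id": stored,
        "new_range_id": entry["shard-range-id"],
        "stored_is_behind": stored < expected,
    }


summary = summarize_update_shard_failure(LOG_LINE)
print(summary)
```

If `stored_is_behind` is true, the database row appears to hold an older range_id than the one the host believes was last written, which would fit the suggestion earlier in the thread that the database is losing records.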

Ahmad_Ricky_Nazarrud · March 26, 2024, 5:55am

@maxim @tihomir

maxim · March 27, 2024, 3:27am

Sorry, I’m not sure why your Cassandra is inconsistent.

Ahmad_Ricky_Nazarrud · March 27, 2024, 11:06am

Do you have any other suggestions on what we should check, or what we should do?

maxim · March 27, 2024, 4:11pm

I recommend finding Cassandra experts who can troubleshoot inconsistent behavior.

Another option is to use Temporal Cloud and never care again about Temporal cluster and persistence management.

Ahmad_Ricky_Nazarrud · March 27, 2024, 8:36pm

Let's say I set up a new cluster. Is there any way to migrate the existing data from the current cluster to the new one without impacting the current workflow processes/state?

maxim · March 28, 2024, 5:03pm

Yes, you can migrate data using multi-cluster replication. But as your current cluster is corrupted, I’m not sure if this is going to work.

Ahmad_Ricky_Nazarrud · March 28, 2024, 5:33pm

Do you know what this tctl command does, and when we need to use it?

tctl admin shard close_shard

Ahmad_Ricky_Nazarrud · March 28, 2024, 7:44pm

Hi @maxim, @tihomir,
I tried running the tctl command to describe a shard by shard_id, but there is one shard_id whose update_time is not being updated like the other shards.

tctl admin shard describe

Do you know why this shard is not being updated?
If I want to block transactions from going to this shard, can I use the command tctl admin shard close_shard? That would mean new transactions after I close the shard should be directed to other shards.
Or is there any other way for me to isolate that shard?
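
One way to spot a shard whose update_time lags the rest is to compare each shard's timestamp against the most recent one. Below is a rough Python sketch, assuming the update times have already been collected into a dict keyed by shard ID (for example, by hand from the describe output). The helper name `find_stale_shards`, the one-hour threshold, and the sample shard IDs and timestamps are all made up for illustration and are not taken from this cluster.

```python
from datetime import datetime, timedelta, timezone


def find_stale_shards(update_times: dict, max_lag: timedelta) -> list:
    """Return shard IDs whose update_time trails the newest shard by more than max_lag."""
    newest = max(update_times.values())
    return sorted(
        shard_id
        for shard_id, updated in update_times.items()
        if newest - updated > max_lag
    )


# Illustrative values only: three hypothetical shards, one of which stopped updating.
reference = datetime(2024, 3, 28, 19, 0, tzinfo=timezone.utc)
example_update_times = {
    1: reference - timedelta(minutes=1),
    2: reference - timedelta(days=10),
    3: reference,
}

stale = find_stale_shards(example_update_times, max_lag=timedelta(hours=1))
print(stale)
```

This only flags candidates for a closer look; it does not say why a shard stopped updating.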

Ahmad_Ricky_Nazarrud · April 1, 2024, 3:56am

Hi @maxim , @tihomir
Would you mind clarifying this for me?
