Event corruption with child workflows? (Java SDK)

The code will work. The workflow will complete. But after around 6s you might see a WorkflowTaskTimeout (ScheduleToStart).

I see. Yes, I see the timeout. It happens because the 150 child workflows all try to start as fast as possible, and all of them compete with the workflow task for the shard lock.
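
For context, here is a minimal sketch of the fan-out pattern being described, not the poster's actual code: the `ParentWorkflow`/`ChildWorkflow` interfaces and the `"input-"` argument are hypothetical. Each async child start is recorded on the parent's event history, so all 150 starts contend with the parent's next workflow task on the same shard:

```java
import io.temporal.workflow.Async;
import io.temporal.workflow.Promise;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.util.ArrayList;
import java.util.List;

@WorkflowInterface
interface ChildWorkflow {
  @WorkflowMethod
  String run(String input);
}

@WorkflowInterface
interface ParentWorkflow {
  @WorkflowMethod
  void run();
}

class ParentWorkflowImpl implements ParentWorkflow {
  @Override
  public void run() {
    List<Promise<?>> results = new ArrayList<>();
    for (int i = 0; i < 150; i++) {
      // Each async start becomes a StartChildWorkflowExecution command
      // recorded on the parent's event history.
      ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
      results.add(Async.function(child::run, "input-" + i));
    }
    // Block the parent until every child has completed.
    Promise.allOf(results).get();
  }
}
```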

Okay, if this is by design then that is what it is. The problem is that in a busy system we are seeing this task time out with just 6 child workflows.

How many shards is your busy system provisioned with?

And when I said busy, it is busy only to the extent of fetching workflow history for the UI, not workflow execution. These are the configurations I have. We have a 3-node cluster for each service.

```yaml
dynamicConfig:
  matching.numTaskqueueReadPartitions:
    - value: 5
      constraints: {}
  matching.numTaskqueueWritePartitions:
    - value: 5
      constraints: {}
```

If I comment out the last line in my code, `Promise.allOf(results).get();`, I get the error with even 20 child workflows.

That is the task queue partitions configuration. What is the number of history shards?

```yaml
persistence:
  defaultStore: default
  visibilityStore: visibility
  numHistoryShards: 512
  datastores:
    default:
      cassandra:
        hosts: "127.0.0.1"
        keyspace: "temporal"
        user: "username"
        password: "password"
    visibility:
      cassandra:
        hosts: "127.0.0.1"
        keyspace: "temporal_visibility"
```

Here is the configuration reference.

It is 512.

Sorry if I misled you. What I meant was a testing environment, so it is not that busy.

I would recommend checking your DB update latency metrics. If the latency is high, then even a relatively small number of children can end up competing for the lock for a long time.
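
If server metrics are not already exposed, here is a minimal sketch of enabling the Prometheus endpoint in the server config, following the same config file shown above (the listen address is a placeholder):

```yaml
global:
  metrics:
    prometheus:
      timerType: "histogram"
      listenAddress: "127.0.0.1:8000"
```

With that in place, the persistence latency histogram (`persistence_latency`, tagged by operation, if I recall the metric name correctly) should show whether update operations against Cassandra are slow.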


Sure, yes, I will have to check that. That could be a reason for the slow processing of the queues.