Event corruption with child workflows? (Java SDK)

The code will work. The workflow will complete. But after around 6s you might see a WorkflowTaskTimeout (ScheduleToStart).

I see. Yes, I see the timeout. It happens because the 150 child workflows all try to start as fast as possible, and all of them compete with the workflow task for the shard lock.
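
For context, here is a minimal sketch of the fan-out pattern being described, not the poster's actual code: the `ParentWorkflow`/`ChildWorkflow` interfaces and the `"input-"` argument are hypothetical. Each async child start is recorded on the parent's event history, so all 150 starts contend with the parent's next workflow task on the same shard:

```java
import io.temporal.workflow.Async;
import io.temporal.workflow.Promise;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.util.ArrayList;
import java.util.List;

@WorkflowInterface
interface ChildWorkflow {
  @WorkflowMethod
  String run(String input);
}

@WorkflowInterface
interface ParentWorkflow {
  @WorkflowMethod
  void run();
}

class ParentWorkflowImpl implements ParentWorkflow {
  @Override
  public void run() {
    List<Promise<?>> results = new ArrayList<>();
    for (int i = 0; i < 150; i++) {
      // Each async start becomes a StartChildWorkflowExecution command
      // recorded on the parent's event history.
      ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
      results.add(Async.function(child::run, "input-" + i));
    }
    // Block the parent until every child has completed.
    Promise.allOf(results).get();
  }
}
```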

Okay, if this is by design then that is what it is. The problem is that in a busy system we are seeing this task time out with just 6 child workflows.

How many shards is your busy system provisioned with?

And when I said busy, it is busy only to the extent of fetching workflow history for the UI, not workflow execution. These are the configurations I have. We have a 3-node cluster for each service.

```yaml
dynamicConfig:
  matching.numTaskqueueReadPartitions:
    - value: 5
      constraints: {}
  matching.numTaskqueueWritePartitions:
    - value: 5
      constraints: {}
```

If I comment out the last line in my code, `Promise.allOf(results).get();`, I get the error with even 20 child workflows.

That is the task queue partitions configuration. What is the number of history shards?

```yaml
persistence:
  defaultStore: default
  visibilityStore: visibility
  numHistoryShards: 512
  datastores:
    default:
      cassandra:
        hosts: "127.0.0.1"
        keyspace: "temporal"
        user: "username"
        password: "password"
    visibility:
      cassandra:
        hosts: "127.0.0.1"
        keyspace: "temporal_visibility"
```

Here is the configuration reference.

It is 512.

Sorry if I misled you. What I meant was a testing environment, so it is not that busy.

I would recommend checking your DB update latency metrics. If the latency is high, then even a relatively small number of children can end up competing for the lock for a long time.
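
If server metrics are not already exposed, here is a minimal sketch of enabling the Prometheus endpoint in the server config, following the same config file shown above (the listen address is a placeholder):

```yaml
global:
  metrics:
    prometheus:
      timerType: "histogram"
      listenAddress: "127.0.0.1:8000"
```

With that in place, the persistence latency histogram (`persistence_latency`, tagged by operation, if I recall the metric name correctly) should show whether update operations against Cassandra are slow.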


Sure, yes, I will have to check that. That could be a reason for the slow processing of the queues.