Performance of SignalWithStartWorkflow()

I’m trying to perform a stress test, and one part of it is scheduling a large number of workflows.
Right now, in a (3x Cassandra + 3x Temporal + workers) configuration I was able to reach ~80 rps per node for client.SignalWithStartWorkflow() with the Go SDK.
So the total number of workflows created is below 240 per second.

All workflows are created on a single task queue.
I run the image temporalio/auto-setup:1.3.2 with the NUM_HISTORY_SHARDS=512 env variable and this dynamic config:

matching.numTaskqueueReadPartitions:
- value: 10
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 10
  constraints: {}

The Cassandra cluster doesn’t seem to be overloaded. The workers are able to keep up with the load.
Can I somehow improve the rate per individual Temporal server?

> docker exec -ti temporal  tctl --ad 192.168.5.7:7233 taskqueue list-partition --tq EV_SENDER_TASK_QUEUE
   WORKFLOWTASKQUEUEPARTITION  |       HOST        
  EV_SENDER_TASK_QUEUE         | 192.168.5.9:7235  
  /_sys/EV_SENDER_TASK_QUEUE/1 | 192.168.5.9:7235  
  /_sys/EV_SENDER_TASK_QUEUE/2 | 192.168.5.8:7235  
  /_sys/EV_SENDER_TASK_QUEUE/3 | 192.168.5.7:7235  
  /_sys/EV_SENDER_TASK_QUEUE/4 | 192.168.5.8:7235  
  /_sys/EV_SENDER_TASK_QUEUE/5 | 192.168.5.8:7235  
  /_sys/EV_SENDER_TASK_QUEUE/6 | 192.168.5.9:7235  
  /_sys/EV_SENDER_TASK_QUEUE/7 | 192.168.5.9:7235  
  /_sys/EV_SENDER_TASK_QUEUE/8 | 192.168.5.7:7235  
  /_sys/EV_SENDER_TASK_QUEUE/9 | 192.168.5.8:7235  
   ACTIVITYTASKQUEUEPARTITION  |       HOST        
  EV_SENDER_TASK_QUEUE         | 192.168.5.9:7235  
  /_sys/EV_SENDER_TASK_QUEUE/1 | 192.168.5.9:7235  
  /_sys/EV_SENDER_TASK_QUEUE/2 | 192.168.5.8:7235  
  /_sys/EV_SENDER_TASK_QUEUE/3 | 192.168.5.7:7235  
  /_sys/EV_SENDER_TASK_QUEUE/4 | 192.168.5.8:7235  
  /_sys/EV_SENDER_TASK_QUEUE/5 | 192.168.5.8:7235  
  /_sys/EV_SENDER_TASK_QUEUE/6 | 192.168.5.9:7235  
  /_sys/EV_SENDER_TASK_QUEUE/7 | 192.168.5.9:7235  
  /_sys/EV_SENDER_TASK_QUEUE/8 | 192.168.5.7:7235  
  /_sys/EV_SENDER_TASK_QUEUE/9 | 192.168.5.8:7235  

and

> docker exec -ti temporal  tctl --ad 192.168.5.7:7233 adm tq desc --taskqueue EV_SENDER_TASK_QUEUE                                                                  
  READ LEVEL | ACK LEVEL | BACKLOG | LEASE START TASKID | LEASE END TASKID  
       34500 |     34500 |       0 |                  1 |           100000  

  WORKFLOW POLLER IDENTITY |   LAST ACCESS TIME    
  27472@p-10-web@   | 2020-12-02T15:34:03Z  
  25445@p-12-web@   | 2020-12-02T15:34:03Z  
  28313@p-11-web@   | 2020-12-02T15:34:03Z  
  25665@p-12-web@   | 2020-12-02T15:34:03Z  
  28203@p-11-web@   | 2020-12-02T15:34:03Z  
  27363@p-10-web@   | 2020-12-02T15:34:03Z  
  27582@p-10-web@   | 2020-12-02T15:34:03Z  
  25555@p-12-web@   | 2020-12-02T15:34:03Z  
  28093@p-11-web@   | 2020-12-02T15:34:03Z  

My goal is the ability to handle spikes of 1000 new workflows per second.

Thanks!


Hey Sergle,

Just to clarify: you want the client.SignalWithStartWorkflow() API call to succeed at up to 1000 requests per second across the entire cluster, but you are currently only achieving 240 requests per second. Is this correct?

Can you share / describe the code that is invoking SignalWithStartWorkflow in your benchmark scenario?

Thanks,
Manu

A few questions:

  1. how many shards? ref: https://github.com/temporalio/temporal/blob/master/config/development.yaml#L4
  2. how many workflows are being signaled & started concurrently?

Hi Manu,

It looks something like this. I started one load-generating app for each Temporal server (3 in my setup).

// measure actual requests per second
var cnt uint64
go func() {
    var prev_c uint64
    for {
        time.Sleep(1 * time.Minute)
        log.Printf("Send RPS: %.2f (%d -> %d)\n", float64(cnt-prev_c)/60.0, prev_c, cnt)
        prev_c = cnt
    }
}()

// import "go.uber.org/ratelimit"
rl := ratelimit.New(RPS)
log.Printf("Generating load with RPS %d\n", RPS)

options := client.StartWorkflowOptions{
    TaskQueue:           "EV_SENDER_TASK_QUEUE",
    WorkflowTaskTimeout: 10 * time.Minute,
}

for {
    if finished {
        break
    }

    event := app.EventDetails{
        // ... payload ...
    }

    // workflow ID range, e.g. 0 - 50000
    for i := min_wf; i < max_wf; i++ {
        rl.Take()

        if finished {
            break
        }

        options.ID = fmt.Sprintf("event-sender-%06d", i)
        _, err := c.SignalWithStartWorkflow(context.Background(), options.ID, "SIGNAL_1", event, options, WF_TYPE)
        if err != nil {
            log.Fatalf("error sending signal to workflow %s: %v", options.ID, err)
        }
        cnt++
    }
}

512 shards.
Most workflows are started, immediately processed, and completed; the number of workflows in the open state is low (up to 40).

Hey Sergle,

So each instance of the App (of which there are 3 instances) is sending SignalWithStart sequentially in a for-loop, correct?
Can you try parallelizing the sends across multiple goroutines within a single instance of the App?
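
Something along these lines could work. This is only a rough sketch that reuses c, options, event, rl, cnt, min_wf, max_wf and WF_TYPE from your snippet above; numWorkers and the ids channel are just illustrative names.

// Rough sketch: fan the SignalWithStartWorkflow calls out over several goroutines.
// Reuses c, options, event, rl, cnt, min_wf, max_wf and WF_TYPE from the snippet above;
// needs "sync" and "sync/atomic" in addition to the imports already used there.
ids := make(chan int)             // workflow indexes to signal-with-start
var wg sync.WaitGroup

for w := 0; w < numWorkers; w++ { // e.g. numWorkers = 30
    wg.Add(1)
    go func() {
        defer wg.Done()
        for i := range ids {
            opts := options // copy, so each goroutine sets its own workflow ID
            opts.ID = fmt.Sprintf("event-sender-%06d", i)
            _, err := c.SignalWithStartWorkflow(context.Background(), opts.ID, "SIGNAL_1", event, opts, WF_TYPE)
            if err != nil {
                log.Printf("error signaling workflow %s: %v", opts.ID, err)
            }
            atomic.AddUint64(&cnt, 1) // counter is now shared across goroutines
        }
    }()
}

for i := min_wf; i < max_wf; i++ {
    rl.Take() // keep the rate limiter if you still want an upper cap
    ids <- i
}
close(ids)
wg.Wait()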

One more thing: please use atomic operations for cnt, since in the code sample cnt is accessed by multiple goroutines.
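
For example, a sketch of just the counter part, assuming the increments happen in the sender goroutines:

// import "sync/atomic"
var cnt uint64

// reporter goroutine: read the counter atomically
go func() {
    var prev uint64
    for {
        time.Sleep(1 * time.Minute)
        cur := atomic.LoadUint64(&cnt)
        log.Printf("Send RPS: %.2f (%d -> %d)", float64(cur-prev)/60.0, prev, cur)
        prev = cur
    }
}()

// in each sender goroutine, instead of cnt++:
atomic.AddUint64(&cnt, 1)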

Thank you, with goroutines it’s faster. I observed 950 rps with 30 goroutines.
I uploaded it here: https://github.com/sergle/stress_test/blob/master/generator/main.go

When I have several Temporal servers, should I distribute the load between them, or is that not required?

When the task queue is partitioned across several Temporal instances:

> docker exec -ti temporal  tctl --ad 192.168.5.7:7233 taskqueue list-partition --tq EV_SENDER_TASK_QUEUE
   WORKFLOWTASKQUEUEPARTITION   |       HOST         
  EV_SENDER_TASK_QUEUE          | 192.168.5.9:7235   
  /_sys/EV_SENDER_TASK_QUEUE/1  | 192.168.5.9:7235   
  /_sys/EV_SENDER_TASK_QUEUE/2  | 192.168.5.8:7235   
  /_sys/EV_SENDER_TASK_QUEUE/3  | 192.168.5.7:7235   
...

Should I start worker processes for each Temporal instance (192.168.5.7, 192.168.5.8, 192.168.5.9), or will they distribute the workflows across any connected workers?

Thanks!

The server side uses the namespace & workflow ID for load balancing, so you are fine as long as the load targets different workflow IDs.

The load balancing of the task queue happens at the Temporal frontend, so no additional step is required.
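
In other words, each generator or worker process just needs one frontend address (or a load balancer in front of all of them). A sketch, with the host/port taken from this thread:

// import "go.temporal.io/sdk/client" and "go.temporal.io/sdk/worker"
// The host/port below is just one of the frontend addresses from this thread;
// a load balancer address in front of all frontends works the same way.
c, err := client.NewClient(client.Options{
    HostPort:  "192.168.5.7:7233",
    Namespace: "default",
})
if err != nil {
    log.Fatalf("unable to create Temporal client: %v", err)
}
defer c.Close()

// A worker polls the task queue through whichever frontend it is connected to;
// the matching service spreads the task queue partitions across hosts itself.
w := worker.New(c, "EV_SENDER_TASK_QUEUE", worker.Options{})
// w.RegisterWorkflow(...) / w.RegisterActivity(...) and w.Run(...) as usual.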

Now that the required rate of creating new workflows is reached, I observed that the execution rate dropped drastically. How can I find out where the bottleneck is?

Execution rate meaning the execution rate on the SDK side?

  1. Check the DB CPU / memory to see whether the DB itself is overloaded (also check the persistence metrics: https://github.com/temporalio/temporal/blob/v1.3.2/common/metrics/defs.go#L1075-L1160)

  2. Check whether the server is spending too much time processing tasks (https://github.com/temporalio/temporal/blob/v1.3.2/common/metrics/defs.go#L2108-L2109)

  3. Sample the workflow history, looking for workflow task timeouts

#3 may be the cause, e.g. too many signals to a single workflow overloading the worker (SDK); see the sketch below for one way to check a workflow's history for timeouts.
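
A rough sketch of how that check could look with the Go SDK's GetWorkflowHistory; countWorkflowTaskTimeouts and the example workflow ID are illustrative names, not SDK functions:

// Sketch: scan one workflow's history for WorkflowTaskTimedOut events.
// import enumspb "go.temporal.io/api/enums/v1" and "go.temporal.io/sdk/client"
func countWorkflowTaskTimeouts(c client.Client, workflowID string) (int, error) {
    timeouts := 0
    iter := c.GetWorkflowHistory(context.Background(), workflowID, "", false,
        enumspb.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT)
    for iter.HasNext() {
        event, err := iter.Next()
        if err != nil {
            return timeouts, err
        }
        if event.GetEventType() == enumspb.EVENT_TYPE_WORKFLOW_TASK_TIMED_OUT {
            timeouts++
        }
    }
    return timeouts, nil
}

// e.g. n, err := countWorkflowTaskTimeouts(c, "event-sender-000042")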