sergle
December 2, 2020, 3:36pm
1
I’m trying to perform a stress test, and one part of it is scheduling a large number of workflows.
Right now, in a (3x Cassandra + 3x Temporal + workers) configuration I was able to reach ~80 rps per node for client.SignalWithStartWorkflow() with the Go SDK.
So the total number of workflows created is below 240 per second.
All workflows are created on a single task queue.
I run the temporalio/auto-setup:1.3.2 image with the NUM_HISTORY_SHARDS=512 env variable and this dynamic config:
matching.numTaskqueueReadPartitions:
  - value: 10
    constraints: {}
matching.numTaskqueueWritePartitions:
  - value: 10
    constraints: {}
The Cassandra cluster doesn’t seem to be overloaded, and the workers are able to keep up with the load.
Can I somehow improve the rate per individual Temporal server?
> docker exec -ti temporal tctl --ad 192.168.5.7:7233 taskqueue list-partition --tq EV_SENDER_TASK_QUEUE
WORKFLOWTASKQUEUEPARTITION | HOST
EV_SENDER_TASK_QUEUE | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/1 | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/2 | 192.168.5.8:7235
/_sys/EV_SENDER_TASK_QUEUE/3 | 192.168.5.7:7235
/_sys/EV_SENDER_TASK_QUEUE/4 | 192.168.5.8:7235
/_sys/EV_SENDER_TASK_QUEUE/5 | 192.168.5.8:7235
/_sys/EV_SENDER_TASK_QUEUE/6 | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/7 | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/8 | 192.168.5.7:7235
/_sys/EV_SENDER_TASK_QUEUE/9 | 192.168.5.8:7235
ACTIVITYTASKQUEUEPARTITION | HOST
EV_SENDER_TASK_QUEUE | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/1 | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/2 | 192.168.5.8:7235
/_sys/EV_SENDER_TASK_QUEUE/3 | 192.168.5.7:7235
/_sys/EV_SENDER_TASK_QUEUE/4 | 192.168.5.8:7235
/_sys/EV_SENDER_TASK_QUEUE/5 | 192.168.5.8:7235
/_sys/EV_SENDER_TASK_QUEUE/6 | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/7 | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/8 | 192.168.5.7:7235
/_sys/EV_SENDER_TASK_QUEUE/9 | 192.168.5.8:7235
and
> docker exec -ti temporal tctl --ad 192.168.5.7:7233 adm tq desc --taskqueue EV_SENDER_TASK_QUEUE
READ LEVEL | ACK LEVEL | BACKLOG | LEASE START TASKID | LEASE END TASKID
34500 | 34500 | 0 | 1 | 100000
WORKFLOW POLLER IDENTITY | LAST ACCESS TIME
27472@p-10-web@ | 2020-12-02T15:34:03Z
25445@p-12-web@ | 2020-12-02T15:34:03Z
28313@p-11-web@ | 2020-12-02T15:34:03Z
25665@p-12-web@ | 2020-12-02T15:34:03Z
28203@p-11-web@ | 2020-12-02T15:34:03Z
27363@p-10-web@ | 2020-12-02T15:34:03Z
27582@p-10-web@ | 2020-12-02T15:34:03Z
25555@p-12-web@ | 2020-12-02T15:34:03Z
28093@p-11-web@ | 2020-12-02T15:34:03Z
My goal is to be able to process spikes of 1000 new workflows per second.
Thanks!
manu
December 2, 2020, 5:54pm
3
Hey Sergle,
Just to clarify: you want the API call client.SignalWithStartWorkflow() to succeed at up to 1000 requests per second across the entire cluster, but you are currently only achieving 240 requests per second. Is this correct?
Can you share / describe the code that is invoking SignalWithStartWorkflow in your benchmark scenario?
Thanks,
Manu
sergle
December 2, 2020, 7:08pm
5
Hi, Manu
It looks something like this. I started one load-generating app for each Temporal server (3 in my setup).
// measure actual requests per second
var cnt uint64
go func() {
    var prev_c uint64
    for {
        time.Sleep(1 * time.Minute)
        log.Printf("Send RPS: %.2f (%d -> %d)\n", float64(cnt-prev_c)/60.0, prev_c, cnt)
        prev_c = cnt
    }
}()

// import "go.uber.org/ratelimit"
rl := ratelimit.New(RPS)
log.Printf("Generating load with RPS %d\n", RPS)

options := client.StartWorkflowOptions{
    TaskQueue:           "EV_SENDER_TASK_QUEUE",
    WorkflowTaskTimeout: 10 * time.Minute,
}

for {
    if finished {
        break
    }
    event := app.EventDetails{
        // ... payload ...
    }
    // range, like 0 - 50000
    for i := min_wf; i < max_wf; i++ {
        rl.Take()
        if finished {
            break
        }
        options.ID = fmt.Sprintf("event-sender-%06d", i)
        _, err := c.SignalWithStartWorkflow(context.Background(), options.ID, "SIGNAL_1", event, options, WF_TYPE)
        if err != nil {
            log.Fatalf("error sending signal to %s workflow: %v", options.ID, err)
        }
        cnt++
    }
}
sergle
December 2, 2020, 7:11pm
6
512 shards
Most workflows are started, immediately processed, and completed - the number of workflows in the open state is low (up to 40).
manu
December 2, 2020, 7:28pm
7
Hey Sergle,
So each instance of the App (of which there are 3 instances) is sending SignalWithStart sequentially in a for-loop, correct?
Can you try parallelizing the sends across multiple go-routines within a single instance of the App?
sergle:
cnt
One more thing: please use atomic operations with cnt, since in the code sample cnt is accessed by multiple goroutines.
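A rough sketch of what that could look like, reusing the c client, event payload, WF_TYPE, and min_wf/max_wf range from your snippet (the goroutine count of 30 is arbitrary; this additionally imports "sync" and "sync/atomic"):
// Sketch only: fan the SignalWithStartWorkflow calls out over N goroutines
// and count successes with atomic operations.
var cnt uint64
const numSenders = 30 // arbitrary; tune toward your target RPS

ids := make(chan int)
var wg sync.WaitGroup

for w := 0; w < numSenders; w++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for i := range ids {
            workflowID := fmt.Sprintf("event-sender-%06d", i)
            opts := client.StartWorkflowOptions{
                ID:        workflowID,
                TaskQueue: "EV_SENDER_TASK_QUEUE",
            }
            _, err := c.SignalWithStartWorkflow(context.Background(), workflowID, "SIGNAL_1", event, opts, WF_TYPE)
            if err != nil {
                log.Printf("error sending signal to %s workflow: %v", workflowID, err)
                continue
            }
            atomic.AddUint64(&cnt, 1) // read elsewhere with atomic.LoadUint64(&cnt)
        }
    }()
}

// feed the workflow IDs, then wait for the senders to drain
for i := min_wf; i < max_wf; i++ {
    ids <- i
}
close(ids)
wg.Wait()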
sergle
December 3, 2020, 2:17pm
9
Thank you, with goroutines it’s faster. I observed 950 rps with 30 threads.
I uploaded it here: https://github.com/sergle/stress_test/blob/master/generator/main.go
When I have several Temporal servers, should I distribute the load between them, or is that not required?
sergle
December 3, 2020, 3:32pm
10
When the task queue is partitioned between several Temporal instances:
> docker exec -ti temporal tctl --ad 192.168.5.7:7233 taskqueue list-partition --tq EV_SENDER_TASK_QUEUE
WORKFLOWTASKQUEUEPARTITION | HOST
EV_SENDER_TASK_QUEUE | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/1 | 192.168.5.9:7235
/_sys/EV_SENDER_TASK_QUEUE/2 | 192.168.5.8:7235
/_sys/EV_SENDER_TASK_QUEUE/3 | 192.168.5.7:7235
...
Should I start worker processes for each Temporal instance (192.168.5.7, 192.168.5.8, 192.168.5.9), or will the workflows be distributed between any connected workers?
Thanks!
The server side uses the namespace & workflow ID for load balancing, so it is fine as long as the load targets different workflows.
sergle:
Should I start worker processes for each Temporal instance (192.168.5.7, 192.168.5.8, 192.168.5.9), or they would distribute the workflows between any connected worker?
The load balancing of the task queue happens at the Temporal frontend, so no additional step is required.
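In other words, a worker only needs a client pointed at one frontend address (or at a load balancer in front of the frontends); it does not have to connect to every Temporal instance. A minimal sketch, with EventSenderWorkflow as a placeholder workflow function:
// import "go.temporal.io/sdk/client"
// import "go.temporal.io/sdk/worker"
c, err := client.NewClient(client.Options{
    HostPort: "192.168.5.7:7233", // any frontend, or a load balancer address
})
if err != nil {
    log.Fatalf("unable to create Temporal client: %v", err)
}
defer c.Close()

w := worker.New(c, "EV_SENDER_TASK_QUEUE", worker.Options{})
w.RegisterWorkflow(EventSenderWorkflow) // placeholder workflow function
if err := w.Run(worker.InterruptCh()); err != nil {
    log.Fatalf("unable to start worker: %v", err)
}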
sergle
December 3, 2020, 8:11pm
13
Now that the required rate of creating new workflows is reached, I observe that the execution rate drops drastically. How can I find out where the bottleneck is?
sergle:
execution rate
Meaning the execution rate on the SDK side?
1. Check the DB CPU / mem, and whether the DB itself is overloaded (also the persistence metrics: https://github.com/temporalio/temporal/blob/v1.3.2/common/metrics/defs.go#L1075-L1160)
2. Check if the server is taking too much time processing tasks (https://github.com/temporalio/temporal/blob/v1.3.2/common/metrics/defs.go#L2108-L2109)
3. Sample the workflow history, looking for workflow task timeouts (see the sketch below)
#3 may be the cause, e.g. too many signals to a workflow overloading the worker (SDK).
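For #3, a rough sketch of how to scan one workflow’s history for WorkflowTaskTimedOut events with the Go SDK (the workflow ID is just an example from your generator; an empty run ID means the latest run):
// import enumspb "go.temporal.io/api/enums/v1"
// Sketch: iterate over one workflow's history and count workflow task timeouts.
iter := c.GetWorkflowHistory(context.Background(), "event-sender-000001", "", false,
    enumspb.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT)

timeouts := 0
for iter.HasNext() {
    event, err := iter.Next()
    if err != nil {
        log.Fatalf("failed to read history: %v", err)
    }
    if event.GetEventType() == enumspb.EVENT_TYPE_WORKFLOW_TASK_TIMED_OUT {
        timeouts++
    }
}
log.Printf("workflow task timeouts: %d", timeouts)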