Writing stress tests for Temporal

I’m working on hardening our Temporal deployments and want to stress test Temporal to get more familiar with its behavior, find worthwhile metrics to alert on, and so on. While looking around for prior work, I found the blog post about the stress testing Temporal does internally, as well as this post about Temporal as a load testing tool.

I’d like to build something similar - something that the community could potentially use for themselves if there’s interest. I’ve already written a mirror of the “rabbit” scenario, which works great so far.

I could use some input on the “reactor” scenario, though, specifically the way I’m approaching things:

I have an application that will run the load tests, kicked off on demand by a gRPC request that looks something like the following:

message StartReactorLoadTestRequest {
  string id = 1;
  map<string, int32> dimensions = 2;
  google.protobuf.Duration maxDuration = 3;
}

When this request is received, a ReactorWorkflow is started, which first creates N ReactorCellWorkflow child workflows (same purpose as the “Stats Aggregator Workflow” from the blog). The ReactorWorkflow then creates a series of different UseCaseWorkflows according to the dimensions request property, where the key is the name of the UseCaseWorkflow class and the value is the number of those workflows to create. The idea here is that over time, we’ll be able to build a load test profile that is diverse and similar to the varied workloads we see in production.

Unlike the blog, the load test will not complete based on the counter values, but simply on test duration. Instead, I’d like the counter values to tell the ReactorWorkflow how many UseCaseWorkflows to start to cover the deficit, so that the desired total number of use case workflows remains roughly constant.
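To make the deficit idea concrete, here is a minimal sketch of the calculation, assuming the ReactorWorkflow can obtain a per-type count of running use case workflows from the counters. The class name UseCaseDeficit, the method compute, and the workflow type names in the usage note are hypothetical; nothing here touches the Temporal SDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: works out how many use case workflows of each type
// need to be started to get back to the requested "dimensions".
final class UseCaseDeficit {
    private UseCaseDeficit() {}

    // desired: type name -> target count (the "dimensions" map from the request).
    // running: type name -> count currently running (assumed to come from the counters).
    // Returns only the types that are below target, with the number still missing.
    static Map<String, Integer> compute(Map<String, Integer> desired, Map<String, Integer> running) {
        Map<String, Integer> toStart = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : desired.entrySet()) {
            int current = running.getOrDefault(entry.getKey(), 0);
            int missing = entry.getValue() - current;
            if (missing > 0) {
                toStart.put(entry.getKey(), missing);
            }
        }
        return toStart;
    }
}
```

With a target of 100 "OrderWorkflow" and 20 "SignupWorkflow", and 93 and 25 currently running, this would ask for 7 more "OrderWorkflow" and nothing for "SignupWorkflow". Since the function is pure and iterates in the order of the desired map, it should be safe to call from workflow code as long as that map has a stable iteration order.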

I have two immediate implementation questions:

  1. The UseCaseWorkflows need to select a ReactorCellWorkflow at random. Presumably, this would be done via an activity, where I’d call WorkflowServiceStubs.listOpenWorkflowExecutions, filtering on the ReactorCellWorkflow type when the use case workflow starts?
  2. I want a nice “abort” button for a specific test. Since the UseCaseWorkflows are detached from the reactor workflows, submitting a single cancellation on the ReactorWorkflow won’t clean everything up. What would be the best way to clean up the use case workflows?

As a follow-up question: Does this all make sense, and do you have any recommendations or things I should consider before proceeding further?

  1. I would use the workflow ID to select a random workflow. For example, you can give your workflows IDs from 1 to 100 and let the client choose one at random.
  2. Since the IDs are known, the workflow can have an activity that iterates over the IDs and calls terminate on each of them.
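A rough sketch of both suggestions, assuming workflow IDs are derived from the test id plus an index. The ID format, the class ReactorCellIds, and the terminator callback are all hypothetical; the callback stands in for whatever Temporal client call the activity would use to terminate a workflow by ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

// Hypothetical helper built around predictable workflow IDs.
final class ReactorCellIds {
    private ReactorCellIds() {}

    // Assumed ID scheme: "<testId>/reactor-cell/<index>", with index starting at 1.
    static String cellId(String testId, int index) {
        return testId + "/reactor-cell/" + index;
    }

    // All cell IDs for a test, in index order.
    static List<String> allCellIds(String testId, int cellCount) {
        List<String> ids = new ArrayList<>();
        for (int i = 1; i <= cellCount; i++) {
            ids.add(cellId(testId, i));
        }
        return ids;
    }

    // Picks one cell ID uniformly at random; no visibility query is needed.
    static String pickCell(String testId, int cellCount, Random random) {
        if (cellCount <= 0) {
            throw new IllegalArgumentException("cellCount must be positive");
        }
        return cellId(testId, 1 + random.nextInt(cellCount));
    }

    // Calls the terminator once per ID and returns how many calls succeeded.
    // A failure (for example, a workflow that has already closed) is skipped
    // so that one bad ID does not stop the cleanup.
    static int terminateAll(List<String> workflowIds, Consumer<String> terminator) {
        int terminated = 0;
        for (String id : workflowIds) {
            try {
                terminator.accept(id);
                terminated++;
            } catch (RuntimeException e) {
                // Assumption: failures are safe to ignore during an abort.
            }
        }
        return terminated;
    }
}
```

The same pattern would apply to the use case workflows if their IDs follow a similar scheme. If the random pick happens inside workflow code rather than in the client or an activity, the Random passed in would need to be a deterministic one, such as the one returned by Workflow.newRandom() in the Java SDK.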

Oh, the simplicity. Makes sense, thanks!