PHP - weird behaviour while load testing

Hi,
I have an issue with PHP while running some load tests. I have a standard setup with RoadRunner workers (exactly 4 of them). Temporal runs from the auto-setup Docker container, with a MySQL backend and Elasticsearch.
My load testing happens in the following way:

  1. I have a script which inserts 10000 workflows called FORK (a parent workflow which creates exactly one child workflow)
  2. the child workflows are empty (I have some code for them to run an activity, but since I found some troubles while testing, I’m trying to find a minimum working example)
  3. while the script from 1) is executing, I’m running 4 temporal workers (wrapped in symfony but that shouldn’t be an issue)

There are no timers in any of the workflows, they should execute immediately. What I’m observing are two things:

  1. workflow processing on the workers sometimes hangs for two minutes or more, and I have no idea why; I see hundreds of incomplete workflows in the UI while the workers do nothing
  2. only one of the configured workers seems to do some processing, the other 3 are waiting/idling

Watching the CPU utilisation of the containers: when the worker is working as expected, it takes a normal amount of CPU and the temporal container (deployed from the auto setup) takes up to 2 CPU cores. When the worker is idling, the temporal container sits at about 6% CPU. From my point of view it seems that the temporal worker is waiting for something, but I have no idea what it could be, since the UI shows many running workflows and none of them is being scheduled…

This is the Fork workflow code:

final class DefaultForkingWorkflow implements ForkingWorkflow
{
    /**
     * @return Generator<int, CancellationScopeInterface, mixed, void>
     */
    // @todo remove namespace dependency https://github.com/temporalio/sdk-php/issues/104
    public function fork(Event $event, HandlerDefinitions $definitions, string $namespace)
    {
        //yield
        Workflow::asyncDetached(static function () use ($event, $definitions, $namespace) {
            $promises = [];

            foreach ($definitions as $definition) {
                $options = ChildWorkflowOptions::new()
                    ->withWorkflowTaskTimeout(CarbonInterval::seconds(10))
                    ->withParentClosePolicy(ParentClosePolicy::POLICY_ABANDON)
                    ->withNamespace($namespace)
                    ->withSearchAttributes([
                        SearchAttributes::ATTR_EVENT_ID => $event->getMetadata(Id::class)->toString(),
                        SearchAttributes::ATTR_HANDLER_ID => $definition->getServiceId(),
                        SearchAttributes::ATTR_HANDLER_METHOD => $definition->getHandlerMethod(),
                    ]);
                $handlerWorkflow = Workflow::newChildWorkflowStub(
                    HandlerWorkflow::class,
                    $options
                );

                $promises[] = Workflow::asyncDetached(
                    static fn () => /*yield*/ $handlerWorkflow->handle($event, $definition)
                );
            }

            //return yield
            Promise::all($promises);
        });
    }
}

I have also tried the yielding version (I’m not sure what the difference is, because both behave quite similarly):

final class DefaultForkingWorkflow implements ForkingWorkflow
{
    /**
     * @return Generator<int, CancellationScopeInterface, mixed, void>
     */
    // @todo remove namespace dependency https://github.com/temporalio/sdk-php/issues/104
    public function fork(Event $event, HandlerDefinitions $definitions, string $namespace)
    {
        yield Workflow::asyncDetached(static function () use ($event, $definitions, $namespace) {
            $promises = [];

            foreach ($definitions as $definition) {
                $options = ChildWorkflowOptions::new()
                    ->withWorkflowTaskTimeout(CarbonInterval::seconds(10))
                    ->withParentClosePolicy(ParentClosePolicy::POLICY_ABANDON)
                    ->withNamespace($namespace)
                    ->withSearchAttributes([
                        SearchAttributes::ATTR_EVENT_ID => $event->getMetadata(Id::class)->toString(),
                        SearchAttributes::ATTR_HANDLER_ID => $definition->getServiceId(),
                        SearchAttributes::ATTR_HANDLER_METHOD => $definition->getHandlerMethod(),
                    ]);
                $handlerWorkflow = Workflow::newChildWorkflowStub(
                    HandlerWorkflow::class,
                    $options
                );

                $promises[] = Workflow::asyncDetached(
                    static fn () => yield $handlerWorkflow->handle($event, $definition)
                );
            }

            return yield Promise::all($promises);
        });
    }
}

This is my .rr.yaml:

rpc:
  listen: tcp://127.0.0.1:6001

server:
  command: "/srv/www/src/LoadTest/bin/console event-bus:temporal-worker:run"

temporal:
  address: ${TEMPORAL_CLI_ADDRESS}
  namespace: ${TEMPORAL_NAMESPACE}
  activities:
    num_workers: 4
    allocate_timeout: 10s
    destroy_timeout: 20s

logs:
  mode: production
  level: debug
  output: stderr
  channels:
    temporal:
      level: debug
    informer:
      mode: production

All the testing is executed locally in docker containers.

I’m not really sure what the problem might be, could someone help me please?

Thank you.

How is your Temporal server setup? The docker-compose based Temporal setup is not intended for any load testing.

It is the standard auto setup… With mysql and elastic… It is not really load testing I suppose, because it’s just 10000 messages. I’m doing this with the auto setup because in the real deployment I observed similar behavior, so I wanted to check locally if there is something wrong with the deployment or with the code…

This setup is not meant for ANY load testing, even just 10k workflows. It is intended for development only. The appropriate setups can handle 10k+ workflows/second.

My point here is - has anybody tested temporal with PHP? Can’t there be something wrong with the roadrunner plug-in, when everything is idling?

Hi,

I’ve been testing RR locally with the PHP plugin and observed similar behavior when using Cassandra as the backend (when the whole system froze for a few minutes periodically).

After a deeper investigation, the issue was found outside of the plugin (on the polling end). However, I’ve been able to achieve 10K concurrent workflows on auto-setup without many issues on the PostgreSQL database.

The RR plugin operates as a simple workflow task router/buffer; it does not contain any task-balancing logic that could cause such an effect locally. Any issue with the RR plugin would most likely show up as a WorkflowTaskTimeout, not as a general failure. The rest of the code is identical to the well-tested Golang SDK.

Thanks for the info. BTW I’m not trying to achieve 10k concurrent workflows, I just want to see how long it takes to process 10k workflows… And I would like to find out why the workers don’t operate continuously…

From what you’re writing it seems like RoadRunner should be OK… I was thinking about a polling issue, but OK, I’ll try something else… I’ll try the Postgres backend then, thanks

BTW @Wolfy-J How did you discover the problem? Could you please describe some troubleshooting experiences?

I’ve been debugging it while writing the plugin itself, so it was via tweaking internals. I strongly recommend testing it on a production setup. The one in Docker tends to behave erratically under load.

OK, so I tested this with the auto setup and also with a more production-like setup in Kubernetes (with 512 history shards).

The whole system behaves basically the same way on both deployments. But I found out that the delays don’t occur when I’m running multiple RoadRunner processes (each with 10 workers) in parallel. I also found out that when running with only 1 RoadRunner process (10 workers), the delays seem to take exactly 1 minute. It looks to me like there is a timeout somewhere which says that when a workflow can’t be scheduled/processed on a worker within some given time, it needs to wait another minute to get rescheduled. Is something like that possible?

In the picture, the one-minute delay is visible in the WorkflowTaskStarted event. Sometimes the reschedule takes 2 minutes (in the workflows that were started last).

Any ideas what I’m doing wrong?
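
One way to put a number on the delay described above is to compute, per workflow, the gap between each WorkflowTaskScheduled event and the WorkflowTaskStarted event that follows it. The sketch below assumes the history has already been exported into a plain array of event type and timestamp; the function name `scheduleToStartDelays` and the sample history are hypothetical and made up for illustration, not taken from the thread.

```php
<?php

declare(strict_types=1);

/**
 * Returns the delay in seconds between each WorkflowTaskScheduled event and
 * the WorkflowTaskStarted event that follows it.
 *
 * @param list<array{type: string, time: string}> $events history, in order
 * @return list<int>
 */
function scheduleToStartDelays(array $events): array
{
    $delays = [];
    $scheduledAt = null;

    foreach ($events as $event) {
        if ($event['type'] === 'WorkflowTaskScheduled') {
            // Remember when the task was handed to the task queue.
            $scheduledAt = new DateTimeImmutable($event['time']);
        } elseif ($event['type'] === 'WorkflowTaskStarted' && $scheduledAt !== null) {
            // A worker picked the task up; record how long it waited.
            $startedAt = new DateTimeImmutable($event['time']);
            $delays[] = $startedAt->getTimestamp() - $scheduledAt->getTimestamp();
            $scheduledAt = null;
        }
    }

    return $delays;
}

// Hypothetical history, invented for this example only.
$history = [
    ['type' => 'WorkflowExecutionStarted', 'time' => '2021-06-01T10:00:00Z'],
    ['type' => 'WorkflowTaskScheduled',    'time' => '2021-06-01T10:00:00Z'],
    ['type' => 'WorkflowTaskStarted',      'time' => '2021-06-01T10:01:00Z'],
    ['type' => 'WorkflowTaskCompleted',    'time' => '2021-06-01T10:01:00Z'],
];

echo implode(', ', scheduleToStartDelays($history)), PHP_EOL; // prints "60"
```

A cluster of values at 60 or 120 seconds across many workflows would match the one- and two-minute stalls reported here.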

@Wolfy-J To me it seems like some RoadRunner processes get stuck while processing but others don’t, so in the end the delays don’t happen with multiple RoadRunner processes (or at least there is a lower chance of getting stuck).

This indicates that the temporal deployment should be quite ok, it is able to handle the started workflows (currently I’m testing only 1000 starts)…

Couldn’t there be a hardcoded timeout somewhere in the RoadRunner plugin?

No, there are no hardcoded timeouts on the RoadRunner end. But we will do some extra checks…

Edit: actually we do have one, on pool allocation. I asked the dev to join this thread.

I was also able to reproduce the behaviour on a modified sample from the samples repository (temporalio/samples-php).

I am running the polymorphic example, but I modified the ExecuteCommand to this:

class ExecuteCommand extends Command
{
    protected const NAME = 'polymorphic';
    protected const DESCRIPTION = 'Execute PolymorphicActivity\GreetingWorkflow';

    public function execute(InputInterface $input, OutputInterface $output)
    {


        $output->writeln("Starting <comment>GreetingWorkflow</comment>... ");

        for ($i = 0; $i < 2000; $i++) {
            $workflow = $this->workflowClient->newWorkflowStub(
                GreetingWorkflowInterface::class,
                WorkflowOptions::new()//->withWorkflowExecutionTimeout(CarbonInterval::minute())
            );
            $run = $this->workflowClient->start($workflow, 'Antony');
        }
//        $output->writeln(
//            sprintf(
//                'Started: WorkflowID=<fg=magenta>%s</fg=magenta>, RunID=<fg=magenta>%s</fg=magenta>',
//                $run->getExecution()->getID(),
//                $run->getExecution()->getRunID(),
//            )
//        );
//
//        $output->writeln(sprintf("Result:\n<info>%s</info>", $run->getResult()));

        return self::SUCCESS;
    }
}

and the GreetingWorkflow to this:

class GreetingWorkflow implements GreetingWorkflowInterface
{
    /** @var GreetingWorkflowInterface[] */
    private $greetingActivities = [];

    public function __construct()
    {
        $this->greetingActivities = [
            Workflow::newActivityStub(
                HelloActivity::class,
                ActivityOptions::new()->withScheduleToCloseTimeout(CarbonInterval::seconds(20))
            ),
            Workflow::newActivityStub(
                ByeActivity::class,
                ActivityOptions::new()->withScheduleToCloseTimeout(CarbonInterval::seconds(20))
            ),
        ];
    }

    public function greet(string $name): \Generator
    {
        $result = [];
        foreach ($this->greetingActivities as $activity) {
            $result[] = yield $activity->composeGreeting($name);
        }

        return join("\n", $result);
    }
}

It doesn’t always occur, but running it several times in a row should make the problem appear.

Actually, I might know the cause… do you limit the number of workflow executions per worker in any way, via worker options in your PHP code?

Hey @rastusik, so the problem is actually in the RR internal pool timeout. By default, it uses a 60s timeout to get a free worker to process a request. I’ll make it unlimited in the next version.
*or possibly configurable.
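
To see why a fixed wait limit on worker allocation can surface as stalls under a burst, here is a toy model of a fixed-size pool in which a request is given up once it has waited longer than the limit. It is not RoadRunner’s implementation: `simulatePool` and all of its parameters are hypothetical, every request is assumed to arrive at the same moment, and every job is assumed to take the same time.

```php
<?php

declare(strict_types=1);

/**
 * Toy model of a fixed-size worker pool with an allocation timeout.
 * NOT RoadRunner's code; every name here is made up for illustration.
 *
 * All requests arrive at t=0 and each needs $jobSeconds on a worker. A request
 * that would wait longer than $allocateTimeout for a free worker is rejected.
 *
 * @return array{served: int, rejected: int}
 */
function simulatePool(int $requests, int $workers, int $jobSeconds, int $allocateTimeout): array
{
    // Time at which each worker becomes free again.
    $freeAt = array_fill(0, $workers, 0);
    $served = 0;
    $rejected = 0;

    for ($i = 0; $i < $requests; $i++) {
        sort($freeAt);      // earliest available worker comes first
        $wait = $freeAt[0]; // arrival is t=0, so the wait equals the free time

        if ($wait > $allocateTimeout) {
            $rejected++;    // no worker became free within the limit
            continue;
        }

        $freeAt[0] = $wait + $jobSeconds;
        $served++;
    }

    return ['served' => $served, 'rejected' => $rejected];
}

// 100 requests, 4 workers, 5 s per job, 60 s allocation limit (all assumed values).
$result = simulatePool(100, 4, 5, 60);
printf("served=%d rejected=%d\n", $result['served'], $result['rejected']);
// prints "served=52 rejected=48"
```

In this model, everything queued behind the first 60 seconds of work is dropped at once, so those requests only make progress after whatever retries them kicks in. That is the kind of step-shaped delay the thread describes, although the real retry path is not modelled here.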

Cool, thanks for the info @rustatian ! Is there any ETA?

@rastusik If it’s that urgent, I can create a beta version for you tomorrow. If not, it will come with the RR 2.4.0 release (middle of July).

@rustatian well, this would help our company adopt Temporal earlier and get it into production, but I can wait, just a few people will be disappointed :slight_smile:

Maybe I could try to make the fix if you could show me the file where the problem is

@rastusik got u. You’ll get a beta tomorrow :slight_smile: Could you please create a ticket in the roadrunner-temporal repository, so I can ping you when the beta is ready?

@rustatian thanks a lot! yeah, I’m going to create the ticket now
