Random heartbeat failures

Hello,

This concerns the PHP implementation. I’m at my wits’ end here, so I’m looking for help anywhere I can think of. If anyone could reproduce this problem on their end, it would be really helpful. At least I would know this is not an issue with my specific setup.

A very detailed description of the issue:

Short version:
Sending a heartbeat from an activity method results, at random times, in a “heartbeat on non running activity” error. The failed state for heartbeats seems to be set when RoadRunner launches (the rr serve command). If the activity worker launches in this “failed” heartbeat state, every heartbeat keeps failing until it, or the workflow’s rr serve instance, is killed and restarted. I’m still not clear on which of the two actually fixes it. The more RoadRunner instances I run (rr serve), the higher the chance of this error.

The failing heartbeats throw: 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001

Very rarely, the worker instead enters a different state in which heartbeats fail with a socket error. I don’t know if it’s related:
socket_send(): unable to write to socket [32]: Broken pipe

Can you show your workflow history when this error happens? Probably best to add it to the GitHub issue as well.

Hello, do you mean this? Or should I post the JSON?

This history is generated by the following code:

<?php
declare(strict_types=1);

namespace App\Command;

use Carbon\CarbonInterval;
use Temporal\Activity\ActivityOptions;
use Temporal\Workflow;
use Temporal\Workflow\WorkflowInterface;

#[WorkflowInterface]
class TemporalTestWorkflow
{
    #[Workflow\WorkflowMethod]
    public function launchActivity()
    {
        $options = ActivityOptions::new()
            ->withStartToCloseTimeout(CarbonInterval::seconds(2))
            ->withTaskQueue('hyperwallet_command_bus');

        Workflow::newUntypedActivityStub($options)->execute('testActivity');
    }
}

<?php

declare(strict_types=1);

namespace App;

use Temporal\Activity;
use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); // This causes failure.
    }
}

Yes, it would help to see the “Full Details” view or the exported JSON, which you can also get via tctl, for example:

tctl wf show -w <wfid> -r <runid> --output_filename myoutput.json

Also, can you show the full text of your activity failure (expand the failure on your ActivityFailed event)?

Here is the JSON, and a close-up picture of the failure.

Do you get the same error if your workflow and activity run on the same task queue?

Wait, do you not have an activity interface, just the impl?
How are you registering the activity with the worker? This is confusing. Your activity needs an activity interface, see the php samples.

I don’t think it does, and it doesn’t influence the result either way. See “Dynamic activities registration at runtime and worker splitting - #2 by maxim”. In the examples it doesn’t have an interface, for simplicity’s sake.

I’ll try testing that.

OK, I will try to test with the Go and Java SDKs and see what happens. I’m wondering why you would heartbeat just once and then complete the activity, but I guess it could catch an edge case where the activity completion is registered before the heartbeat.
Do you have a concrete test with a long-running activity that heartbeats, one that you use to test cancellations for example, or is this just a quick test that heartbeats once?

I think it’s pointless to test it with the Go and Java SDKs. It involves RoadRunner, which is PHP-specific.

This is just a quick test I use. It’s fairly easy for me to recreate. I tried adding a sleep in the activity code after the heartbeat; it didn’t help.
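For anyone who wants to try the long-running variant asked about above, here is a minimal sketch of an activity that heartbeats repeatedly. The class name, the step count, and the one-second sleep are hypothetical and not taken from the thread. It also assumes the workflow’s start-to-close timeout is raised well above the 2 seconds used in the original workflow, otherwise the activity would time out before finishing.

```php
<?php

declare(strict_types=1);

namespace App;

use Temporal\Activity;
use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;

// Hypothetical long-running counterpart of TestActivity (name is an assumption).
#[ActivityInterface]
class LongRunningTestActivity
{
    // Number of one-second steps; an arbitrary value chosen for illustration.
    private const STEPS = 10;

    #[ActivityMethod]
    public function longRunningTestActivity(): int
    {
        for ($i = 1; $i <= self::STEPS; $i++) {
            sleep(1); // stand-in for one unit of real work

            // Report progress after each step; the payload is arbitrary details.
            Activity::heartbeat($i);
        }

        return self::STEPS;
    }
}
```

With the untyped stub from the workflow above, this would presumably be invoked as execute('longRunningTestActivity'). If every heartbeat in the loop fails on a “bad” instance, that would at least rule out the single-heartbeat race mentioned earlier.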

Regarding “Dynamic activities registration at runtime and worker splitting - #2 by maxim”:
yes, your workflow is using an untyped stub to invoke the activity, but the worker that listens to your “hyperwallet_command_bus” task queue still has to register that activity. And I believe it needs to have an activity interface.

Update: you are right, it’s not needed for the PHP SDK.
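For reference, a minimal sketch of how a worker script might register the workflow and the activity implementation from this thread on the shared task queue. The actual worker script was not posted, so this is an assumption: it presumes the usual Composer autoload layout and an SDK release that provides registerActivityImplementations (other releases may expose a different registration method).

```php
<?php

declare(strict_types=1);

use App\Command\TemporalTestWorkflow;
use App\TestActivity;
use Temporal\WorkerFactory;

// Assumed project layout: Composer autoloader next to this script.
require __DIR__ . '/vendor/autoload.php';

$factory = WorkerFactory::create();

// One worker polling the task queue named in the workflow's ActivityOptions.
$worker = $factory->newWorker('hyperwallet_command_bus');

// Register the workflow class and the concrete activity object;
// no separate activity interface is declared, matching the code in the thread.
$worker->registerWorkflowTypes(TemporalTestWorkflow::class);
$worker->registerActivityImplementations(new TestActivity());

// Hand control to RoadRunner, which starts this script via `rr serve`.
$factory->run();
```

Registering both on one worker corresponds to the “same task queue, same RoadRunner instance” setup discussed in this thread.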

We will try to recreate this use case next week. My intuition tells me that your activity is somehow timed out…


I can confirm I get the same error with the workflow and activity on the same RoadRunner instance (basically the same service) and the same queue.

Thanks! Looking forward to that. I think using the same queue and the same RoadRunner instance will make the testing easier. Please note that you sometimes need to restart the RoadRunner instance: the chance of getting the “bad instance” is about 1 in 3 in my case.

I tried to replicate this issue locally with 10000 concurrent workflows and resetting the activity pool as it goes… and nothing.

I simply can’t reproduce this issue. Based on the code I see, the only way for this condition to trigger is if the activity pool is being dropped or replaced (somehow) while the activity is pushing its data via RPC.

Can we try to focus on the “dropped pipe” and other issues first? I want to make sure there are no other environment-specific errors which might be causing this weird side effect.


Hello, just as I posted on the original GitHub issue, I can no longer reproduce it with 2.6.3. I think it’s fixed! Thank you so much, guys. This goes for both the dropped pipe and the heartbeat on non running activity. If anyone else is having this issue, update RoadRunner to 2.6.3.

:bowing_man:
