Random heartbeat failures

Zigmas_Satkevicius · November 26, 2021, 1:58pm

Hello,

This concerns the PHP implementation. I’m at my wits’ end here, so I’m searching for help anywhere I can think of. If anyone could reproduce this problem on their end, it would be really helpful. At least I would know this is not an issue with my specific setup.

A very detailed description of the issue:

github.com/temporalio/roadrunner-temporal

[BUG?] Heartbeat on non running activity

opened 10:44AM - 26 Oct 21 UTC

Zylius

F-need-verification

### Describe the bug

Sending out a heartbeat in activity method results in `…heartbeat on non running activity` (at random times). The failed state for heartbeats seems to be set upon roadrunner launch (`rr serve` command). If the activity worker launches with a 'failed' heartbeat state, it'll continue failing upon Heartbeats until it, or the Workflows `rr serve` instance is killed and restarted. I'm still not very clear on what fixes it. The more roadrunner instances I run (`rr serve`) the higher the chance of this error.

The failing heartbeats throw: `'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001`

Or sometimes the worker (very rarely) enters a different state for heartbeats and failed with a socket error. IDK if it's related: `socket_send(): unable to write to socket [32]: Broken pipe`

### To Reproduce

1. Create a dummy workflow.
2. Create a dummy activity.
3. Send a heartbeat in the activity
4. Launch `rr serve` commands for the workflow part and activities separately. In our system workflows are contained in a single app, and activities are handled by different service apps. The more `rr serve` instances for the activity app is launched, the higher the chance one will see the error. It's pretty easy to recreate, it doesn't require many workers. I would say on my end with one worker it's about 50/50 to see the heartbeat error.
5. Launch the workflow
6. See the error.

### Expected behavior

Heartbeat to be sent out successfully.

### Screenshots/Terminal output

#### Errors

```
Temporal\Exception\TransportException: Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001

Spiral\Goridge\RPC\Exception\ServiceException: Error 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001 in /home/docker/services/hyperwallet/vendor/spiral/goridge/src/RPC/RPC.php:128
Stack trace:
#0 /home/docker/services/hyperwallet/vendor/spiral/goridge/src/RPC/RPC.php(98): Spiral\Goridge\RPC\RPC->decodeResponse(Object(Spiral\Goridge\Frame), NULL)
#1 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Worker/Transport/Goridge.php(56): Spiral\Goridge\RPC\RPC->call('activities.Reco...', Array)
#2 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Internal/Activity/ActivityContext.php(137): Temporal\Worker\Transport\Goridge->call('activities.Reco...', Array)
```

Very rare (dunno if related)

```
ErrorException: Warning: socket_send(): unable to write to socket [32]: Broken pipe

ErrorException: Warning: socket_send(): unable to write to socket [32]: Broken pipe in /home/docker/services/hyperwallet/vendor/spiral/goridge/src/SocketRelay.php:234
Stack trace:
#0 /home/docker/services/hyperwallet/vendor/spiral/goridge/src/RPC/RPC.php(83): Spiral\Goridge\SocketRelay->send(Object(Spiral\Goridge\Frame))
#1 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Worker/Transport/Goridge.php(56): Spiral\Goridge\RPC\RPC->call('activities.Reco...', Array)
#2 /home/docker/services/hyperwallet/vendor/temporal/sdk/src/Internal/Activity/ActivityContext.php(137): Temporal\Worker\Transport\Goridge->call('activities.Reco...', Array)
```

#### Code samples:

```php
<?php
declare(strict_types=1);

namespace App\Command;

use Carbon\CarbonInterval;
use Temporal\Activity\ActivityOptions;
use Temporal\Workflow;
use Temporal\Workflow\WorkflowInterface;

#[WorkflowInterface]
class TemporalTestWorkflow
{
    #[Workflow\WorkflowMethod]
    public function launchActivity()
    {
        $options = ActivityOptions::new()
            ->withStartToCloseTimeout(CarbonInterval::seconds(2))
            ->withTaskQueue('hyperwallet_command_bus');

        Workflow::newUntypedActivityStub($options)->execute('testActivity');
    }
}
```

```php
<?php

declare(strict_types=1);

namespace App;

use Temporal\Activity;
use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); // This causes failure.
    }
}
```

### Versions

- OS: In Docker `Debian GNU/Linux 10 (buster)`, host `Pop!_OS 21.04`
- Temporal Version: 1.12.2
- Roadrunner version: 2.5.2
- PHP composer versions:
  - spiral/roadrunner: 2.5.0
  - spiral/roadrunner-cli: 2.0.11
  - spiral/roadrunner-http: 2.0.4
  - spiral/roadrunner-worker: 2.1.3
  - temporal/sdk: 1.0.4
- We're using docker image `temporalio/auto-setup` in our `docker-compose.yaml`, we've seen this error with sample temporal `docker-compose-mysql-es.yaml` files as well.

### Additional context

I'm not sure, maybe this is an issue only with our setup, but any help would be greatly appreciated.

Short version:
Sending out a heartbeat in an activity method results in “heartbeat on non running activity” (at random times). The failed state for heartbeats seems to be set upon roadrunner launch (the rr serve command). If the activity worker launches with a ‘failed’ heartbeat state, it’ll continue failing on heartbeats until it or the workflow’s rr serve instance is killed and restarted. I’m still not very clear on what fixes it. The more roadrunner instances I run (rr serve), the higher the chance of this error.

The failing heartbeats throw: 'activity_pool_get_activity_context: heartbeat on non running activity' on tcp://127.0.0.1:6001

Or sometimes (very rarely) the worker enters a different state for heartbeats and fails with a socket error. IDK if it’s related:
socket_send(): unable to write to socket [32]: Broken pipe

tihomir · November 26, 2021, 3:14pm

Can you show your workflow history when this error happens? Probably best to add it to the GitHub issue as well.

Zigmas_Satkevicius · November 26, 2021, 4:04pm

Hello, do you mean this? Or should I post the json?

This history is generated by the code:

<?php
declare(strict_types=1);

namespace App\Command;

use Carbon\CarbonInterval;
use Temporal\Activity\ActivityOptions;
use Temporal\Workflow;
use Temporal\Workflow\WorkflowInterface;

#[WorkflowInterface]
class TemporalTestWorkflow
{
    #[Workflow\WorkflowMethod]
    public function launchActivity()
    {
        $options = ActivityOptions::new()
            ->withStartToCloseTimeout(CarbonInterval::seconds(2))
            ->withTaskQueue('hyperwallet_command_bus');

        Workflow::newUntypedActivityStub($options)->execute('testActivity');
    }
}

<?php

declare(strict_types=1);

namespace App;

use Temporal\Activity;
use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;

#[ActivityInterface]
class TestActivity
{
    #[ActivityMethod]
    public function testActivity()
    {
        Activity::heartbeat('test'); // This causes failure.
    }
}

tihomir · November 26, 2021, 4:07pm

Yes, would help to see the “Full Details” view, or the exported json, which you can get via tctl as well, for example:

tctl wf show -w <wfid> -r <runid> --output_filename myoutput.json

Also can you show the full text of your activity failure (expand the failure on your ActivityFailed event).

Zigmas_Satkevicius · November 26, 2021, 4:14pm

Here is the json, and a close up picture of the failure.

tihomir · November 26, 2021, 5:07pm

Do you get the same error if your workflow and activity run on the same task queue?

tihomir · November 26, 2021, 5:13pm

Wait, do you not have an activity interface, just the impl?
How are you registering the activity with the worker? This is confusing. Your activity needs an activity interface, see the php samples.

Zigmas_Satkevicius · November 26, 2021, 5:23pm

I don’t think it does, and it doesn’t influence the result either way. See: Dynamic activities registration at runtime and worker splitting - #2 by maxim. In the examples it doesn’t have an interface, for simplicity’s sake.

Zigmas_Satkevicius · November 26, 2021, 5:23pm

I’ll try testing that.

tihomir · November 26, 2021, 5:30pm

Ok, will try to test with Go and Java sdks and see what happens. Wondering why you would just heart beat once and complete the activity, but I guess it could catch an edge case where activity completion is registered before the heart beat.
Do you have a concrete test with a long-running activity that heart beats, that you use to test cancellations for example, or is this just a quick test to heart beat once?
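For comparison with the single-heartbeat test, a long-running activity that heartbeats repeatedly might look roughly like the sketch below. It is only an illustration, not code from this thread: the class name `LongRunningTestActivity`, the method name, and the `$steps` parameter are hypothetical, `sleep(1)` stands in for real work, and it assumes the `temporal/sdk` package is installed. The calling workflow would also need a start-to-close timeout longer than the 2 seconds used in the original test.

```php
<?php

declare(strict_types=1);

namespace App;

use Temporal\Activity;
use Temporal\Activity\ActivityInterface;
use Temporal\Activity\ActivityMethod;

// Hypothetical activity used only to illustrate repeated heartbeats.
#[ActivityInterface]
class LongRunningTestActivity
{
    #[ActivityMethod]
    public function longRunningActivity(int $steps = 10): int
    {
        for ($i = 1; $i <= $steps; $i++) {
            sleep(1);                // stand-in for one unit of real work
            Activity::heartbeat($i); // report progress after each step
        }

        return $steps;
    }
}
```

If the problem described in this thread is present, every `Activity::heartbeat()` call in the loop would be expected to fail the same way the single call does, which makes the failure easier to observe in the workflow history.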

Zigmas_Satkevicius · November 26, 2021, 5:33pm

I think it’s useless to test it with go and java sdks. It involves roadrunner which is PHP specific.

This is just a quick test I use. It’s fairly easy to recreate for me. I tried adding a sleep in the activity code after the heartbeat, but that didn’t help.

tihomir · November 26, 2021, 5:40pm

Regarding Dynamic activities registration at runtime and worker splitting - #2 by maxim
yes your workflow is using untyped stub to invoke the activity, but the worker that listens to your “hyperwallet_command_bus” task queue still has to register that activity. And I believe it needs to have an activity interface.

Update - you are right, not needed for PHP SDK

Wolfy-J · November 26, 2021, 6:12pm

We will try to recreate this use case next week. My intuition tells me that your activity is somehow timed out…
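One way to rule out the timeout hypothesis would be to rerun the test with more generous timeouts. The sketch below is an assumption-laden variant of the test workflow from this thread: the class name `TemporalTestWorkflowWithTimeouts` and the 30 and 10 second values are illustrative, not recommendations, and it assumes `temporal/sdk` and `nesbot/carbon` are installed.

```php
<?php

declare(strict_types=1);

namespace App\Command;

use Carbon\CarbonInterval;
use Temporal\Activity\ActivityOptions;
use Temporal\Workflow;
use Temporal\Workflow\WorkflowInterface;

// Hypothetical variant of the test workflow with longer timeouts.
#[WorkflowInterface]
class TemporalTestWorkflowWithTimeouts
{
    #[Workflow\WorkflowMethod]
    public function launchActivity(): \Generator
    {
        $options = ActivityOptions::new()
            // Illustrative value: well above the 2 seconds used in the original test.
            ->withStartToCloseTimeout(CarbonInterval::seconds(30))
            // Illustrative value: the activity must heartbeat at least this often.
            ->withHeartbeatTimeout(CarbonInterval::seconds(10))
            ->withTaskQueue('hyperwallet_command_bus');

        // Wait for the activity to finish before the workflow completes.
        yield Workflow::newUntypedActivityStub($options)->execute('testActivity');
    }
}
```

If the heartbeat error still appears with these settings, a plain activity timeout becomes a less likely explanation.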

Zigmas_Satkevicius · November 27, 2021, 5:10pm

I can confirm I have the same error on the same roadrunner (workflow and activity on the same service basically) instance and the same queue.

Thanks! Looking forward to that. I think using the same queue and the same roadrunner instance will make the testing easier. Please note you sometimes need to restart the roadrunner instance; the chance of getting the “bad instance” is about 1 in 3 in my case.

Wolfy-J · December 3, 2021, 2:46pm

I tried to replicate this issue locally with 10000 concurrent workflows and resetting the activity pool as it goes… and nothing.

I simply can’t reproduce this issue. Based on the code I see, the only way for this condition to trigger is if the activity pool is being dropped/replaced (somehow) while the activity is pushing its data via RPC.

Can we try to focus on the “broken pipe” and other issues first? I want to make sure there are no other env-specific errors which might be causing this weird side effect.

Zigmas_Satkevicius · December 3, 2021, 6:26pm

Hello, just as I posted on the original GitHub issue, I cannot reproduce it anymore with 2.6.3 myself. I think it’s fixed!! Thank you so much guys. This goes for both the broken pipe and the heartbeat on non running activity errors. If anyone else is having this issue, update roadrunner to 2.6.3.
