Maru Spike Test Corrupts AWS RDS PostgreSQL DB (Context Deadline Exceeded)

Hi all, I may have found a bug in Temporal when running heavy workloads.
Here’s a link to the report I made: https://github.com/temporalio/temporal/issues/3131

Has anyone else seen heavy workloads corrupt the database? After the Maru test fails, I can no longer run any workflows on Temporal until I destroy and redeploy the database.

Thanks,
Cameron

Just to note for the team: I’m looking into this.

I can confirm that removing the Fargate profile and the CNI plugin, and switching the node capacity type to On-Demand, resolved the issue on AWS EKS. Before that, Fargate seemed to be tainting the pods and scheduling them all onto a single node, which then ran out of resources. I also had to adjust the security profile of the node group to allow all traffic between the nodes. Thanks @Rob_Temporal for all the help :slight_smile:
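
For anyone who wants to check whether their pods are piling up on one node the way mine were, here is a minimal sketch. It assumes you have saved or piped the output of `kubectl get pods -o json` and parsed it into a dict; the function name `pods_per_node` and the `<unscheduled>` label are made up for illustration and are not part of Temporal or Maru.

```python
from collections import Counter


def pods_per_node(pod_list: dict) -> Counter:
    """Count pods per node from parsed `kubectl get pods -o json` output.

    Hypothetical helper for illustration only. Pods that have no
    spec.nodeName yet are grouped under "<unscheduled>".
    """
    counts = Counter()
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        counts[node] += 1
    return counts


# Tiny made-up sample in the shape kubectl returns (not real cluster data).
sample = {
    "items": [
        {"spec": {"nodeName": "node-a"}},
        {"spec": {"nodeName": "node-a"}},
        {"spec": {}},
    ]
}
print(pods_per_node(sample).most_common())
```

With real data you would load the JSON with `json.load` and pass it in. If one node holds nearly every pod, that matches the symptom I saw before changing the node group.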
