How to handle an activity that can't be restarted and needs to send thousands of signals?

Hello, I have the following use case that I can't figure out how to implement with Temporal:
I have a scraping platform that has two main steps for each crawl:

  1. Navigate through all webpages to extract the links that give me the product details
  2. Then for each extracted link, fetch the page and extract the details

The issue I'm facing is how to integrate the existing library with Temporal, given the hard event history limit.

Step 1 (navigation and link extraction) emits an array of links for each page it crawls. The emission can be done as an in-memory event with EventEmitter or pushed to a Kafka topic. The problem is that the link extraction can't be restarted; it's a single-process kind of thing that runs from beginning to end and only returns when all pages have been crawled, so I wouldn't be able to, say, resume from page 4 with a continueAsNew strategy. I started a POC that sends a signal for each crawled page, but in production I will certainly hit the 50k event limit in the parent workflow.
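
To make it concrete, the POC looks roughly like this (a simplified sketch with the TypeScript SDK; the signal name and payload shape are illustrative, not my real code):
import { defineSignal, setHandler } from '@temporalio/workflow';

// Hypothetical signal, sent (via a client) once per crawled page
export const pageCrawled = defineSignal<[string[]]>('pageCrawled');

export async function crawlWorkflow(crawlerId: string): Promise<void> {
  const links: string[] = [];
  setHandler(pageCrawled, (pageLinks) => {
    links.push(...pageLinks);
  });
  // ...run the extract-links activity, wait for it to finish, then fan out step 2...
  // Problem: every signal appends events to this workflow's history, so tens of
  // thousands of pages blow through the 50k event limit before I can continueAsNew.
}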

Step 2 is, I think, easier because it's a single action (fetch the URL and extract the details) and can map 1-to-1 to a child workflow (right?)

I thought of using Temporal to orchestrate the steps and detect when a crawl has finished:

  • but how can I hold thousands of links in memory (each one is an object with some attributes plus the link)?
  • or how can I use Temporal correctly to deal with the extract-links step issue?

Some questions.

  1. What is the source of the list of webpages to extract the links? Can this source be paginated? What is the maximum number of results the source returns?
  2. What is the maximum number of links extracted from a single page?

The problem here is that the links extraction can’t be restarted,

Cannot be restarted for a single webpage? Or does the source of pages to extract always return a single array that cannot be paginated?

Wow!! Thanks for the super-fast feedback!
Let me answer:

  1. The crawler receives a crawlRequest from another API; the request is for one specific crawler. This triggers the start of the crawl, i.e. the navigation step. At this point I don't know how many links the crawler will encounter on the website (each crawler is specific to one site). It is possible to signal/emit the results for each page, but the function can't return until all the pages have been crawled. I've built several robots that find as many as 500k links on a website, but I would say the average is around 20-30k links extracted.
    The function is something like this:
// Small pseudocode for exemplification
async function extractLinks(request: CrawlRequest) {
  let hasNext = true
  let page = 1
  let links = new Set()
  while (hasNext) {
    // fetch the current results page of the site being crawled
    const response = await get('http://')
    // somehow parse the links in the page
    const pageLinks = parseLinks(response)
    links = new Set([...links, ...pageLinks])
    hasNext = hasNextPage(response) // some logic to determine if there is a next page
    page += 1
    await produceDetailsRequest(pageLinks) // This would emit the links for each page (through signal, Kafka or event)
  }
  return [...links]
}
  2. On a single page I believe it's in the hundreds, ~100-700 (max) per page; so far I haven't encountered more than that.

  3. The extractLinks function is a single piece of code for everything navigation-wise (see the code above); the logic is much like that in the majority of cases.

I just saw another limitation on the number of child workflows that can exist inside a workflow… Is there any way of implementing batching without knowing the number of records a step will have?

Adding more context to the use-case:
The boxes with the same color represent equivalent steps in the different strategies that a crawler can implement. A crawler can have all three implemented, but within a single execution only one strategy is used. The client that requests the crawl specifies the strategy to use in the request.

How would I deal with this non-deterministic scenario where there are 3 possible flows?

Sorry about the flood of messages, but I’m a little obsessed with finding a solution to this. Do you think this would be a good architecture?

I don't think this will work well; it would put a lot of overhead on Temporal with the number of workflows you'll end up spawning (>100K), the number of signals you'll send, the cost of maintaining parent-child relationships, and the continue-as-new complexity.
There are also some limits you might hit, like the 2MB payload size limit when returning 100k links from an activity.

A different approach would be to rely more on activities and to snapshot activity state with periodic heartbeats so you don't lose all progress on retries.
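
For example, the extract-links step could become one long-running activity that checkpoints its progress in heartbeat details (a minimal sketch with the TypeScript SDK; get, parseLinks, hasNextPage, and dispatchBatch are assumed helpers, not part of any API):
import { heartbeat, activityInfo } from '@temporalio/activity';
import { get, parseLinks, hasNextPage } from './crawler'; // assumed existing crawler helpers
import { dispatchBatch } from './dispatch';               // sketched in the next snippet

// Records the current page number in heartbeat details so a retried attempt can
// resume from there instead of starting over.
export async function extractLinksActivity(crawlId: string, startUrl: string): Promise<void> {
  const checkpoint = activityInfo().heartbeatDetails as { page: number } | undefined;
  let page = checkpoint?.page ?? 1;
  let hasNext = true;
  while (hasNext) {
    const response = await get(`${startUrl}?page=${page}`);
    const pageLinks = parseLinks(response);
    await dispatchBatch(crawlId, page, pageLinks); // start a crawl-details workflow for this batch
    hasNext = hasNextPage(response);
    page += 1;
    heartbeat({ page }); // keep details small (<2MB); the SDK throttles how often these are sent
  }
}
This only buys resumability to the extent the underlying crawler can actually start from a given page; if it can't, empty heartbeats (see below) at least keep the activity alive.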

You could have the getInitialPages activity start a workflow for each batch of links directly, instead of returning all the links to the workflow, and use a meaningful workflow ID to avoid creating duplicate workflows.
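
Roughly (a sketch; the workflow name, task queue, and ID scheme are placeholders):
import { Connection, Client, WorkflowExecutionAlreadyStartedError } from '@temporalio/client';

// Called from an activity: starts one workflow per batch of links.
// A deterministic workflow ID makes retried dispatches no-ops.
export async function dispatchBatch(crawlId: string, batchNumber: number, links: string[]): Promise<void> {
  const connection = await Connection.connect(); // in real code, reuse a shared connection
  const client = new Client({ connection });
  try {
    await client.workflow.start('crawlDetailsWorkflow', {
      taskQueue: 'crawler',                                  // placeholder task queue
      workflowId: `crawl-details-${crawlId}-${batchNumber}`, // meaningful, deduplicating ID
      args: [links],
    });
  } catch (err) {
    // Already started by a previous (retried) attempt: safe to ignore.
    if (!(err instanceof WorkflowExecutionAlreadyStartedError)) throw err;
  }
}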

I'm not sure I understand the logic of the crawl-details-links workflow, but IIUC you're adding more pages to the set via the addResultsPage signal as well as extracting product links from those pages.
Assuming I understand the flow, I would use a similar approach where more of the work is done in the activity, avoiding a workflow per page and the extra signals.
You could maintain the set of links in memory in the activity and, as you progress, dispatch a crawl-details workflow with a batch containing multiple links, using the workflow ID for deduplication.
The crawl-details workflow would use a single activity to crawl all the links in its batch.
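
The batch workflow itself then stays tiny, something like this (a sketch, assuming a crawlDetails activity that handles a whole batch):
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities'; // assumes crawlDetails is defined there

const { crawlDetails } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 hour', // placeholder; size it to your batches
  heartbeatTimeout: '2 minutes', // detect a dead worker without waiting for startToClose
});

// One workflow per batch of links; a single activity crawls the whole batch.
export async function crawlDetailsWorkflow(links: string[]): Promise<void> {
  await crawlDetails(links);
}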

This approach would remove a lot of complexity and overhead compared to your proposed approach.
How you leverage heartbeats depends on how much checkpointing you need and how resumable each of these activity processes is. Keep in mind that while the SDK throttles activity heartbeats, they are not free server-side and have the same 2MB size limit.
If you want to store intermediate state larger than 2MB, I suggest using a blob store like S3, but that might not be needed for your use case; you might be fine with empty heartbeat details just to keep long activities alive and avoid setting a very long startToClose timeout.
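
The keep-alive case can be as simple as this (sketch; crawlSingleLink is an assumed helper):
import { heartbeat } from '@temporalio/activity';
import { crawlSingleLink } from './crawler'; // assumed helper doing the actual fetch/extract

// Long-running batch crawl with no checkpoint state: empty heartbeats only prove
// liveness, letting a short heartbeatTimeout replace a very long startToCloseTimeout
// for detecting a crashed worker.
export async function crawlDetails(links: string[]): Promise<void> {
  for (const link of links) {
    await crawlSingleLink(link);
    heartbeat(); // no details attached
  }
}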

Hey Roey! Thanks for the comment! I'll digest it to understand it better and will probably come back with some questions, if that's ok.

I’m going to read more about heartbeat to understand its usage first.

The crawl-details-links workflow is responsible for entering a page and getting the links that redirect to a detailed item (such as a product details page). Then the crawl-details-workflow is responsible for taking each link extracted in the parent workflow and extracting its specific details.

@bergundy About this:

The initialPage would be a category page such as category-page-example. The crawler enters this page/URL and navigates through every page of the category (page 1, page 2, ... page n). Since the number of pages is unknown and could go as far as thousands of pages, I thought about a workflow for each internal category page. Is that still not a good idea?

The crawl-results-page-workflow works like a batching solution to group initialPages in a way that would be feasible. That was what I tried to achieve at least… haha

The final crawl-details-links / crawl-details flow would be: for every categoryUrl (initialPage), the extractLinksFromPage() activity will spawn a crawl-details-workflow child workflow that processes all the links found in that internal category page.