Web Scraping use case: Handling Custom API Errors and retry

Hi All,

I’m implementing a web scraping solution with Go, Colly, and Temporal. I’ve figured most of it out but have a question about handling custom errors and retries.

My basic approach is:

  1. Scrape a main page (HTML + JSON) to retrieve API keys and a version number
  2. Scrape a products API (REST/JSON) to retrieve product information (uses the keys and version number)

The trouble is that step 2 takes a long time, and the version number retrieved in step 1 can go out of date at any time if the website rolls it over. I can fetch the new version number by repeating step 1, but I don’t want to do that unless I have to, because the site has anti-scraping measures and will block me if I hit the main page too often.

So the Temporal setup is roughly like this:

import (
	"context"
	"errors"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func ScrapeMainPageActivity(ctx context.Context) (InitialData, error) {
	apiKey, versionNumber := ScrapeMainPage()
	initialData := InitialData{
		APIKey:        apiKey,
		VersionNumber: versionNumber,
	}
	return initialData, nil
}

func ScrapeWorkflow(ctx workflow.Context) error {
	// Scrape main page and retrieve initial information
	var initialData InitialData
	err := workflow.ExecuteActivity(ctx, ScrapeMainPageActivity).Get(ctx, &initialData)
	if err != nil {
		return err
	}

	// Now use the API keys to do product fetching. Note the special error if the version is out of date.
	for _, category := range longListOfProductCategories {
		var searchResults SearchResults // result type assumed; not shown in this post
		err := workflow.ExecuteActivity(ctx, ScrapeAPIActivity, initialData, category).Get(ctx, &searchResults)
		if err != nil {
			// Activity errors reach the workflow as *temporal.ApplicationError,
			// so match on the error's type name rather than with errors.Is.
			var appErr *temporal.ApplicationError
			if errors.As(err, &appErr) && appErr.Type() == "VersionOutOfDateError" {
				// ... need to execute ScrapeMainPageActivity again to get new version number.. how to handle?
			}
		}
	}
	return nil
}

So my question is: how would you handle this? I see a few options:

A) Just fail the whole workflow so it tries again from the start, including step 1. Not terrible. How would I do this?

B) Build some sort of nested while loop into the workflow so that the VersionOutOfDateError busts out of the inner loop, causing the main page scrape to run again. This feels wrong and non-temporally.

C) Some other approach with child workflows or similar.

Any advice appreciated, I’m new to Temporal.

OK, so if I’m understanding you correctly, first you get the “initial data” which includes the version number, and then get product information for each category in a list of categories. An attempt to get product information may fail if the version number has changed.

If getting the product information for a category fails, what do you want to do next? Do you want to get the initial data again and then continue where you left off? Or go back to the beginning? If you try again and the version number changes yet again, do you want to keep trying? Do you want to try a maximum number of times? Do you want to add some kind of delay between attempts?

Whatever steps you want the workflow to take, you can simply… program the workflow to do that. You say that having a nested while loop “feels wrong and non-temporally”, but there’s nothing wrong with having a nested loop in a workflow. If success means obtaining the product information for all the categories with the same version number, then it would be natural to have nested loops: the outer loop to try again if the version number changed, and the inner loop to iterate through the categories.

Thanks @awwx, that’s helpful.