Web Scraping use case: Handling Custom API Errors and retry

JavascriptMick · May 26, 2024, 3:15am

Hi All,

I’m implementing a web scraping solution with Go, Colly, and Temporal. I’ve figured most of it out but have a question about handling custom errors and retries.

My basic approach is:

1) Scrape a main page (HTML + JSON) to retrieve API keys and a version number
2) Scrape a products API (REST/JSON) to retrieve product information (uses the keys and version number)

The trouble is that step 2 takes a long time, and the version number retrieved in step 1 can become out of date at any time if the website rolls it over. I can fetch the new version number by repeating step 1, but I don’t want to do that unless I have to, because the site has anti-scraping measures and will block me if I hit the main page too often.

So the Temporal setup is roughly like this:

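import (
	"context"
	"errors"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func ScrapeMainPageActivity(ctx context.Context) (InitialData, error) {
	apiKey, versionNumber := ScrapeMainPage()
	initialData := InitialData{
		APIKey:        apiKey,
		VersionNumber: versionNumber,
	}
	return initialData, nil
}

func ScrapeWorkflow(ctx workflow.Context) error {
	// Scrape the main page and retrieve the initial information.
	var initialData InitialData
	err := workflow.ExecuteActivity(ctx, ScrapeMainPageActivity).Get(ctx, &initialData)
	if err != nil {
		return err
	}

	// Now use the API keys to fetch products. Note the special error if the version is out of date.
	for _, category := range longListOfProductCategories {
		var searchResults SearchResults // result type is not shown in the post; the name is a placeholder
		err := workflow.ExecuteActivity(ctx, ScrapeAPIActivity, initialData, category).Get(ctx, &searchResults)
		if err != nil {
			// Activity errors reach the workflow wrapped as application errors, so match on the
			// error type name (assuming the activity returns a VersionOutOfDateError) rather than
			// calling errors.Is against a fresh &VersionOutOfDateError{}.
			var appErr *temporal.ApplicationError
			if errors.As(err, &appErr) && appErr.Type() == "VersionOutOfDateError" {
				// ... need to execute ScrapeMainPageActivity again to get a new version number. How to handle this?
			}
		}
	}
	return nil
}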

So my question is, how would you handle this? I see a couple of options:

A) Just fail the whole workflow so it tries again from the start including Step 1… not terrible … How would I do this?

B) Build some sort of nested while loop into the workflow so that the VersionOutOfDateError busts out of the inner loop, causing the main page scrape to go again… feels wrong and non-temporally

C) Some other approach with child workflows or similar

Any advice appreciated; I’m new to Temporal.

awwx · May 26, 2024, 5:48am

OK, so if I’m understanding you correctly, first you get the “initial data” which includes the version number, and then get product information for each category in a list of categories. An attempt to get product information may fail if the version number has changed.

If getting the product information for a category fails, what do you want to do next? Do you want to get the initial data again, and then continue where you left off? Or go back to the beginning? If you try again and the version number changes yet again, do you want to keep trying? Do you want to try a maximum number of times? Do you want to add some kind of delay between attempts?

Whatever steps you want the workflow to take, you can simply… program the workflow to do that. You say that having a nested while loop “feels wrong and non-temporally”, but there’s nothing wrong with having a nested loop in a workflow. If success means obtaining the product information for all the categories with the same version number, then it would be natural to have nested loops: the outer loop to try again if the version number changed, and the inner loop to iterate through the categories.
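
To make the nested-loop shape concrete, here is a minimal sketch of the control flow only, with the Temporal SDK calls stripped out so that it runs standalone. Everything in it is an assumption for illustration: `scrapeAll`, `fetchInitial`, `fetchCategory`, `ErrVersionOutOfDate`, the `maxAttempts` cap, and the integer version number are hypothetical stand-ins, not names from the original post or the SDK. It takes the "go back to the beginning" choice: a version change discards the partial results and restarts the category list.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrVersionOutOfDate is a hypothetical stand-in for the poster's VersionOutOfDateError.
var ErrVersionOutOfDate = errors.New("version out of date")

// InitialData mirrors the struct in the question; the int version is an assumption.
type InitialData struct {
	APIKey        string
	VersionNumber int
}

// scrapeAll runs the nested loops: the outer loop refetches the initial data,
// the inner loop walks the categories. fetchInitial and fetchCategory stand in
// for the two activities.
func scrapeAll(
	categories []string,
	maxAttempts int,
	fetchInitial func() (InitialData, error),
	fetchCategory func(InitialData, string) (string, error),
) (map[string]string, error) {
outer:
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		data, err := fetchInitial()
		if err != nil {
			return nil, err
		}
		results := make(map[string]string, len(categories))
		for _, category := range categories {
			r, err := fetchCategory(data, category)
			if errors.Is(err, ErrVersionOutOfDate) {
				// Version rolled over: drop partial results and start again.
				continue outer
			}
			if err != nil {
				return nil, err
			}
			results[category] = r
		}
		// Every category succeeded with the same version number.
		return results, nil
	}
	return nil, fmt.Errorf("version kept changing after %d attempts", maxAttempts)
}

func main() {
	siteVersion, calls := 1, 0
	fetchInitial := func() (InitialData, error) {
		return InitialData{APIKey: "demo-key", VersionNumber: siteVersion}, nil
	}
	fetchCategory := func(d InitialData, category string) (string, error) {
		calls++
		if calls == 2 {
			siteVersion++ // simulate the site rolling the version over mid-run
		}
		if d.VersionNumber != siteVersion {
			return "", ErrVersionOutOfDate
		}
		return fmt.Sprintf("%s@v%d", category, d.VersionNumber), nil
	}
	results, err := scrapeAll([]string{"books", "games"}, 3, fetchInitial, fetchCategory)
	fmt.Println(results, err)
}
```

In an actual workflow, each function call would presumably become a `workflow.ExecuteActivity(...).Get(...)` call, a delay between attempts could use `workflow.Sleep`, and the out-of-date check would likely inspect the application error's type rather than use `errors.Is`, since custom Go error types are not preserved across the activity boundary.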

JavascriptMick · May 26, 2024, 1:48pm

Thanks @awwx, that’s helpful.
