Hi All,
I’m implementing a web scraping solution with Go, Colly and Temporal. I’ve figured most of it out, but I have a question about handling custom errors and retries.
My basic approach is:
1) Scrape a main page (HTML + JSON) to retrieve API keys and a version number
2) Scrape a products API (REST/JSON) to retrieve product information (uses the keys and version number)
The trouble is that step 2) takes a long time, and the version number retrieved in step 1) can go out of date at any time if the website rolls it over. I can fetch the new version number by doing step 1) again, but I don’t want to do that unless I have to: the site has anti-scraping measures and will block me if I hit the main page too often.
So the Temporal setup is roughly like this:
func ScrapeMainPageActivity(ctx context.Context) (InitialData, error) {
	apiKey, versionNumber := ScrapeMainPage()
	initialData := InitialData{
		APIKey:        apiKey,
		VersionNumber: versionNumber,
	}
	return initialData, nil
}
func ScrapeWorkflow(ctx workflow.Context) error {
	// Scrape main page and retrieve initial information
	var initialData InitialData
	err := workflow.ExecuteActivity(ctx, ScrapeMainPageActivity).Get(ctx, &initialData)
	if err != nil {
		return err
	}
	// Now use the API keys to fetch products. Note the special error if the version is out of date.
	for _, category := range longListOfProductCategories {
		var searchResults SearchResults // result type of ScrapeAPIActivity, definition not shown
		err := workflow.ExecuteActivity(ctx, ScrapeAPIActivity, initialData, category).Get(ctx, &searchResults)
		if err != nil {
			// Activity errors reach the workflow wrapped, so match on the ApplicationError type name
			var appErr *temporal.ApplicationError
			if errors.As(err, &appErr) && appErr.Type() == "VersionOutOfDateError" {
				// ... need to execute ScrapeMainPageActivity again to get new version number.. how to handle?
			}
		}
	}
	return nil
}
So my question is: how would you handle this? I see a few options:
A) Just fail the whole workflow so it tries again from the start, including step 1. Not terrible, but how would I do this?
B) Build some sort of nested loop into the workflow, so that a VersionOutOfDateError breaks out of the inner loop and causes the main page scrape to run again. This feels wrong and un-Temporal.
C) Some other approach with child workflows or similar.
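To make option B concrete, here is a minimal sketch of the control flow using only the standard library, so it runs without a Temporal server. The helper `scrapeAll`, the `maxRefreshes` cap, the string field types on `InitialData` and the fake scrapers in `main` are all assumptions made up for illustration. In a real workflow the two function values would be `workflow.ExecuteActivity(...).Get(...)` calls, and the stale-version check would probably have to match on the `ApplicationError` type name rather than on a Go type, since custom error types are not preserved across the activity boundary.

```go
package main

import (
	"errors"
	"fmt"
)

// InitialData mirrors the struct from the post; string fields are an assumption.
type InitialData struct {
	APIKey        string
	VersionNumber string
}

// VersionOutOfDateError signals that the site rolled over its version number.
type VersionOutOfDateError struct{}

func (e *VersionOutOfDateError) Error() string { return "version out of date" }

// scrapeAll is a hypothetical helper. It walks the categories once and only
// re-runs scrapeMain when scrapeAPI reports a stale version. maxRefreshes caps
// the number of extra main-page scrapes so the site is not hit too often.
func scrapeAll(
	categories []string,
	scrapeMain func() (InitialData, error),
	scrapeAPI func(InitialData, string) error,
	maxRefreshes int,
) error {
	data, err := scrapeMain()
	if err != nil {
		return err
	}
	refreshes := 0
	for i := 0; i < len(categories); {
		err := scrapeAPI(data, categories[i])
		var stale *VersionOutOfDateError
		switch {
		case err == nil:
			i++ // category done, move on to the next one
		case errors.As(err, &stale):
			if refreshes >= maxRefreshes {
				return fmt.Errorf("gave up after %d refreshes: %w", refreshes, err)
			}
			refreshes++
			if data, err = scrapeMain(); err != nil {
				return err
			}
			// i is not advanced, so the same category is retried with fresh data
		default:
			return err // any other error is passed up unchanged
		}
	}
	return nil
}

func main() {
	version, calls := 0, 0
	// Fake main-page scrape: hands out a new version number on every call.
	scrapeMain := func() (InitialData, error) {
		version++
		return InitialData{APIKey: "key", VersionNumber: fmt.Sprint(version)}, nil
	}
	// Fake API scrape: the second call reports a stale version exactly once.
	scrapeAPI := func(d InitialData, category string) error {
		calls++
		if calls == 2 {
			return &VersionOutOfDateError{}
		}
		return nil
	}
	err := scrapeAll([]string{"a", "b", "c"}, scrapeMain, scrapeAPI, 3)
	fmt.Println(err, version, calls) // prints: <nil> 2 4
}
```

Written this way it is a single loop rather than a nested one: categories that already succeeded are not scraped again, and the main page is only fetched when the API actually reports a stale version.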
Any advice appreciated, I’m new to Temporal.