edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(String) should omit "throws IOException" #61

brbog · 2022-04-25T15:26:54Z

Hi,

Inside the core project, at line 443 of edu.uci.ics.crawler4j.crawler.CrawlController:
The addSeed() method declares the checked exception IOException, but the method never throws it, so the throws clause can be removed.
(Low priority, as the addSeeds() method is probably used more.)
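
To illustrate why the unused throws clause matters to callers, here is a minimal sketch. FakeController and both of its methods are hypothetical stand-ins written for this example; they are not the real crawler4j classes or signatures.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SeedDemo {
    // Hypothetical stand-in for CrawlController, not the real crawler4j class.
    static class FakeController {
        final List<String> seeds = new ArrayList<>();

        // Declares IOException although nothing in the body can throw it.
        void addSeedDeclared(String url) throws IOException {
            seeds.add(url);
        }

        // Same behavior without the unnecessary throws clause.
        void addSeed(String url) {
            seeds.add(url);
        }
    }

    static List<String> run() {
        FakeController controller = new FakeController();
        try {
            // The caller is forced to handle an exception that cannot occur.
            controller.addSeedDeclared("https://example.com/a");
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        // Without the throws clause the call needs no try/catch.
        controller.addSeed("https://example.com/b");
        return controller.seeds;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Both calls do the same work; only the first one drags a try/catch block into the calling code.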

More context should you be interested:
I was trying to detect or construct extra links to crawl inside the visit() method of a custom WebCrawler class. Many JavaScript frameworks take over the navigation of anchors with an href, so these links are not automatically detected as outgoing links. At first I found myself duplicating the logic that sets the correct WebURL values and then scheduling the URLs through:

getMyController().getFrontier().scheduleAll(toSchedule);

Later I found that I could probably call the addSeeds() method instead. I still need to test this, so I may be wrong.

Would the use case "add extra links to crawl while parsing a page inside a custom WebCrawler" be interesting enough to add to the documentation?

Thanks for the continued effort on this nice framework. I hope this issue is helpful.

Best regards,
Bram


rzo1 · 2022-04-25T16:55:44Z

Thanks for the issue. You are right. The IOException doesn't make sense. It will be fixed in the next version.

AFAIK, you should be able to use addSeed(...) on the fly. It basically does a scheduleAll(...) but performs some additional things (such as robots checking).
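
As a rough sketch of that difference, the following models a frontier that only queues URLs and a controller whose addSeed checks a robots rule before scheduling. Frontier, Controller, the robotsAllows predicate, and the example URLs are all hypothetical stand-ins for illustration; the real crawler4j classes have different signatures and do more work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OnTheFlySeeds {
    // Hypothetical stand-in for the frontier: it only queues URLs.
    static class Frontier {
        final List<String> queue = new ArrayList<>();

        void scheduleAll(List<String> urls) {
            queue.addAll(urls);
        }
    }

    // Hypothetical stand-in for the controller.
    static class Controller {
        final Frontier frontier = new Frontier();
        final Predicate<String> robotsAllows;

        Controller(Predicate<String> robotsAllows) {
            this.robotsAllows = robotsAllows;
        }

        // Checks the robots rule first, then schedules the URL.
        void addSeed(String url) {
            if (robotsAllows.test(url)) {
                frontier.scheduleAll(List.of(url));
            }
        }
    }

    static List<String> run() {
        // Assumed robots rule for this example: anything under /private/ is disallowed.
        Controller controller = new Controller(url -> !url.contains("/private/"));
        // Links a visit() method might construct by hand for JavaScript-driven navigation.
        List<String> extraLinks = List.of(
                "https://example.com/page",
                "https://example.com/private/admin");
        for (String link : extraLinks) {
            controller.addSeed(link);
        }
        return controller.frontier.queue;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Calling scheduleAll directly would have queued both links; going through addSeed filters out the disallowed one.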

If you like, you can open a PR with a documentation enhancement. Happy that someone is using our fork!

rzo1 closed this as completed in 244c2c8 Apr 25, 2022

rzo1 self-assigned this Apr 25, 2022

rzo1 added the enhancement label Apr 25, 2022

rzo1 added this to the v4.8.4 milestone Apr 25, 2022
