edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(String) should omit "throws IOException" #61

Closed
brbog opened this issue Apr 25, 2022 · 1 comment

brbog commented Apr 25, 2022

Hi,

In the core project, edu.uci.ics.crawler4j.crawler.CrawlController, line 443:
The addSeed() method declares the checked exception IOException, but its body can never throw it, so the throws clause can be removed.
(Low priority, as the addSeeds() method is probably used more often.)
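
To illustrate why an unneeded throws clause is a nuisance for callers, here is a minimal, self-contained sketch. `SeedRegistry` and both of its methods are hypothetical stand-ins written for this example; they are not crawler4j's `CrawlController` and only mimic the shape of the problem.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SeedDemo {

    // Hypothetical stand-in for a controller; NOT crawler4j's CrawlController.
    static class SeedRegistry {
        private final List<String> seeds = new ArrayList<>();

        // Declares a checked exception that the body can never throw.
        void addSeedDeclaringIo(String url) throws IOException {
            seeds.add(url);
        }

        // Same behaviour without the unnecessary throws clause.
        void addSeed(String url) {
            seeds.add(url);
        }

        List<String> seeds() {
            return seeds;
        }
    }

    static List<String> run() {
        SeedRegistry registry = new SeedRegistry();

        // The compiler forces the caller to handle IOException here,
        // even though this branch can never be reached.
        try {
            registry.addSeedDeclaringIo("https://example.com/a");
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }

        // Without the throws clause the call needs no ceremony.
        registry.addSeed("https://example.com/b");
        return registry.seeds();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

One thing to keep in mind: dropping the clause is not fully transparent to existing callers. Java rejects a `catch (IOException e)` block when nothing in the `try` body can throw that exception, so callers that wrapped only this call would have to remove their handler.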

More context should you be interested:
I was trying to detect or construct extra links to crawl inside the visit() method of a custom WebCrawler class (many JavaScript frameworks take over the navigation of anchors with href, so these links are not automatically detected as outgoing links). At first I found myself duplicating the logic that sets the correct WebURL values and then scheduling the URLs through:

getMyController().getFrontier().scheduleAll(toSchedule);

Later I found that I might be able to call the addSeeds() method instead. I still need to test this, so I could be wrong.

Would the use case "add extra links to crawl while parsing a page inside a custom WebCrawler" be interesting enough to add to the documentation?

Thanks for the continued effort on this nice framework. I hope this issue is helpful.

Best regards,
Bram

@rzo1 rzo1 closed this as completed in 244c2c8 Apr 25, 2022
Collaborator

rzo1 commented Apr 25, 2022

Thanks for the issue. You are right. The IOException doesn't make sense. It will be fixed in the next version.

AFAIK, you should be able to use addSeed(...) on the fly. It basically does a scheduleAll(...) but performs some additional things (such as robots checking).
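
The relationship described above can be sketched with a small, self-contained model. Everything here is a hypothetical stand-in: `Frontier`, `Controller`, and the `robotsAllows` predicate only model the statement that addSeed(...) schedules a URL after extra checks, and they do not reproduce crawler4j's actual classes or its robots.txt handling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

class AddSeedModel {

    // Hypothetical stand-in for the frontier: it only records scheduled URLs.
    static class Frontier {
        final List<String> scheduled = new ArrayList<>();

        void scheduleAll(List<String> urls) {
            scheduled.addAll(urls);
        }
    }

    // Hypothetical stand-in for the controller.
    static class Controller {
        private final Frontier frontier;
        // Assumed placeholder for a robots check; the real check is not modelled.
        private final Predicate<String> robotsAllows;

        Controller(Frontier frontier, Predicate<String> robotsAllows) {
            this.frontier = frontier;
            this.robotsAllows = robotsAllows;
        }

        // Runs the extra check first, then delegates to scheduleAll.
        // Returns true if the URL was scheduled.
        boolean addSeed(String url) {
            if (!robotsAllows.test(url)) {
                return false;
            }
            frontier.scheduleAll(Collections.singletonList(url));
            return true;
        }
    }

    static List<String> run() {
        Frontier frontier = new Frontier();
        // Illustrative rule: treat anything under /private/ as disallowed.
        Controller controller =
                new Controller(frontier, url -> !url.contains("/private/"));

        controller.addSeed("https://example.com/page");
        controller.addSeed("https://example.com/private/admin");
        return frontier.scheduled;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model only the first URL reaches the frontier, whereas calling `scheduleAll` directly would have scheduled both. That is the practical reason to prefer the higher-level call when adding links on the fly, assuming the real method behaves as described in this comment.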

If you like, you can open a PR with a documentation enhancement. Happy that someone is using our fork!

@rzo1 rzo1 self-assigned this Apr 25, 2022
@rzo1 rzo1 added this to the v4.8.4 milestone Apr 25, 2022