Hi,

Inside the core project, in edu.uci.ics.crawler4j.crawler.CrawlController at line 443: the addSeed()-method declares throwing the checked exception IOException, but that exception can never actually be thrown, so the declaration can be removed. (Low priority, as the addSeeds()-method is probably used more often.)

More context, should you be interested: I was trying to detect/construct extra links to crawl inside the visit()-method of a custom WebCrawler class (lots of JavaScript frameworks take over the navigation of anchors with an href, so those links are not automatically detected as outgoing links). At first I found myself duplicating the logic that sets the correct WebURL values and then scheduling the URLs myself (a sketch of that pattern follows this post). Later I found that I might also just call the addSeeds()-method; I still need to test this, so I could be wrong.

The use case "add extra links to crawl while parsing a page inside a custom WebCrawler" seems interesting enough to add to the documentation?

Thanks for the continued effort on this nice framework, and I hope this issue is helpful.

Best regards,
Bram
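For illustration, a minimal sketch of that manual pattern: populating a WebURL by hand inside visit() and scheduling it yourself. The class name, the link-extraction regex, and the use of getMyController().getFrontier().scheduleAll(...) are assumptions for the example (not taken from the issue), and the sketch skips the docid and robots.txt bookkeeping that crawler4j normally performs for regular outgoing links, which is exactly why duplicating this logic gets awkward.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class JsAwareCrawler extends WebCrawler {

    // Illustrative pattern: pull absolute URLs out of inline JavaScript.
    private static final Pattern JS_LINK = Pattern.compile("https?://[^\"'\\s)]+");

    @Override
    public void visit(Page page) {
        if (!(page.getParseData() instanceof HtmlParseData)) {
            return;
        }
        HtmlParseData html = (HtmlParseData) page.getParseData();
        WebURL current = page.getWebURL();

        List<WebURL> extra = new ArrayList<>();
        Matcher m = JS_LINK.matcher(html.getHtml());
        while (m.find()) {
            // Duplicate the bookkeeping crawler4j does for ordinary <a href> links.
            WebURL link = new WebURL();
            link.setURL(m.group());
            link.setParentDocid(current.getDocid());
            link.setParentUrl(current.getURL());
            link.setDepth((short) (current.getDepth() + 1));
            if (shouldVisit(page, link)) {
                extra.add(link);
            }
        }

        // Hand the constructed URLs to the frontier so the crawl picks them up.
        getMyController().getFrontier().scheduleAll(extra);
    }
}
```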
Thanks for the issue. You are right: the IOException doesn't make sense. It will be fixed in the next version.
AFAIK, you should be able to use addSeed(...) on the fly. It basically does a scheduleAll(...) but performs some additional steps (such as robots.txt checking).
If you like, you can open a PR with a documentation enhancement. Happy that someone is using our fork!
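A minimal sketch of that "on the fly" route, calling addSeed(...) from inside visit(). The class name and regex are again illustrative; since some versions declare checked exceptions on addSeed(...) (including the IOException discussed above), the sketch simply catches and logs anything it throws:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class JsAwareSeedingCrawler extends WebCrawler {

    private static final Logger LOG = LoggerFactory.getLogger(JsAwareSeedingCrawler.class);

    // Illustrative pattern: pull absolute URLs out of inline JavaScript.
    private static final Pattern JS_LINK = Pattern.compile("https?://[^\"'\\s)]+");

    @Override
    public void visit(Page page) {
        if (!(page.getParseData() instanceof HtmlParseData)) {
            return;
        }
        String html = ((HtmlParseData) page.getParseData()).getHtml();
        Matcher m = JS_LINK.matcher(html);
        while (m.find()) {
            try {
                // addSeed() handles robots.txt checking and scheduling in one call.
                getMyController().addSeed(m.group());
            } catch (Exception e) {
                LOG.warn("Could not add extra seed {}", m.group(), e);
            }
        }
    }
}
```

Compared to the manual sketch above, this keeps all of the WebURL bookkeeping inside the framework.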