SimZal/crawler

Crawl links on a website

THIS IS A FORK OF THE SPATIE CRAWLER. IT ADDS A CALLBACK FUNCTION TO RECEIVE ALL THE LINKS ON THE CRAWLED PAGE.

This package provides a class to crawl links on a website.

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an instance of a class that implements the \Spatie\Crawler\CrawlObserver interface:

/**
 * Called when the crawler will crawl the given url.
 *
 * @param \Spatie\Crawler\Url $url
 */
public function willCrawl(Url $url);

/**
 * Called when the crawler has crawled the given url.
 *
 * @param \Spatie\Crawler\Url       $url
 * @param \Psr\Http\Message\ResponseInterface $response
 */
public function hasBeenCrawled(Url $url, ResponseInterface $response);

/**
 * Called when the crawler has found links on the page.
 *
 * @param \SimZal\Crawler\Url            $url
 * @param \Illuminate\Support\Collection $links
 */
public function foundLinks(Url $url, $links);

/**
 * Called when the crawl has ended.
 */
public function finishedCrawling();
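
As an illustration of the interface above, here is a minimal sketch of an observer that logs each page and collects every link reported through foundLinks. The class name LinkCollectingObserver and the start URL are hypothetical. The sketch assumes the classes live in the Spatie\Crawler namespace as in the usage snippet (the fork may expose them under SimZal\Crawler instead, as the foundLinks docblock suggests), and that both Url objects and the entries of $links can be cast to strings.

```php
<?php

use Psr\Http\Message\ResponseInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObserver;
use Spatie\Crawler\Url;

// Hypothetical observer, not part of the package.
class LinkCollectingObserver implements CrawlObserver
{
    /** @var array Unique links seen so far, keyed by their string form. */
    private $links = [];

    public function willCrawl(Url $url)
    {
        // Assumption: Url can be cast to a string.
        echo "Crawling: {$url}\n";
    }

    public function hasBeenCrawled(Url $url, ResponseInterface $response)
    {
        echo "{$response->getStatusCode()} {$url}\n";
    }

    public function foundLinks(Url $url, $links)
    {
        // Assumption: each entry in $links is a string or string-castable.
        foreach ($links as $link) {
            $this->links[(string) $link] = true;
        }
    }

    public function finishedCrawling()
    {
        echo count($this->links) . " unique links found\n";
    }
}

Crawler::create()
    ->setCrawlObserver(new LinkCollectingObserver())
    ->startCrawling('https://example.com');
```

Keying the array by the link's string form is one simple way to deduplicate links that appear on several pages.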

Filtering certain URLs

You can tell the crawler not to visit certain URLs by using the setCrawlProfile function. That function expects an object that implements the Spatie\Crawler\CrawlProfile interface:

/**
 * Determine if the given url should be crawled.
 *
 * @param \Spatie\Crawler\Url $url
 *
 * @return bool
 */
public function shouldCrawl(Url $url);
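
A crawl profile could, for example, restrict the crawl to a single host. The following is a sketch only: SameHostProfile and crawlSameHost are hypothetical names, the Spatie\Crawler namespace is assumed to match the usage snippet (the fork may use SimZal\Crawler), and Url is assumed to be castable to a string.

```php
<?php

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObserver;
use Spatie\Crawler\CrawlProfile;
use Spatie\Crawler\Url;

// Hypothetical profile, not part of the package.
class SameHostProfile implements CrawlProfile
{
    /** @var string Lower-cased host name the crawl is limited to. */
    private $host;

    public function __construct($host)
    {
        $this->host = strtolower($host);
    }

    public function shouldCrawl(Url $url)
    {
        // Assumption: Url can be cast to a string. parse_url() returns
        // null or false when no host is present, which never matches.
        $host = parse_url((string) $url, PHP_URL_HOST);

        return is_string($host) && strtolower($host) === $this->host;
    }
}

// Hypothetical helper that wires the profile into a crawl.
function crawlSameHost(CrawlObserver $observer, $startUrl)
{
    $crawler = Crawler::create();
    $crawler->setCrawlProfile(new SameHostProfile(parse_url($startUrl, PHP_URL_HOST)));
    $crawler->setCrawlObserver($observer);
    $crawler->startCrawling($startUrl);
}
```

The setters are called one at a time here, so the sketch does not depend on setCrawlProfile returning the crawler for chaining.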

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security

If you discover any security-related issues, please email [email protected] instead of using the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.