A simple, flexible crawler, written in PHP.
This little project is easily installable and enables you to
- easily retrieve data from an HTML or XML file, both local and remote;
- crawl multiple pages efficiently (can handle pagination);
- not worry about DOM traversal at all.
Interesting elements on a page can be located through XPath queries. On top of that, siflawler supports basic CSS selectors, with an extension that enables retrieving attribute values. This way, querying a page is easy even if you do not know XPath.
To be able to run siflawler, you will need to have the PHP cURL extension installed. This is what siflawler uses to download pages from the website(s) you want to crawl; among other things, it enables siflawler to download pages in parallel. Please open an issue if this is a problem for you and you would like to see support for plain PHP file_get_contents added.
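If you are unsure whether the extension is available on your system, a quick check could look as follows (a minimal sketch; extension_loaded is standard PHP).

<?php
// Verify that the cURL extension siflawler depends on is loaded.
if (extension_loaded('curl')) {
    echo 'cURL is available; siflawler should work.', PHP_EOL;
} else {
    echo 'cURL is missing; install or enable the php-curl extension first.', PHP_EOL;
}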
You can install siflawler using Composer. Either run

composer require 'caster/siflawler:~1.2.1'

or put the following in your composer.json and run composer install.
{
"require": {
"caster/siflawler": "~1.2.1"
}
}
You are now set up to start crawling.
$config = 'config.json'; // a JSON string, path to a JSON file, object, or associative array
$crawler = new \siflawler\Crawler($config);
$data = $crawler->crawl();
You may want to crawl from the command line. In that case, simply run your file as follows. siflawler will output what it is doing if you have set the verbose option to true (which it is by default).
$ php -f your-file.php
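Putting it all together, a minimal script might look as follows. This is a sketch: the URL, selector and keys are made-up placeholders, not something siflawler ships with.

<?php
require 'vendor/autoload.php';

// Hypothetical config: crawl one page and extract a title and link per article.
$config = array(
    'start' => 'http://example.com/articles', // placeholder URL
    'find' => 'div.article',                  // placeholder CSS selector
    'get' => array(
        'title' => 'h2$text()',               // placeholder queries
        'link' => 'a$attr("href")'
    )
);

$crawler = new \siflawler\Crawler($config);
$data = $crawler->crawl();
var_dump($data); // array of stdClass objects with title and link properties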
These are the options you can pass to \siflawler\Crawler when constructing it.
The following options are mandatory; siflawler will throw an exception if you forget to pass any of them, since it simply needs to know what to do. To see what you can pass as selectors/queries, refer to the Querying section.
{
"start": "some url you want to start crawling at",
"find": "some CSS selector or XPath query to select interesting elements",
"get": {
"key-1": "a CSS selector or an XPath query to select some value within a found element",
"key-2": "another XPath query, et cetera"
}
}
The start option is simply the URL at which siflawler will (first) look for an HTML page to crawl and get data from. This may also be an absolute path to some file on your local disk. It can even be an array of URLs, paths, or a mix of those two.
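For example, a mixed start value could look like this (the URL and path are placeholders):

"start": [
    "http://example.com/index.html",
    "/home/user/pages/local.html"
]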
The find option can be used to specify how siflawler should locate interesting elements on a page once it has been retrieved. For each interesting element, an object (stdClass) will be created that has the properties specified in the get option. Each key in the get option object maps to a query indicating what to put in that key of the resulting stdClass object.
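As an illustration (the selectors and keys are made up), a config containing

"find": "div.quote",
"get": {
    "text": "span.text$text()",
    "author": "small.author$text()"
}

would yield one stdClass object per div.quote element, each with a text and an author property.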
If you want to crawl multiple pages, use the next
option (see next section).
You can use the following optional options, which are all fairly self-explanatory. The values below are the defaults; you only need to include an option if you want to use a different value or want to be explicit.
{
"max_requests": 0,
"next": null,
"timeout": 0,
"verbose": true,
"warnings": true
}
The max_requests
option can be used to limit the number of pages siflawler will
request in total. A value of 0 or less means that there is no limit.
The next option can be used to find one or more URLs to crawl next. If this is null, no next page will be crawled, but you can specify a query to find one or more locations to go to next. This is useful when you want to crawl data that is split over multiple pages using pagination, as in the example below.
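For example, the following hypothetical config follows a "next page" link until there are no more pages or ten requests have been made:

{
    "start": "http://example.com/page/1",
    "find": "div.item",
    "get": {
        "name": "h2$text()"
    },
    "next": "a.next-page$attr(\"href\")",
    "max_requests": 10
}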
The timeout
option can be used to specify a timeout in seconds for each request.
A value of 0 means that there will be no timeout.
The verbose and warnings options can be used to toggle siflawler output.
Everywhere you can specify a query to find elements or attributes of elements, you can do two things: either specify an XPath query, or specify a CSS selector. CSS can normally only select nodes, but siflawler can understand some additional syntax that will allow you to select attribute values. Examples are:
a.nav$attr("href")
p#some-id$text()
Internally, siflawler will translate CSS selectors to XPath queries. If you want to be sure that this cannot go wrong, you should use XPath, but siflawler's CSS support is pretty good and can always be improved if you create an issue 🙂
To distinguish between CSS and XPath, siflawler uses a heuristic. If you want to be sure that this does not go wrong, you can specify a query as css:[your CSS selector] or xpath:[your XPath query] to let siflawler know precisely which query language you are using.
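For instance, the two queries below select roughly the same thing; the second is a hand-written XPath approximation of the first, not necessarily siflawler's exact translation:

css:a.nav$attr("href")
xpath://a[contains(concat(' ', normalize-space(@class), ' '), ' nav ')]/@href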
This project uses phpunit for automated unit testing. You can easily run the tests by executing composer test. For that to work, you do need to install the dev version of siflawler.
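Concretely, that could look as follows; the repository URL here is an assumption based on the Composer package name.

$ git clone https://github.com/Caster/siflawler.git
$ cd siflawler
$ composer install
$ composer test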
If you miss something in siflawler, have found a problem, or have something really cool to add to it, feel free to open an issue or pull request on GitHub. I will try to respond as quickly as possible.
siflawler - a simple, flexible crawler, written in PHP.
Copyright (C) 2015 Thom Castermans
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.