Scrapy implementation proposal
The crawler will be implemented in Scrapy. This page outlines the specification for such an implementation, which should cover the requirements outlined in: Core#scrapy_implementation
Note that this is only a proposal and, as such, it may not be fully implemented or it may become outdated. For more up-to-date information, please refer to:
The core will trigger crawl tasks, each of which translates to a spider run in Scrapy. Each crawl task (typically, for a single web entity) will be submitted by the core along with a list of input parameters (a.k.a. spider arguments), and will generate some output after the crawl finishes.
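As a rough sketch, these input parameters could reach the spider as standard Scrapy spider arguments; the argument names used below (start_urls, maxdepth, job_id) are illustrative assumptions, not part of the proposal.

```python
import scrapy


class PagesSpider(scrapy.Spider):
    """Sketch only: a spider run receiving its crawl-task parameters
    as spider arguments (argument names are illustrative)."""
    name = "pages"

    def __init__(self, start_urls="", maxdepth="3", job_id=None, *args, **kwargs):
        super(PagesSpider, self).__init__(*args, **kwargs)
        # Spider arguments arrive as strings when scheduled externally.
        self.start_urls = start_urls.split(",") if start_urls else []
        self.maxdepth = int(maxdepth)
        self.job_id = job_id

    def parse(self, response):
        # Page parsing and link-following logic is specified elsewhere.
        pass
```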
The spider will start crawling at the start URLs and save each page as a Page item (defined below), following all links that meet certain conditions (to be included in the crawler documentation).
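Since the Page item is not actually defined in this section, the following is only a placeholder sketch; the real field list belongs to the crawler documentation and may differ.

```python
import scrapy


class Page(scrapy.Item):
    """Placeholder sketch of the Page item; field names are assumptions."""
    url = scrapy.Field()
    status = scrapy.Field()     # HTTP response status
    depth = scrapy.Field()      # crawl depth at which the page was found
    timestamp = scrapy.Field()
    links = scrapy.Field()      # outgoing links found on the page
    body = scrapy.Field()       # full page content, dropped before queueing
```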
Scrapy will store the full page in the long-term page storage and push a reduced item (without the body field) into a queue that is consumed by the core (see the pipeline sketch below).
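A minimal item pipeline sketch of that split, assuming pagestore and queue wrappers like the ones sketched further down; this is not the actual implementation.

```python
import json


class PageStoragePipeline(object):
    """Sketch: store the full page in the pagestore, then push a reduced
    copy (without the body) onto the per-job queue consumed by the core."""

    def __init__(self, pagestore, queue):
        # `pagestore` and `queue` are assumed wrappers around the
        # Kyoto Cabinet databases described below.
        self.pagestore = pagestore
        self.queue = queue

    def process_item(self, item, spider):
        page = dict(item)
        self.pagestore.set(page["url"], json.dumps(page))
        reduced = dict((k, v) for k, v in page.items() if k != "body")
        self.queue.push(json.dumps(reduced))
        return item
```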
- The crawler will run on scrapyd
- The core will interact with the crawler through the scrapyd web service
- The scrapyd API is documented here
- The scrapyd API will be extended to support cancelling jobs and querying for pending, running and completed jobs (see the example at the end of this page)
- The scraped items will be stored in a key-value store known as the "pagestore", serialized in JSON format
- The scraped items will also be put in a queue (without the body field), again serialized in JSON format
- Both the queue and the pagestore will be implemented using Kyoto Cabinet, a modern fast DBM that provides mechanisms for implementing queues (see this page for more info); a sketch follows this list
- There will be one queue per crawl job, and a single global pagestore.
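A minimal sketch of such a per-job queue, assuming the kyotocabinet Python binding and one B+ tree file (*.kct) per crawl job; the class and file names are assumptions. The pagestore itself would be a plain key-value use of a single hash database (set/get by URL).

```python
from kyotocabinet import DB


class JobQueue(object):
    """Sketch of a per-job FIFO queue on a Kyoto Cabinet B+ tree file.
    Items are JSON strings keyed by a zero-padded counter, so the tree's
    natural key order gives FIFO behaviour."""

    def __init__(self, path):
        # One *.kct file per crawl job, e.g. "queue-<jobid>.kct" (assumed layout).
        self.db = DB()
        if not self.db.open(path, DB.OWRITER | DB.OCREATE):
            raise IOError(str(self.db.error()))
        self.seq = 0  # sketch only: a real queue would persist this counter

    def push(self, value):
        self.seq += 1
        self.db.set("%020d" % self.seq, value)

    def pop(self):
        cur = self.db.cursor()
        if not cur.jump():           # position on the oldest record
            return None              # queue is empty
        record = cur.get(False)      # (key, value) of the current record
        cur.remove()                 # consume it
        return record[1]
```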
Note that there won't be an input queue to the crawler. The input queue will be managed internally by scrapyd through its schedule.json API, and the core won't access it directly but will go through the scrapyd API instead (by calling schedule.json).
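To illustrate, the core could drive scrapyd entirely over HTTP along these lines (a sketch using the requests library; the project and spider names are assumptions, and cancel.json/listjobs.json correspond to the extensions proposed above):

```python
import requests

SCRAPYD = "http://localhost:6800"

# Submit a crawl task: extra POST fields become spider arguments.
resp = requests.post(SCRAPYD + "/schedule.json", data={
    "project": "crawler",                  # assumed project name
    "spider": "pages",                     # assumed spider name
    "start_urls": "http://example.com/",   # illustrative spider argument
}).json()
job_id = resp["jobid"]

# Query pending / running / finished jobs (proposed extension, listjobs.json).
jobs = requests.get(SCRAPYD + "/listjobs.json",
                    params={"project": "crawler"}).json()

# Cancel a running or pending job (proposed extension, cancel.json).
requests.post(SCRAPYD + "/cancel.json",
              data={"project": "crawler", "job": job_id})
```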