Crawling strategy is a cornerstone of Frontera

@sibiryakov released this on 25 Jul, 10:17

This is a major release containing many architectural changes. The goal of these changes is to make development and debugging of crawling strategies easier. From now on, there is an extensive guide in the documentation on how to write a custom crawling strategy, a single-process mode that makes it much easier to debug a crawling strategy locally, and the old distributed mode for production systems. Starting from this version, there is no requirement to set up Apache Kafka or HBase to experiment with crawling strategies on your local computer.
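
For a local experiment, a settings module along these lines should be enough. This is a minimal sketch: the backend path, the STRATEGY setting and the SQLite URL are assumptions based on Frontera's documented settings and may need adjusting for your version and setup.

    # settings.py -- a sketch of a single-process, local configuration.
    # The BACKEND and STRATEGY values are illustrative assumptions; check the
    # documentation of your Frontera version for the exact paths.
    BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'  # local RDBMS storage, no HBase
    SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'            # plain SQLite file on disk
    STRATEGY = 'myproject.strategy.MyCrawlingStrategy'            # your custom crawling strategy
    MAX_NEXT_REQUESTS = 256                                       # batch size handed to the fetcher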

We also removed unnecessary, rarely used features, namely the distributed spiders run mode and the prioritisation logic in backends, to make Frontera easier to use and understand.

Here is a (somewhat) full change log:

  • PyPy (2.7.*) support,
  • Redis backend (kudos to @khellan),
  • LRU cache and two cache generations for HBaseStates,
  • Discovery crawling strategy, respecting robots.txt and leveraging sitemaps to discover links faster,
  • Breadth-first and depth-first crawling strategies,
  • a new mandatory backend component: DomainMetadata,
  • a filter_links_extracted method in the crawling strategy API to optimise calls to backends for state data (see the sketch after this list),
  • create_request in the crawling strategy now uses FrontierManager middlewares,
  • support for running multiple batch generation instances,
  • support for the latest kafka-python,
  • statistics are sent to the message bus from all parts of Frontera,
  • overall reliability improvements,
  • settings for OverusedBuffer,
  • DBWorker was refactored and split into components (kudos to @vshlapakov),
  • seed addition can now be done using S3,
  • Python 3.7 compatibility.
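
To illustrate the new crawling strategy API, here is a rough sketch of a custom strategy. It assumes the frontera.strategy.BaseCrawlingStrategy base class and the method names listed above; exact signatures may differ between versions, and the scores and filtering rule are arbitrary examples.

    from frontera.strategy import BaseCrawlingStrategy  # import path assumed for 0.8


    class MyCrawlingStrategy(BaseCrawlingStrategy):
        """A minimal sketch of a custom crawling strategy."""

        def read_seeds(self, stream):
            # Seeds arrive as a file-like stream (which, per this release, can be read from S3).
            for url in stream:
                url = url.strip()
                if url:
                    # create_request passes the request through the FrontierManager middlewares.
                    request = self.create_request(url)
                    self.schedule(request, score=1.0)

        def filter_links_extracted(self, request, links):
            # Discard uninteresting links before state data is fetched from the
            # backend, saving the backend calls mentioned in the change log.
            return [link for link in links if not link.url.endswith('.jpg')]

        def links_extracted(self, request, links):
            for link in links:
                self.schedule(link, score=0.5)

        def page_crawled(self, response):
            pass  # e.g. update per-domain metadata or statistics

        def request_error(self, request, error):
            pass  # e.g. decide whether to retry or give up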