
Releases: scrapinghub/frontera

Bug fix release

05 Apr 16:36
fd294e0

Some bugs were fixed by updating dependencies.

Examples and documentation updates

30 Jul 14:55
  • the general-spider example is fixed,
  • SW crashes with ZeroMQ are fixed (stats output is wiped),
  • documentation updates.

Crawling strategy is a cornerstone of Frontera

25 Jul 10:17

This is a major release containing many architectural changes. The goal of these changes is to make development and debugging of crawling strategies easier. From now on, there is an extensive guide in the documentation on how to write a custom crawling strategy, a single-process mode that makes it much easier to debug a crawling strategy locally, and the old distributed mode for production systems. Starting from this version there is no need to set up Apache Kafka or HBase to experiment with crawling strategies on your local computer.

We also removed unnecessary, rarely used features (the distributed spiders run mode and the prioritisation logic in backends) to make Frontera easier to use and understand.

Here is a (somewhat) full change log:

  • PyPy (2.7.*) support,
  • Redis backend (kudos to @khellan),
  • LRU cache and two cache generations for HBaseStates,
  • Discovery crawling strategy, respecting robots.txt and leveraging sitemaps to discover links faster,
  • Breadth-first and depth-first crawling strategies,
  • new mandatory component in backend: DomainMetadata,
  • filter_links_extracted method in crawling strategy API to optimise calls to backends for state data,
  • create_request in crawling strategy is now using FronteraManager middlewares,
  • support for running many batch generation instances,
  • support of latest kafka-python,
  • statistics are sent to message bus from all parts of Frontera,
  • overall reliability improvements,
  • settings for OverusedBuffer,
  • DBWorker was refactored and divided into components (kudos to @vshlapakov),
  • seed addition can now be done using S3,
  • Python 3.7 compatibility.
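Several of the items above concern state caching. The release notes don't show the HBaseStates implementation, but the two-generation LRU idea can be sketched in plain Python (the class and method names here are hypothetical, not Frontera's API):

```python
from collections import OrderedDict

class TwoGenerationCache:
    """Illustrative two-generation LRU cache: hot entries live in the
    'young' generation; on overflow the least recently used young entry
    is demoted to the 'old' generation, which evicts outright."""

    def __init__(self, young_size, old_size):
        self.young = OrderedDict()
        self.old = OrderedDict()
        self.young_size = young_size
        self.old_size = old_size

    def get(self, key):
        if key in self.young:
            self.young.move_to_end(key)       # refresh recency
            return self.young[key]
        if key in self.old:
            value = self.old.pop(key)         # promote back to young
            self.put(key, value)
            return value
        return None

    def put(self, key, value):
        self.young[key] = value
        self.young.move_to_end(key)
        if len(self.young) > self.young_size:
            old_key, old_value = self.young.popitem(last=False)
            self.old[old_key] = old_value     # demote, don't discard yet
            if len(self.old) > self.old_size:
                self.old.popitem(last=False)  # final eviction
```

The point of the second generation is that entries evicted from the hot set get one more chance before a costly backend re-read.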

Codecs now serialize string types, and other improvements

09 Feb 14:26

Thanks to @voith, a problem introduced when Python 3 support began, when Frontera supported only keys and values stored as bytes in .meta fields, is now solved. Many Scrapy middlewares weren't working, or were working incorrectly. This is still not thoroughly tested, so please report any bugs.

Other improvements include:

  • batched states refresh in crawling strategy,
  • proper access to redirects in Scrapy converters,
  • more readable and simple OverusedBuffer implementation,
  • examples, tests and docs fixes.
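The "batched states refresh" item means that instead of one backend round trip per link, fingerprints are grouped and fetched in batches. A minimal sketch of that pattern; the backend interface (`fetch_states`) and the fake backend below are hypothetical, for illustration only:

```python
def refresh_states_batched(backend, fingerprints, batch_size=1024):
    """Return a {fingerprint: state} map, querying the backend in batches
    instead of once per fingerprint."""
    states = {}
    for i in range(0, len(fingerprints), batch_size):
        states.update(backend.fetch_states(fingerprints[i:i + batch_size]))
    return states

class CountingBackend:
    """Fake backend that records how many queries it receives."""
    def __init__(self):
        self.calls = 0

    def fetch_states(self, fps):
        self.calls += 1
        return {fp: 0 for fp in fps}
```

With 2500 fingerprints and a batch size of 1024, the backend is queried three times rather than 2500 times.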

Thank you all, for your contributions!

Support of new Kafka API and other minor improvements

29 Nov 11:36

Long-awaited support of the kafka-python 1.x.x client. Frontera is now much more resilient to physical connectivity loss and uses the new asynchronous Kafka API.

Other improvements:

  • SW consumes less CPU (thanks to less frequent state flushing),
  • the request creation API in BaseCrawlingStrategy is changed and is now batch oriented,
  • new article in the docs on cluster setup,
  • disable scoring log consumption option in DB worker,
  • fix of the HBase drop table operation,
  • improved tests coverage.
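The "less frequent state flushing" item is about buffering writes and flushing only when a size or time threshold is crossed, rather than on every update. A minimal, hypothetical sketch of that throttling pattern (not Frontera's actual worker code):

```python
import time

class ThrottledFlusher:
    """Buffer writes and flush only when the buffer is large enough or
    enough time has passed, instead of flushing on every update."""

    def __init__(self, flush_fn, max_buffer=100, max_interval=30.0):
        self.flush_fn = flush_fn          # called with a list of buffered items
        self.max_buffer = max_buffer
        self.max_interval = max_interval
        self.buffer = []
        self.last_flush = time.monotonic()

    def write(self, item):
        self.buffer.append(item)
        now = time.monotonic()
        if (len(self.buffer) >= self.max_buffer
                or now - self.last_flush >= self.max_interval):
            self.flush_fn(list(self.buffer))
            self.buffer.clear()
            self.last_flush = now
```

The trade-off is durability: anything still in the buffer when the process dies is lost, which is why a time cap accompanies the size cap.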

Python 3 support and many more

18 Aug 09:41
  • Full Python 3 support 👏 👍 🍻 (#106), all the thanks goes to @Preetwinder.
  • canonicalize_url method removed in favor of w3lib implementation.
  • The whole Request (incl. meta) is propagated to DB Worker, by means of scoring log (fixes #131)
  • Generating Crc32 from hostname the same way for both platforms: Python 2 and 3.
  • HBaseQueue supports delayed requests now. A ‘crawl_at’ field in meta with a timestamp makes a request available to spiders only after that moment has passed. An important feature for revisiting.
  • The Request object is now persisted in HBaseQueue, allowing requests to be scheduled with specific meta, headers, body and cookies parameters.
  • MESSAGE_BUS_CODEC option allowing a message bus codec other than the default to be chosen.
  • Strategy worker refactoring to simplify its customization from subclasses.
  • Fixed a bug with extracted links distribution over spider log partitions (#129).
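The delayed-requests idea, a ‘crawl_at’ timestamp that holds a request back until its moment has passed, can be sketched with a plain heap. This is an illustration of the concept only, not the HBaseQueue implementation:

```python
import heapq
import time

class DelayedQueue:
    """Requests carry a crawl_at timestamp and are handed out only once
    that moment has passed, useful for scheduled revisits."""

    def __init__(self):
        self._heap = []

    def push(self, request, crawl_at=0.0):
        # id() breaks ties so non-comparable request objects never compare
        heapq.heappush(self._heap, (crawl_at, id(request), request))

    def pop_ready(self, now=None):
        """Return every request whose crawl_at is at or before `now`."""
        now = time.time() if now is None else now
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

A revisit is then just `push(request, crawl_at=time.time() + interval)`; the queue simply won't surface it until the interval has elapsed.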

Fixed kafka message bus crash with default codec, new options

22 Jul 15:47

New options for managing the broad-crawl queue get algorithm, and improved logging in the manager and strategy worker.

Fixing import crash when kafka-python isn't installed

18 Jul 15:21

Options, proper finishing by crawling strategy and traceback on SIGUSR1

29 Jun 08:09
  • CONSUMER_BATCH_SIZE is removed and two new options are introduced: SPIDER_LOG_CONSUMER_BATCH_SIZE and SCORING_LOG_CONSUMER_BATCH_SIZE.
  • A traceback is written to the log when SIGUSR1 is received in the DBW or SW.
  • Finishing in the SW is fixed for the case when the crawling strategy reports it has finished.
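The SIGUSR1 behaviour can be reproduced in any Python process on a POSIX system; a minimal sketch of the idea (Frontera's workers wire this into their own logging, and the function name here is hypothetical):

```python
import signal
import sys
import traceback

def install_sigusr1_traceback(out=sys.stderr):
    """Install a SIGUSR1 handler that dumps the current Python stack to
    `out`, so a seemingly stuck worker can be inspected without killing it."""

    def handler(signum, frame):
        # frame is where the main thread was interrupted; format its stack
        out.write("".join(traceback.format_stack(frame)))
        out.flush()

    signal.signal(signal.SIGUSR1, handler)
```

After installing the handler, `kill -USR1 <pid>` prints the worker's current stack without affecting the crawl.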

Kafka codec option

24 Jun 10:41

Before this release the default compression codec was Snappy. We found out that Snappy support is broken in certain Kafka versions, and issued this release. The latest version has no compression codec enabled by default, and allows choosing the compression codec with the KAFKA_CODEC_LEGACY option.
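The release notes only name the option, not its accepted values. As a hedged illustration, a Frontera settings module might opt back into compression like this (the 'snappy' value is an assumption for illustration; check the Frontera docs for your version):

```python
# Hypothetical Frontera settings fragment. KAFKA_CODEC_LEGACY comes from
# the release notes; the 'snappy' value is an assumption, not confirmed there.
# Compression is off by default after this release; set a codec only if
# your Kafka version's Snappy support is known to work.
KAFKA_CODEC_LEGACY = 'snappy'
```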