    Repositories list

    • web-poet

      Public
      Web scraping Page Objects core library
      Python
      BSD 3-Clause "New" or "Revised" License
      15951414 Updated Oct 10, 2024
    • Page Object pattern for Scrapy
      Python
      BSD 3-Clause "New" or "Revised" License
      2811995 Updated Oct 10, 2024
    • spidermon

      Public
      Scrapy extension for monitoring spider execution.
      Python
      BSD 3-Clause "New" or "Revised" License
      96530396 Updated Oct 10, 2024
    • Software stack with latest Scrapy and updated deps
      Dockerfile
      BSD 3-Clause "New" or "Revised" License
      206021 Updated Oct 7, 2024
    • Python
      BSD 3-Clause "New" or "Revised" License
      141320 Updated Oct 2, 2024
    • Python parser for human-readable dates
      Python
      BSD 3-Clause "New" or "Revised" License
      4652.5k28450 Updated Oct 2, 2024
    • A Python binding for CRFsuite
      Python
      MIT License
      221770453 Updated Oct 1, 2024
    • streamparse lets you run Python code against real-time streams of data. Integrates with Apache Storm.
      Python
      Apache License 2.0
      217201 Updated Sep 20, 2024
    • Parse numbers written in natural language
      Python
      BSD 3-Clause "New" or "Revised" License
      23107126 Updated Sep 16, 2024
    • Formasaurus tells you the type of an HTML form and its fields using machine learning
      HTML
      47701 Updated Aug 7, 2024
    • splash

      Public
      Lightweight, scriptable browser as a service with an HTTP API
      Python
      BSD 3-Clause "New" or "Revised" License
      5154.1k37726 Updated Aug 2, 2024
    • extruct

      Public
      Extract embedded metadata from HTML markup
      Python
      BSD 3-Clause "New" or "Revised" License
      1138473815 Updated Jul 25, 2024
    • A Postgres-backed ContentsManager implementation for IPython
      Python
      Apache License 2.0
      83201 Updated Jul 18, 2024
    • Crawl Frontier HCF backend
      Python
      BSD 3-Clause "New" or "Revised" License
      5721 Updated Jul 17, 2024
    • shublang

      Public
      Pluggable DSL that uses pipes to perform a series of linear transformations to extract data
      Python
      BSD 3-Clause "New" or "Revised" License
      815236 Updated Jul 9, 2024
    • Scrapy entrypoint for Scrapinghub job runner
      Python
      BSD 3-Clause "New" or "Revised" License
      162570 Updated Jul 8, 2024
    • An opinionated fork of the Drone CI system
      Go
      Other
      363005 Updated Jul 7, 2024
    • varanus

      Public
      A command line spider monitoring tool
      Python
      7822 Updated Jul 6, 2024
    • scrapyrt

      Public
      HTTP API for Scrapy spiders
      Python
      BSD 3-Clause "New" or "Revised" License
      162832246 Updated Jun 28, 2024
    • portia

      Public
      Visual scraping for Scrapy
      Python
      BSD 3-Clause "New" or "Revised" License
      1.4k9.3k11119 Updated Jun 26, 2024
    • scikit-learn inspired API for CRFsuite
      Python
      215100 Updated Jun 18, 2024
    • Python
      MIT License
      2403 Updated Jun 17, 2024
    • autologin

      Public
      A project that attempts to log in to a website automatically, given a single seed
      Python
      Apache License 2.0
      441102 Updated Jun 17, 2024
    • Python wrapper for the Intercom API.
      Python
      Other
      145101 Updated Jun 17, 2024
    • luigi

      Public
      Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.
      Python
      Apache License 2.0
      2.4k401 Updated Jun 7, 2024
    • mrjob

      Public
      Run MapReduce jobs on Hadoop or Amazon Web Services
      Python
      Other
      587001 Updated Jun 6, 2024
    • andi

      Public
      Library for annotation-based dependency injection
      Python
      BSD 3-Clause "New" or "Revised" License
      52031 Updated Jun 3, 2024
    • Keep docker hosts tidy
      Python
      Apache License 2.0
      50001 Updated May 21, 2024
    • aduana

      Public
      Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even when making big crawls (one billion pages).
      C
      BSD 3-Clause "New" or "Revised" License
      95592 Updated May 21, 2024
    • exporters

      Public
      Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations
      Python
      BSD 3-Clause "New" or "Revised" License
      104057 Updated May 21, 2024