    Repositories list

    • khazeshgar.ir
      CSS
      0100 Updated May 6, 2017
    • Crawler for gab website emails
      Java
      0100 Updated Feb 13, 2017
    • This package provides some I/O functions that help you read and write files quickly
      Java
      0100 Updated Feb 13, 2017
    • fess

      Public
      Fess is a very powerful and easily deployable Enterprise Search Server.
      Java
      Other
      167100 Updated Feb 10, 2017
    • importer

      Public
      Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
      Java
      23100 Updated Feb 8, 2017
    • gecco

      Public
      Easy-to-use lightweight web crawler
      Java
      MIT License
      891100 Updated Feb 8, 2017
    • A set of reusable Java components that implement functionality common to any web crawler
      Java
      Apache License 2.0
      75100 Updated Feb 7, 2017
    • Norconex HTTP Collector is a flexible web crawler for collecting, parsing, and manipulating data from the Internet (or Intranet) to various data repositories such as search engines.
      Java
      68000 Updated Feb 6, 2017
    • okhttp

      Public
      An HTTP+HTTP/2 client for Android and Java applications.
      Java
      Apache License 2.0
      9.2k000 Updated Feb 5, 2017
    • A list of some crawlers
      GNU General Public License v3.0
      0100 Updated Feb 3, 2017
    • News crawling with SC - stores output as WARC
      Java
      Apache License 2.0
      34100 Updated Feb 3, 2017
    • crawler4j

      Public
      Open Source Web Crawler for Java
      Java
      Other
      1.9k100 Updated Jan 31, 2017
    • webmagic

      Public
      A scalable web crawler framework for Java.
      Java
      4.2k100 Updated Jan 27, 2017
    • 0000 Updated Jan 27, 2017
    • 0100 Updated Jan 27, 2017
    • Extract tables from PDF files
      Java
      MIT License
      425100 Updated Jan 25, 2017
    • heritrix3

      Public
      Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
      Java
      762100 Updated Jan 23, 2017
    • An agile, distributed crawler framework.
      Java
      Apache License 2.0
      680000 Updated Jan 11, 2017
    • WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web, so you can set up a multi-threaded web crawler in less than 5 minutes.
      Java
      GNU General Public License v3.0
      1.5k100 Updated Jan 7, 2017
    • webporter

      Public
      A Java crawler application based on webmagic
      Java
      859100 Updated Dec 27, 2016
    • A collection of awesome web crawlers and spiders in different languages
      MIT License
      706100 Updated Dec 2, 2016
    • An HTML parser with XPath support, based on Jsoup. Maybe it is the best in Java, ha ha. Just try it.
      Java
      154100 Updated Nov 16, 2016
    • This is a mirror of the script by Giuseppe Attardi, and contains history before the official repo started: https://github.com/attardi/wikiextractor --- Extracts and cleans text from Wikipedia database dump and stores output in a number of files of similar size in a given directory.
      Python
      94100 Updated Aug 17, 2016
    • This repository contains Selenium test code for the San Market (سان مارکت) website, written in Java
      Java
      1000 Updated Jan 2, 2016
    • anthelion

      Public
      Anthelion is a plugin for Apache Nutch to crawl semantic annotations within HTML pages
      Java
      Apache License 2.0
      666100 Updated Dec 17, 2015
    • crawler

      Public
      Simple Java web crawler
      Java
      Apache License 2.0
      54100 Updated May 15, 2015
    • crawler-1

      Public
      Simple Java web crawler
      Java
      38100 Updated Dec 2, 2014
    • The CommonCrawl Crawler Engine and Related MapReduce code
      Java
      63100 Updated Jul 14, 2013
    • Crawler-2

      Public
      A simple crawler that fetches all the news from http://mehrnews.ir
      Java
      1100 Updated May 24, 2011