Home
The Wikipedia extractor tool extracts plain text from a Wikipedia database dump, discarding any other information or annotation present in Wikipedia pages, such as images, tables, references and lists.
The tool is written in Python and requires no additional library.
Wikipedia articles are written in the MediaWiki Markup Language, which provides a simple notation for formatting text (bold, italics, underlines, images, tables, etc.). It is also possible to insert HTML markup in the documents. Wiki and HTML tags are sometimes misused (unclosed tags, wrong attributes, etc.), so the extractor applies some heuristics to work around such problems.
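To illustrate the kind of cleanup described above, here is a minimal sketch that strips quote-based emphasis and HTML tags, including tags that were never closed. The helper `strip_markup` and both regular expressions are hypothetical and are not taken from WikiExtractor.py, whose heuristics cover far more cases.

```python
import re

# Hypothetical patterns for illustration only, not the extractor's own rules.
BOLD_ITALIC = re.compile(r"'{2,5}")           # ''italic'', '''bold''', '''''both'''''
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")  # any opening or closing tag, paired or not


def strip_markup(text: str) -> str:
    """Remove wiki emphasis quotes and HTML tags, keeping the inner text."""
    text = BOLD_ITALIC.sub("", text)
    text = HTML_TAG.sub("", text)
    return text


# The <b> tag below is deliberately left unclosed.
sample = "'''Python''' is a <b>programming language <i>created</i> by ''Guido van Rossum''."
print(strip_markup(sample))
# prints: Python is a programming language created by Guido van Rossum.
```

Removing tags independently of each other, rather than matching open/close pairs, is what makes this approach tolerant of unclosed tags.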
The current beta version is capable of performing template expansion to some extent.
WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The output is stored in a number of files of similar size in a given directory. Each file contains several documents in this document format.
This is a beta version that performs template expansion by preprocessing the whole dump and extracting template definitions.
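One way to produce output files of similar size is to group documents into batches and close a batch before it would exceed a byte limit. The sketch below shows that idea only; `split_by_size` is a hypothetical helper, and the script's actual splitting logic may differ.

```python
from typing import Iterable, Iterator, List


def split_by_size(documents: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group documents into batches whose UTF-8 size stays near max_bytes.

    A document is never split; a batch is closed before it would exceed the limit.
    A single document larger than max_bytes ends up in a batch of its own.
    """
    batch: List[str] = []
    size = 0
    for doc in documents:
        doc_size = len(doc.encode("utf-8"))
        if batch and size + doc_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch


docs = ["a" * 400, "b" * 500, "c" * 300, "d" * 900]
batches = list(split_by_size(docs, max_bytes=1000))
print([sum(len(d) for d in b) for b in batches])
# prints: [900, 300, 900]
```

Each batch would then be written to its own output file, optionally compressed.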
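The preprocessing pass can be pictured as a scan over the dump that keeps only pages whose title marks them as templates. The following is a simplified sketch under several assumptions: `collect_templates` is a hypothetical function, the sample dump omits the XML namespace that real dumps declare, and `Template:` is the prefix used on the English Wikipedia.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; real dumps are much larger and namespace-qualified.
SAMPLE_DUMP = """<mediawiki>
  <page><title>Template:Greeting</title><revision><text>Hello, {{{1}}}!</text></revision></page>
  <page><title>Example</title><revision><text>{{Greeting|world}}</text></revision></page>
</mediawiki>"""


def collect_templates(stream, prefix="Template:"):
    """Return {template name: template body} for every template page in the stream."""
    templates = {}
    # iterparse streams the file, so the whole dump never has to fit in memory.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title", default="")
            if title.startswith(prefix):
                templates[title[len(prefix):]] = elem.findtext("revision/text", default="")
            elem.clear()  # release the page once it has been inspected
    return templates


templates = collect_templates(io.BytesIO(SAMPLE_DUMP.encode("utf-8")))
print(templates)
# prints: {'Greeting': 'Hello, {{{1}}}!'}
```

Saving the collected definitions to a file lets later runs skip this pass, which is the role the `--templates` option plays according to the usage text.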
Usage:
WikiExtractor.py [options] xml-dump-file
optional arguments:
  -h, --help                            show this help message and exit
  -o OUTPUT, --output OUTPUT            output directory
  -b n[KM], --bytes n[KM]               put specified bytes per output file (default is 1M)
  -B BASE, --base BASE                  base URL for the Wikipedia pages
  -c, --compress                        compress output files using bzip
  -l, --links                           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2     accepted namespaces
  -q, --quiet                           suppress reporting progress info
  -s, --sections                        preserve sections
  -a, --article                         analyze a file containing a single article
  --templates TEMPLATES                 use or create file containing templates
  -v, --version                         print program version
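The `--bytes` option takes a size written as `n[KM]`, such as `500K` or `1M`. As a rough sketch of how such a value can be turned into a byte count, the hypothetical `parse_size` below assumes that `K` means 1024 bytes and `M` means 1024 × 1024 bytes; the usage text does not state which multipliers the script uses.

```python
import re


def parse_size(spec: str) -> int:
    """Convert a size such as '42', '500K' or '1M' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([KkMm]?)", spec.strip())
    if not match:
        raise ValueError(f"invalid size: {spec!r}")
    number = int(match.group(1))
    unit = match.group(2).upper()
    # Assumed binary multipliers; the script itself may define these differently.
    multipliers = {"": 1, "K": 1024, "M": 1024 * 1024}
    return number * multipliers[unit]


print(parse_size("1M"))    # prints: 1048576
print(parse_size("500K"))  # prints: 512000
```

Rejecting anything that does not match the pattern keeps a mistyped value from silently becoming a very small or very large file size.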
- All Wikipedia database dumps
- torrents for use with a BitTorrent client such as uTorrent
- WikiPrep: a Perl tool for preprocessing Wikipedia XML dumps.
- Extracting Text from Wikipedia: another Python tool for extracting text from Wikipedia XML dumps.
- Alternative Parsers: a list of links, descriptions, and status reports for the various alternative MediaWiki parsers.