Home

Table of Contents: Introduction, Usage, Wikipedia dumps, Related Work

Introduction

The Wikipedia extractor tool extracts plain text from a Wikipedia database dump, discarding any other information or annotation present in Wikipedia pages, such as images, tables, references and lists.

The tool is written in Python and requires no additional library.

Wikipedia articles are written in the MediaWiki Markup Language, which provides a simple notation for formatting text (bold, italics, underlines, images, tables, etc.). It is also possible to insert HTML markup in the documents. Wiki and HTML tags are sometimes misused (unclosed tags, wrong attributes, etc.), so the extractor applies some heuristics to work around such problems.
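As a rough illustration of the tolerance this requires, the sketch below strips quote-based emphasis and HTML tags one token at a time, so that an unclosed tag does not swallow the text after it. It is not the code used by the extractor; the names `strip_markup`, `BOLD_ITALIC` and `HTML_TAG` are hypothetical, and real wiki markup has many more cases than these two patterns cover.

```python
import re

# Hypothetical helpers for illustration; not taken from WikiExtractor.py.
BOLD_ITALIC = re.compile(r"'{2,5}")            # ''italic'', '''bold''', '''''both'''''
HTML_TAG = re.compile(r"</?[a-zA-Z][^<>]*>")   # any single opening or closing tag


def strip_markup(text):
    """Remove quote-based emphasis and HTML tags, keeping the enclosed text."""
    text = BOLD_ITALIC.sub("", text)
    # Tags are removed individually rather than as matched pairs, so an
    # unclosed <b> or a stray </i> cannot hide the text that follows it.
    return HTML_TAG.sub("", text)


print(strip_markup("'''Rome''' is the <b>capital of ''Italy''"))
# prints: Rome is the capital of Italy
```

Removing tags individually is a deliberately forgiving choice: it never needs to decide where a missing closing tag should have been.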

The current beta version is capable of performing template expansion to some extent.

Usage

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The output is stored in a number of files of similar size in a given directory. Each file contains several documents in the tool's document format.
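One way to obtain output files of similar size is to rotate to a new file whenever the next document would push the current one past a byte limit. The sketch below shows that idea under stated assumptions: the class name `SplitWriter` and the `part_0000.txt` naming scheme are hypothetical and are not claimed to match what WikiExtractor.py writes.

```python
import os


class SplitWriter:
    """Write documents to numbered files, rotating when a size limit is hit."""

    def __init__(self, directory, max_bytes=1024 * 1024):
        self.directory = directory
        self.max_bytes = max_bytes
        self.index = 0       # number of files opened so far
        self.size = 0        # bytes written to the current file
        self.handle = None
        os.makedirs(directory, exist_ok=True)

    def _rotate(self):
        if self.handle is not None:
            self.handle.close()
        # "part_0000.txt" is an illustrative naming scheme only.
        path = os.path.join(self.directory, "part_%04d.txt" % self.index)
        self.handle = open(path, "wb")
        self.index += 1
        self.size = 0

    def write(self, document):
        data = document.encode("utf-8")
        # Never split a document: rotate first if it would overflow the file.
        overflow = self.size > 0 and self.size + len(data) > self.max_bytes
        if self.handle is None or overflow:
            self._rotate()
        self.handle.write(data)
        self.size += len(data)

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None
```

Because documents are kept whole, a file can exceed the limit only when a single document is larger than the limit itself.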

This is a beta version that performs template expansion by preprocessing the whole dump and extracting template definitions.
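The preprocessing pass can be pictured as a single streaming scan of the dump that keeps only the pages holding template definitions. The following is a minimal sketch of that idea using the standard library XML parser, not the script's own implementation. It assumes the usual `<page>`, `<title>` and `<text>` elements of a MediaWiki XML export and the English `Template:` title prefix; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_templates(source, prefix="Template:"):
    """Return {template name: wikitext} for every template page in the dump.

    `source` is a file name or a binary file object. The prefix is an
    assumption: other language editions use a localized namespace name.
    """
    templates = {}
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        if title and title.startswith(prefix):
            templates[title[len(prefix):]] = text
        elem.clear()  # release the page contents; dumps are very large
    return templates
```

Saving the resulting dictionary to a file would avoid repeating the scan on later runs, which is the role the `--templates` option plays according to its description.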

Usage:

 WikiExtractor.py [options] xml-dump-file

optional arguments:

  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        output directory
  -b n[KM], --bytes n[KM]
                        put specified bytes per output file (default is 1M)
  -B BASE, --base BASE  base URL for the Wikipedia pages
  -c, --compress        compress output files using bzip
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces
  -q, --quiet           suppress reporting progress info
  -s, --sections        preserve sections
  -a, --article         analyze a file containing a single article
  --templates TEMPLATES
                        use or create file containing templates
  -v, --version         print program version
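The `n[KM]` form accepted by `-b` is a number with an optional unit suffix. A small sketch of how such a value could be turned into a byte count follows. The function name `parse_size` is hypothetical, and treating K as 1024 bytes and M as 1024 * 1024 bytes is an assumption; the help text above does not say whether the units are binary or decimal.

```python
import re

# A number followed by an optional K or M, as in the "n[KM]" help text.
SIZE = re.compile(r"(\d+)([KM]?)")

# Assumption: binary units. The help text does not specify this.
UNITS = {"": 1, "K": 1024, "M": 1024 * 1024}


def parse_size(spec):
    """Convert a size such as '500K' or '1M' into a number of bytes."""
    match = SIZE.fullmatch(spec)
    if match is None:
        raise ValueError("invalid size: %r" % spec)
    number, unit = match.groups()
    return int(number) * UNITS[unit]


print(parse_size("500K"))  # prints 512000 under the binary-unit assumption
```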

Wikipedia dumps

All Wikipedia database dumps
Torrents for use with a BitTorrent client such as uTorrent

Related Work

WikiPrep: A Perl tool for preprocessing Wikipedia XML dumps.
Extracting Text from Wikipedia: Another Python tool for extracting text from Wikipedia XML dumps.
Alternative Parsers: A list of links, descriptions, and status reports of the various alternative MediaWiki parsers.
