update perf numbers of segment_wiki from a fresh h3 run
piskvorky authored and KMarie1 committed Nov 26, 2017
1 parent e17545b commit 8d5515e
Showing 1 changed file with 7 additions and 7 deletions: gensim/scripts/segment_wiki.py
@@ -5,14 +5,14 @@
 # Copyright (C) 2016 RaRe Technologies
 
 """
-CLI script for extracting plain text out of a raw Wikipedia dump. This is a xml.bz2 file provided by MediaWiki \
-and looks like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 \
-(e.g. 14 GB: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
+CLI script for extracting plain text out of a raw Wikipedia dump. Input is an xml.bz2 file provided by MediaWiki \
+that looks like <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2 \
+(e.g. 14 GB of https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2).
 It streams through all the XML articles using multiple cores (#cores - 1, by default), \
-decompressing on the fly and extracting plain text article sections from each article.
+decompressing on the fly and extracting plain text from the articles and their sections.
-For each article, it prints its title, section names and section contents, in json-line format.
+For each extracted article, it prints its title, section names and plain text section contents, in json-line format.
 
 Examples
 --------
@@ -21,8 +21,8 @@
 python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz
 
-Processing the entire English Wikipedia dump takes 2 hours (about 2.5 million articles \
-per hour, on 8 core Intel Xeon E3-1275@3.60GHz).
+Processing the entire English Wikipedia dump takes 1.7 hours (about 3 million articles per hour, \
+or 10 MB of XML per second) on an 8 core Intel i7-7700 @3.60GHz.
 
 You can then read the created output (~6.1 GB gzipped) with:
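The read-back example that follows in the docstring is collapsed in this diff view. As a minimal sketch of consuming the script's gzipped json-lines output (the field names `title`, `section_titles` and `section_texts` are assumed from segment_wiki's output format; this uses only the standard library and a tiny self-made sample file rather than a real dump):

```python
import gzip
import json

# Hypothetical sample article in segment_wiki's json-line format:
# one JSON object per line, with assumed fields "title",
# "section_titles" and "section_texts".
sample = {
    "title": "Anarchism",
    "section_titles": ["Introduction", "History"],
    "section_texts": ["Anarchism is a political philosophy ...", "..."],
}

path = "sample.json.gz"
with gzip.open(path, "wt", encoding="utf-8") as fout:
    fout.write(json.dumps(sample) + "\n")

# Stream articles back one line at a time, so a multi-GB output
# file never has to fit in memory.
with gzip.open(path, "rt", encoding="utf-8") as fin:
    for line in fin:
        article = json.loads(line)
        print(article["title"], "--", len(article["section_titles"]), "sections")
```

The same loop works unchanged on the real ~6.1 GB `enwiki-latest.json.gz`, since gzip decompression and json parsing both happen line by line.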
