
Fix segment-wiki script #1694

Merged: 8 commits, Nov 6, 2017
gensim/scripts/segment_wiki.py (26 additions, 1 deletion)

@@ -14,6 +14,29 @@
'section_titles' (list) - list of titles of sections,
'section_texts' (list) - list of content from sections.
@piskvorky (Owner) commented on Nov 5, 2017:

I'd prefer to include a concrete hands-on example, something like this:


Process a raw Wikipedia dump (XML.bz2 format, for example https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for the English Wikipedia) and extract all articles and their sections as plain text::

python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o enwiki-20171001-pages-articles.json.gz

The output format of the parsed plain text Wikipedia is json-lines: one article per line, serialized into JSON. Here's an example of how to work with it from Python::

from smart_open import smart_open
import json

# iterate over the plain text file we just created
for line in smart_open('enwiki-20171001-pages-articles.json.gz'):
    # decode JSON into a Python object
    article = json.loads(line)

    # each article has "title", "section_titles" and "section_texts" fields
    print("Article title: %s" % article['title'])
    for section_title, section_text in zip(article['section_titles'], article['section_texts']):
        print("Section title: %s" % section_title)
        print("Section text: %s" % section_text)


The English Wikipedia dump is available
`here <https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2>`_.
Processing it takes approximately 2.5 hours (i7-6700HQ, SSD).

Examples
--------

Convert wiki to json-lines format:
`python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest.json.gz`

Read the json-lines dump:

>>> from smart_open import smart_open
>>> import json
>>>
>>> # iterate over the plain text file we just created
>>> for line in smart_open('enwiki-latest.json.gz'):
>>>     # decode JSON into a Python object
>>>     article = json.loads(line)
>>>
>>>     # each article has "title", "section_titles" and "section_texts" fields
>>>     print("Article title: %s" % article['title'])
>>>     for section_title, section_text in zip(article['section_titles'], article['section_texts']):
>>>         print("Section title: %s" % section_title)
>>>         print("Section text: %s" % section_text)

"""

import argparse
@@ -226,7 +249,9 @@ def get_texts_with_sections(self):
for group in utils.chunkize(page_xmls, chunksize=10 * self.processes, maxsize=1):
for article_title, sections in pool.imap(segment, group): # chunksize=10):
# article redirects are pruned here
if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES) \
or len(sections) == 0 \
@piskvorky (Owner) commented:

`not sections` is more Pythonic.

or sections[0][1].lstrip().startswith("#REDIRECT"):
continue
articles += 1
yield (article_title, sections)
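
For illustration, here is a minimal self-contained sketch of the pruning rule with the reviewer's `not sections` suggestion applied. The helper name `keep_article` and the namespace list are made up for this example; only the condition itself mirrors the diff above:

IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template']  # illustrative subset

def keep_article(article_title, sections):
    # sections is a list of (section_title, section_text) pairs, as yielded above
    if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES) \
            or not sections \
            or sections[0][1].lstrip().startswith("#REDIRECT"):
        return False
    return True

# quick sanity checks
assert not keep_article('Category:Physics', [('Intro', 'Some text.')])            # ignored namespace
assert not keep_article('Some article', [])                                       # no sections at all
assert not keep_article('Some article', [('Intro', '#REDIRECT [[Other page]]')])  # redirect stub
assert keep_article('Some article', [('Intro', 'Actual article text.')])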