Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Improve segment_wiki script #1707

Merged
merged 8 commits into from
Nov 11, 2017
Merged

[MRG] Improve segment_wiki script #1707

merged 8 commits into from
Nov 11, 2017

Conversation

piskvorky
Copy link
Owner

@piskvorky piskvorky commented Nov 10, 2017

This PR cleans up the documentation of the segment_wiki.py script, continuing from #1694.

I also added a new -w option to the script, letting users control the number of workers.

I also fixed some broken formatting in the release notes in the section on segment_wiki.

@piskvorky piskvorky changed the title Improve segment_wiki script [WIP] Improve segment_wiki script Nov 10, 2017
@@ -54,7 +51,7 @@
logger = logging.getLogger(__name__)


def segment_all_articles(file_path, min_article_character=200):
def segment_all_articles(file_path, min_article_character=200, workers=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why workers=None by default? If default is number of cores - 1, this should be workers=multiprocessing.cpu_count() - 1 or something equal (same for all workers=...)

Copy link
Owner Author

@piskvorky piskvorky Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is (see the docs, updated).

Copy link
Contributor

@menshikh-iv menshikh-iv Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not substitute the real default value (instead of None), when you have it? Why needed None here?

Copy link
Owner Author

@piskvorky piskvorky Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it more explicit -- a person viewing the signature would see a number, and perhaps not realize this number is "variable" (will be different on a different system).

In other words, I find it a little "less surprising" when only constants are the values in default parameters. Anything dynamic/mutable is handled by None instead.

yield (article_title, sections)
logger.info(
"finished processing %i articles with %i sections (skipped %i redirects, %i stubs, %i ignored namespaces)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is 'sections' needed here, what is the good of it?

Copy link
Owner Author

@piskvorky piskvorky Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity checking.

The rule of thumb for logging is: always display statistics and concrete examples, to make debugging and "sanity eyeballing" easier for humans.

Plus, they are interesting and note-worthy numbers in itself (how many sections are there, in the entire Wikipedia?).

parser.add_argument('-f', '--file', help='Path to MediaWiki database dump', required=True)
parser.add_argument('-o', '--output', help='Path to output file (stdout if not specified)')
parser.add_argument(
'-w', '--workers',
help='Number of parallel workers for multi-core systems (default: %i)' % default_workers,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'(default: %i)' % default workers -> (default: %(default)s)

Copy link
Owner Author

@piskvorky piskvorky Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? That doesn't sound right, it changes the semantics (str vs int, different var name).

Or is that some automatic feature of argparse?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's automatic substitution default value, doc

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I didn't know about that. Changed.

@piskvorky piskvorky changed the title [WIP] Improve segment_wiki script [MRG] Improve segment_wiki script Nov 11, 2017
@@ -5,38 +5,35 @@
# Copyright (C) 2016 RaRe Technologies

"""
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (typical filename
is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2),
Copy link
Contributor

@menshikh-iv menshikh-iv Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please return concrete "patterns" for filenames, Wikipedia provides many archives, users very often use the incorrect archive and don't understand "why this doesn't work".

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -54,7 +51,7 @@
logger = logging.getLogger(__name__)


def segment_all_articles(file_path, min_article_character=200):
def segment_all_articles(file_path, min_article_character=200, workers=None):
Copy link
Contributor

@menshikh-iv menshikh-iv Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not substitute the real default value (instead of None), when you have it? Why needed None here?

@menshikh-iv menshikh-iv merged commit a21e7df into develop Nov 11, 2017
@menshikh-iv menshikh-iv deleted the fix_segment_wiki branch November 11, 2017 10:01
VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
* improve segment_wiki docstring

* add segment_wiki option to control the number of multicore workers

* improve segment_wiki logging

* update segment_wiki timing

* clarify segment_wiki docs

* fix argparse defaults in segment_wiki

* log more detailed progress in segment_wiki

* example of MediaWiki format
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants