-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Improve segment_wiki script #1707
Conversation
@@ -54,7 +51,7 @@ | |||
logger = logging.getLogger(__name__) | |||
|
|||
|
|||
def segment_all_articles(file_path, min_article_character=200): | |||
def segment_all_articles(file_path, min_article_character=200, workers=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why workers=None
by default? If default is number of cores - 1
, this should be workers=multiprocessing.cpu_count() - 1
or something equal (same for all workers=...
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is (see the docs, updated).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not substitute the real default value (instead of None
), when you have it? Why needed None
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it more explicit -- a person viewing the signature would see a number, and perhaps not realize this number is "variable" (will be different on a different system).
In other words, I find it a little "less surprising" when only constants are the values in default parameters. Anything dynamic/mutable is handled by None
instead.
yield (article_title, sections) | ||
logger.info( | ||
"finished processing %i articles with %i sections (skipped %i redirects, %i stubs, %i ignored namespaces)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is 'sections' needed here, what is the good of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sanity checking.
The rule of thumb for logging is: always display statistics and concrete examples, to make debugging and "sanity eyeballing" easier for humans.
Plus, they are interesting and note-worthy numbers in itself (how many sections are there, in the entire Wikipedia?).
gensim/scripts/segment_wiki.py
Outdated
parser.add_argument('-f', '--file', help='Path to MediaWiki database dump', required=True) | ||
parser.add_argument('-o', '--output', help='Path to output file (stdout if not specified)') | ||
parser.add_argument( | ||
'-w', '--workers', | ||
help='Number of parallel workers for multi-core systems (default: %i)' % default_workers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'(default: %i)' % default workers
-> (default: %(default)s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? That doesn't sound right, it changes the semantics (str vs int, different var name).
Or is that some automatic feature of argparse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's automatic substitution default
value, doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I didn't know about that. Changed.
@@ -5,38 +5,35 @@ | |||
# Copyright (C) 2016 RaRe Technologies | |||
|
|||
""" | |||
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (typical filename | |||
is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please return concrete "patterns" for filenames, Wikipedia provides many archives, users very often use the incorrect archive and don't understand "why this doesn't work".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -54,7 +51,7 @@ | |||
logger = logging.getLogger(__name__) | |||
|
|||
|
|||
def segment_all_articles(file_path, min_article_character=200): | |||
def segment_all_articles(file_path, min_article_character=200, workers=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not substitute the real default value (instead of None
), when you have it? Why needed None
here?
* improve segment_wiki docstring * add segment_wiki option to control the number of multicore workers * improve segment_wiki logging * update segment_wiki timing * clarify segment_wiki docs * fix argparse defaults in segment_wiki * log more detailed progress in segment_wiki * example of MediaWiki format
This PR cleans up the documentation of the
segment_wiki.py
script, continuing from #1694.I also added a new
-w
option to the script, letting users control the number of workers.I also fixed some broken formatting in the release notes in the section on segment_wiki.