Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add article interlinks to the output of gensim.scripts.segment_wiki. Fix #1712 #1839

Merged
merged 23 commits into from
Jan 31, 2018

Conversation

steremma
Copy link
Contributor

@steremma steremma commented Jan 13, 2018

Interlinks

This PR adds the feature requested in #1712
The segment_wiki script now includes the interlinks (links to other articles in the dump) found in each article it processes.

Implementation

The interlink markup parsing takes place in the filtered text rather than the raw one in order to avoid code and logic duplication (similar markup would need to be removed again which is already done in filter_wiki). In order to achieve this the function filter_wiki must not promote the uncaught markup to plain text. The usefulness of this operation was already doubted (see comment in old version) so I removed the promotion altogether:

# the following is needed to make the tokenizer see '[[socialist]]s' as a single word 'socialists'
# TODO is this really desirable?
text = text.replace('[', '').replace(']', '')  

Removing this line was important because:

  1. Phrases like "Computer Science" which has a meaning as a phrase will lose its meaning if its promoted to plain text (2 tokens)

  2. the [[ and ]] are the markup used for interlinks. So if we promote them then we lose this information

Since the output of the script and its helper functions changed, I modified the documentation (source comments) accordingly.

* New output format is (str, list of (str, str), list of str, reflecting
structure (title, [(section_heading, section_content), ...], [interlink, ...])

* `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text
as this will give up valuable information for the interlink discovery
Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start @steremma!
It will be really nice if you add several simple tests for segment_wiki script (you already have small wiki dump as test data)

for idx, (article_title, article_sections) in enumerate(article_stream):
output_data = {"title": article_title, "section_titles": [], "section_texts": []}
for idx, (article_title, article_sections, article_interlinks) in enumerate(article_stream):
output_data = {"title": article_title,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use hanging indents (instead of vertical)

@@ -321,3 +336,15 @@ def get_texts_with_sections(self):
)

logger.info("finished running %s", sys.argv[0])

print("-----Now checking output--------\n\n\n")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? This isn't needed here.

output_data = {"title": article_title,
"section_titles": [],
"section_texts": [],
"section_interlinks": article_interlinks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't split interlinks by sections, this should name as "interlinks" instead of "section_interlinks".

…re disregarded

* Due to preprocessing in `filter_wiki` interlinks containing alternative names had
one of the 2 `[` and `]` characters removed. The regex now takes that into account.
* Initiate unit testing for all scripts.

* Check for expected len given article filtering (namespace, size in characters and redirections).

* Check for yielded title, section headings and texts as well as interlinks yielded from generator.

* Check that the same is correctly persisted in JSON.

* Fix PEP 8
* Refactored filtering functions in ``wikicorpus.py` so that
uncaught markup can be optionally promoted to plain text

* Interlink extraction logic moved to `wikicorpus.py`

* Unit tests modified accordingly
@steremma steremma closed this Jan 16, 2018
@steremma steremma reopened this Jan 16, 2018
@menshikh-iv menshikh-iv changed the title Add article interlinks to the output of gensim.scripts.segment_wiki() Add article interlinks to the output of gensim.scripts.segment_wiki. Fix #1712 Jan 18, 2018
Find all interlinks to other articles in the dump. `raw` is either unicode
or utf-8 encoded string.
"""
interlink_regex_capture = r"\[{1,2}(.*?)\]{1,2}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this regex can be compiled outside the function.

…ual interlink text

* Used boolean argument with default argument in `filter_wiki`. The default value keeps the old functionality
so that existing code does not brake

* Overriding the default argument causes interlinks to not be simplified and lets `find_interlinks` create the mappings
@steremma steremma closed this Jan 21, 2018
@steremma steremma reopened this Jan 21, 2018
@menshikh-iv
Copy link
Contributor

@steremma please don't forget to

  • resolve the merge conflict
  • run segment_wiki on full wiki dump (with python2 and python3), this needs some time

…into interlinks

* Kept documentation improvements from upstream

* Kept interlink support and updated signatures from my branch

* Added documentation from my extra arguments in correct format
@steremma steremma closed this Jan 25, 2018
@steremma steremma reopened this Jan 25, 2018
@menshikh-iv
Copy link
Contributor

Looks great, thank you @steremma!

@menshikh-iv menshikh-iv merged commit aa10f79 into piskvorky:develop Jan 31, 2018
@steremma
Copy link
Contributor Author

👍

sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this pull request Feb 21, 2018
…Fix piskvorky#1712 (piskvorky#1839)

* promoting the markup gives up information needed to find the intelinks

* Add interlinks to the output of `segment_wiki`

* New output format is (str, list of (str, str), list of str, reflecting
structure (title, [(section_heading, section_content), ...], [interlink, ...])

* `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text
as this will give up valuable information for the interlink discovery

* Fixed PEP 8

* Refactoring identation and variable names

* Removed debugging code from script

* Fixed a bug where interlinks with a description or multiple names where disregarded

* Due to preprocessing in `filter_wiki` interlinks containing alternative names had
one of the 2 `[` and `]` characters removed. The regex now takes that into account.

* Now stripping whitespace off section titles

* Unit test `gensim.scripts.segment_wiki`

* Initiate unit testing for all scripts.

* Check for expected len given article filtering (namespace, size in characters and redirections).

* Check for yielded title, section headings and texts as well as interlinks yielded from generator.

* Check that the same is correctly persisted in JSON.

* Fix PEP 8

* Fix Python 3.5 compatibility

* Section text now completely clean from wiki markup

* Refactored filtering functions in ``wikicorpus.py` so that
uncaught markup can be optionally promoted to plain text

* Interlink extraction logic moved to `wikicorpus.py`

* Unit tests modified accordingly

* Added extra logging info to troublehsoot weird Travis behavior

* Fix PEP 8

* pin workers for segment_and_write_all_articles

* Get rid of debugging stuff

* Get rid of global logger

* Interlinks are now mapping from the linked article's title to the actual interlink text

* Used boolean argument with default argument in `filter_wiki`. The default value keeps the old functionality
so that existing code does not brake

* Overriding the default argument causes interlinks to not be simplified and lets `find_interlinks` create the mappings

* Moved regex outside function

* Interlink extraction is now optional and controlled with the `-i` command line argument

* PEP 8 long lines

* made scripts tests aware of the optional interlinks argument

* Updated script help output for interlinks
@steremma steremma deleted the interlinks branch March 1, 2018 10:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants