Add interlinks to segment_wiki #1712

menshikh-iv · 2017-11-13T08:26:54Z

Idea

Users have asked for this feature. Having interlinks in the dump is really useful for constructing a graph of articles or for using the relations between articles in other ways.

What needs to be implemented

Add field "section_interlinks" (list of str) that contains a list of article titles referenced by this section.

napsternxg · 2017-11-13T08:57:17Z

Thanks for creating this issue.
The suggested step is great, but to make it more consistent with the overall structure of the output, we should not only use the string of the link text, but also the Wikipage title it points to.

Another suggestion would be to include the span of the matched text with the begin offset and the end position. This will result in getting a segmented corpus for free based on your technique. It can be later used to tokenize the section text with link text items as a single unit.

So the final format may look like:

"section_interlinks":  [
("link string", "link wikipage title", offset, end)
]
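A minimal sketch of how tuples in the proposed format could be produced, assuming only the two simplest link forms, `[[Title]]` and `[[Title|link text]]`. The name `extract_interlinks` and the regex are illustrative and are not part of gensim; offsets index into the plain text that remains after the link markup is stripped.

```python
import re

# Simplified pattern: [[Title]] or [[Title|link text]]. Real wiki markup has more cases.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_interlinks(markup):
    """Return (plain_text, links); each link is (link_text, title, begin, end).

    begin/end are character offsets into plain_text, not into the raw markup.
    """
    parts = []
    links = []
    pos = 0      # current position in the raw markup
    out_len = 0  # length of the plain text built so far
    for match in LINK_RE.finditer(markup):
        before = markup[pos:match.start()]
        parts.append(before)
        out_len += len(before)

        title = match.group(1).strip()
        text = (match.group(2) or match.group(1)).strip()
        links.append((text, title, out_len, out_len + len(text)))

        parts.append(text)
        out_len += len(text)
        pos = match.end()
    parts.append(markup[pos:])
    return "".join(parts), links


plain, links = extract_interlinks(
    "[[Barack Obama]] used [[Android (operating system)|Android]]."
)
```

Slicing `plain[begin:end]` recovers the link text for every tuple, which is what makes the spans usable for tokenization later.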

piskvorky · 2017-11-13T12:49:47Z

The corpora and code included with gensim are restricted to topic modelling and unsupervised text processing. We're not aiming to be "everything for everybody".

Including other types of information (supervised labels, graph structure) is possible but needs to be clearly motivated.

@napsternxg how would you use this extra information? What is the intended application?

napsternxg · 2017-11-13T19:00:56Z

@piskvorky I understand that gensim is focused on topic modelling and unsupervised text processing.

The major application area is utilizing multi-word units in Wikipedia, which are usually linked to other wiki pages, as components of topic models and other text processing. E.g. simple tokenization will split phrases like "Barack Obama" or "Natural Language Processing". Although there is support for extracting n-grams using the Phrases module, a more principled approach when processing wiki pages would be to identify these phrases as a single concept (which is very easy to do for Wikipedia). Mapping the wiki link text to the wiki page would allow for normalizing these phrases to a common concept in Wikipedia. E.g. "LDA" can stand for both "Latent Dirichlet allocation" and "Linear Discriminant Analysis", and the link target tells them apart. This will help in reducing the vocabulary size.

Finally, the motivation for including offset and end values in the JSON data was to help override tokenization flaws, especially with biomedical and chemical names.
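To illustrate the idea, here is a rough sketch that emits each linked span as a single token named after the target page, assuming links are available as `(link_text, title, begin, end)` tuples with offsets into the plain text. The function `tokenize_with_concepts` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def tokenize_with_concepts(text, links):
    """Tokenize on whitespace, but emit every linked span as one concept token.

    links: iterable of (link_text, title, begin, end), offsets into text.
    The concept token is the target title with spaces replaced by underscores.
    """
    tokens = []
    pos = 0
    for _link_text, title, begin, end in sorted(links, key=lambda link: link[2]):
        tokens.extend(text[pos:begin].split())   # ordinary tokens before the link
        tokens.append(title.replace(" ", "_"))   # the whole link as one unit
        pos = end
    tokens.extend(text[pos:].split())            # ordinary tokens after the last link
    return tokens


tokens = tokenize_with_concepts(
    "Barack Obama studied LDA",
    [
        ("Barack Obama", "Barack Obama", 0, 12),
        ("LDA", "Latent Dirichlet allocation", 21, 24),
    ],
)
```

Because the token comes from the page title rather than the link text, different surface forms pointing to the same page collapse into one vocabulary entry.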

These were the use cases I had in mind. I would be happy to see this feature since I have been quite impressed with the processing speed of algorithms in gensim, and the wikipedia dump parser appears to be very fast.

Another alternative would be to use the segment_all_articles generator to add this feature as a post processing step using the article_sections variable. However, this would require that article_sections contains the original text and not the filter_wiki plain text.
https://github.com/RaRe-Technologies/gensim/blob/07c3130283a7512f74293a18eff4344cdbe85f94/gensim/scripts/segment_wiki.py#L83

menshikh-iv · 2017-11-13T20:13:21Z

Thank you @napsternxg. Maybe you could try implementing this feature? That would be great!

napsternxg · 2017-11-13T20:20:02Z

I can have a look at it after December 15th. Will send a PR then.

steremma · 2018-01-16T16:51:05Z

Hey @napsternxg
I have been working on adding this feature; you can check the PR. At the moment the JSON output contains a list of all interlinks found in the article (rather than presenting the interlinks per section). Is there any reason why you would want to know which section an interlink came from? If so, we can make the change (it won't be a huge modification). Otherwise we can merge into develop.

@piskvorky any opinions?

piskvorky · 2018-01-19T12:23:25Z

Thank you for the explanation, that makes sense.

I don't think identifying the interlink location down to a section is critical. But the voice of people who actually use this feature is more important than mine -- do you think the section is important? What are the pros/cons?

napsternxg · 2018-01-19T19:04:54Z

@steremma this is great, thanks for adding this. My use case was being able to identify the multi-word unit in the text along with the wiki page it points to. I don't think the current approach can handle this, since it removes that information and retains only the link to the wiki page. If we could also have the interlink text and identify which wiki page it points to, that would help in training multi-word word vectors more effectively. Still, this approach is quite useful too, as we can include the interwiki links as document tags and train the document embeddings with that information.

menshikh-iv · 2018-01-20T08:18:47Z

@steremma is it possible to do what @napsternxg suggested?

steremma · 2018-01-20T11:43:36Z

I am manually checking sample wiki pages in our test set and it appears that in most cases the text link is exactly the same as the title it points to. There are a few cases where the text is altered a little bit.

So adding this map would show an output mostly like this:
"computer science": "computer science", "mathematics": "mathematics" ...
but with occasional differences like "Android": "Android (operating system)"

Doing this would make a difference in my implementation, because I am now using the filtered text to find the interlinks and, as @napsternxg mentioned, the exact article title is lost. We would instead need to duplicate the filter_wiki logic with a small change in one of the regular expressions used.

EDIT: It can be easily done by adding another boolean argument to filter_wiki. This will have a default value to make sure existing calls get the same results but when called with a False value will not modify the interlinks.
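The pattern described in the EDIT can be sketched as follows. This is a toy stand-in, not gensim's `filter_wiki`: the function name `filter_markup`, the parameter name `simplify_links`, and the regexes are all assumptions made for illustration.

```python
import re

# Simplified link patterns; real wiki markup has more cases than these two.
LINK_WITH_TEXT = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")
PLAIN_LINK = re.compile(r"\[\[([^\[\]|]+)\]\]")


def filter_markup(raw, simplify_links=True):
    """Strip bold/italic quote markup; reduce links to their text only if asked.

    The default (True) mimics the old behaviour, so existing callers are unaffected.
    Passing False leaves [[...]] links intact for a later interlink-extraction step.
    """
    text = re.sub(r"'{2,}", "", raw)
    if simplify_links:
        text = LINK_WITH_TEXT.sub(r"\2", text)  # [[Title|text]] -> text
        text = PLAIN_LINK.sub(r"\1", text)      # [[Title]] -> Title
    return text


raw = "'''[[Android (operating system)|Android]]''' is an OS."
old_behaviour = filter_markup(raw)
links_kept = filter_markup(raw, simplify_links=False)
```

Keeping the flag's default equal to the previous behaviour is what lets the change ship without touching existing call sites.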

steremma · 2018-01-21T12:13:23Z

Done, please check updated PR

…Fix piskvorky#1712 (piskvorky#1839)

* promoting the markup gives up information needed to find the interlinks
* Add interlinks to the output of `segment_wiki`
* New output format is (str, list of (str, str), list of str), reflecting structure (title, [(section_heading, section_content), ...], [interlink, ...])
* `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text as this will give up valuable information for the interlink discovery
* Fixed PEP 8
* Refactoring indentation and variable names
* Removed debugging code from script
* Fixed a bug where interlinks with a description or multiple names were disregarded
* Due to preprocessing in `filter_wiki` interlinks containing alternative names had one of the 2 `[` and `]` characters removed. The regex now takes that into account.
* Now stripping whitespace off section titles
* Unit test `gensim.scripts.segment_wiki`
* Initiate unit testing for all scripts.
* Check for expected len given article filtering (namespace, size in characters and redirections).
* Check for yielded title, section headings and texts as well as interlinks yielded from generator.
* Check that the same is correctly persisted in JSON.
* Fix PEP 8
* Fix Python 3.5 compatibility
* Section text now completely clean from wiki markup
* Refactored filtering functions in `wikicorpus.py` so that uncaught markup can be optionally promoted to plain text
* Interlink extraction logic moved to `wikicorpus.py`
* Unit tests modified accordingly
* Added extra logging info to troubleshoot weird Travis behavior
* Fix PEP 8
* pin workers for segment_and_write_all_articles
* Get rid of debugging stuff
* Get rid of global logger
* Interlinks are now mapping from the linked article's title to the actual interlink text
* Used boolean argument with default argument in `filter_wiki`. The default value keeps the old functionality so that existing code does not break
* Overriding the default argument causes interlinks to not be simplified and lets `find_interlinks` create the mappings
* Moved regex outside function
* Interlink extraction is now optional and controlled with the `-i` command line argument
* PEP 8 long lines
* made scripts tests aware of the optional interlinks argument
* Updated script help output for interlinks
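As a rough illustration of the graph use case from the original request, the sketch below builds an adjacency list from JSON-lines output. It assumes each line is an object with a `"title"` key and an `"interlinks"` mapping from linked article title to link text, as the commit message describes; the exact key names are an assumption here, and `build_link_graph` is a hypothetical helper.

```python
import json


def build_link_graph(json_lines):
    """Map each article title to the sorted list of article titles it links to.

    Assumes one JSON object per line with "title" and (optionally) "interlinks",
    where "interlinks" maps target title -> link text.
    """
    graph = {}
    for line in json_lines:
        article = json.loads(line)
        graph[article["title"]] = sorted(article.get("interlinks", {}))
    return graph


# Hand-written sample lines, not real script output.
sample = [
    json.dumps({
        "title": "Gensim",
        "interlinks": {
            "Python (programming language)": "Python",
            "Topic model": "topic modelling",
        },
    }),
    json.dumps({
        "title": "Topic model",
        "interlinks": {"Latent Dirichlet allocation": "LDA"},
    }),
]
graph = build_link_graph(sample)
```

Since interlink extraction is optional (the `-i` argument), the `.get` fallback keeps the helper usable on dumps written without it.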

menshikh-iv added the "feature" and "difficulty medium" labels Nov 13, 2017

steremma mentioned this issue Jan 13, 2018

Add article interlinks to the output of gensim.scripts.segment_wiki. Fix #1712 #1839

Merged

menshikh-iv closed this as completed in aa10f79 Jan 31, 2018
