Add interlinks to segment_wiki #1712

Closed
menshikh-iv opened this issue Nov 13, 2017 · 11 comments
Labels
difficulty medium (Medium issue: required good gensim understanding & python skills), feature (Issue described a new feature)

Comments

@menshikh-iv
Contributor

menshikh-iv commented Nov 13, 2017

Idea

Users have asked for this feature. Having interlinks in the dump is really useful for constructing a graph of articles or for using the relations between articles in other ways.

What needs to be implemented

Add field "section_interlinks" (list of str) that contains a list of article titles referenced by this section.

menshikh-iv added the feature and difficulty medium labels on Nov 13, 2017
@napsternxg
Contributor

Thanks for creating this issue.
The suggested step is great, but to make it more consistent with the overall structure of the output, we should include not only the link text but also the title of the wiki page it points to.

Another suggestion would be to include the span of the matched text, given as begin and end offsets. This would give a segmented corpus for free, based on your technique. The spans can later be used to tokenize the section text with each link text treated as a single unit.

So the final format may look like:

"section_interlinks":  [
("link string", "link wikipage title", offset, end)
]
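
A minimal sketch of how tuples in this proposed format could be produced from raw wiki markup. This is not gensim's implementation: the `WIKILINK` pattern and the `extract_interlinks` helper are hypothetical names, and the regex deliberately handles only the simple `[[Title]]` and `[[Title|text]]` forms. Offsets refer to the plain text after the link markup has been replaced by the link text.

```python
import re

# Assumption: only [[Title]] and [[Title|link text]] are handled; nested
# links, files and templates are ignored in this sketch.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_interlinks(markup):
    """Return (plain_text, links); each link is (link_text, title, begin, end).

    begin and end are offsets into plain_text, not into the original markup.
    """
    parts = []
    links = []
    cursor = 0    # current position in the markup
    out_len = 0   # length of the plain text produced so far
    for match in WIKILINK.finditer(markup):
        before = markup[cursor:match.start()]
        parts.append(before)
        out_len += len(before)

        title = match.group(1).strip()
        link_text = (match.group(2) or match.group(1)).strip()
        links.append((link_text, title, out_len, out_len + len(link_text)))

        parts.append(link_text)
        out_len += len(link_text)
        cursor = match.end()
    parts.append(markup[cursor:])
    return "".join(parts), links


plain, links = extract_interlinks(
    "He used [[Latent Dirichlet allocation|LDA]] for [[topic model]]s."
)
# Each span can be checked against the plain text it indexes into.
for link_text, _title, begin, end in links:
    assert plain[begin:end] == link_text
```

Keeping the offsets relative to the cleaned text, rather than to the markup, means they stay valid for whatever text ends up in the output.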

@piskvorky
Owner

piskvorky commented Nov 13, 2017

The corpora and code included with gensim are restricted to topic modelling and unsupervised text processing. We're not aiming to be "everything for everybody".

Including other types of information (supervised labels, graph structure) is possible but needs to be clearly motivated.

@napsternxg how would you use this extra information? What is the intended application?

@napsternxg
Contributor

@piskvorky I understand the requirement for gensim to stay focused on topic modelling and unsupervised text processing.

The major application area is using the multi-word units in Wikipedia, which are usually linked to other wiki pages, as components of topic models and other text processing. For example, simple tokenization will split phrases like "Barack Obama" or "Natural Language Processing". Although there is support for extracting n-grams using the Phrases module, a more principled approach when processing wiki pages would be to identify these phrases as a single concept (which is very easy to do for Wikipedia). Mapping the wiki link text to the wiki page would allow these phrases to be normalized to a common concept in Wikipedia. For example, "LDA" can mean both "Latent Dirichlet allocation" and "Linear Discriminant Analysis", and the link target tells the two apart. Normalizing different surface forms to one page title will also help in reducing the vocabulary size.

Finally, the motivation for allowing offset and end values in the JSON data was to help in overriding tokenization flaws, especially with biomedical and chemical names.
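
To make the use case concrete, here is a rough sketch of how such spans could override a naive tokenizer, assuming the tuple layout `(link_text, title, begin, end)` proposed earlier in the thread. The `tokenize_with_links` helper is hypothetical and uses plain whitespace splitting only for illustration.

```python
def tokenize_with_links(text, links):
    """Whitespace-tokenize text, emitting each linked span as a single token.

    links is an iterable of (link_text, title, begin, end) tuples with
    non-overlapping offsets into text. The token for a linked span is the
    target page title with spaces replaced by underscores, so different
    surface forms of one concept map to the same token.
    """
    tokens = []
    cursor = 0
    for _link_text, title, begin, end in sorted(links, key=lambda link: link[2]):
        tokens.extend(text[cursor:begin].split())  # ordinary text before the link
        tokens.append(title.replace(" ", "_"))     # the whole link as one unit
        cursor = end
    tokens.extend(text[cursor:].split())           # trailing text after the last link
    return tokens


text = "Both LDA and linear discriminant analysis are common."
first = text.index("LDA")
second = text.index("linear discriminant analysis")
example_links = [
    ("LDA", "Latent Dirichlet allocation", first, first + len("LDA")),
    ("linear discriminant analysis", "Linear discriminant analysis",
     second, second + len("linear discriminant analysis")),
]
tokens = tokenize_with_links(text, example_links)
```

Here the ambiguous abbreviation is resolved by its link target, and the three-word phrase survives tokenization as one token.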

These were the use cases I had in mind. I would be happy to see this feature since I have been quite impressed with the processing speed of algorithms in gensim, and the wikipedia dump parser appears to be very fast.

Another alternative would be to use the segment_all_articles generator to add this feature as a post processing step using the article_sections variable. However, this would require that article_sections contains the original text and not the filter_wiki plain text.
https://github.com/RaRe-Technologies/gensim/blob/07c3130283a7512f74293a18eff4344cdbe85f94/gensim/scripts/segment_wiki.py#L83

@menshikh-iv
Contributor Author

Thank you @napsternxg. Would you like to try implementing this feature? That would be great!

@napsternxg
Contributor

I can have a look at it after December 15th. Will send a PR then.

@steremma
Contributor

steremma commented Jan 16, 2018

Hey @napsternxg
I have been working on adding this feature; you can check the PR. At the moment the JSON output contains a list of all interlinks found in the article (rather than presenting the interlinks per section). Is there any reason why you would want to know which section an interlink came from? If yes, we can make the change (it won't be a huge modification). Otherwise we can merge into develop.

@piskvorky any opinions?

@piskvorky
Owner

piskvorky commented Jan 19, 2018

Thank you for the explanation, that makes sense.

I don't think identifying the interlink location down to a section is critical. But the voice of people who actually use this feature is more important than mine -- do you think the section is important? What are the pros/cons?

@napsternxg
Contributor

@steremma this is great, thanks for adding this. My use case was being able to identify the multi-word unit in the text along with the wiki page it points to. I don't think the current approach can cover this, because it removes that information and only retains the link to the wiki page. If we could also keep the interlink text and identify which wiki page it points to, that would help in training multi-word word vectors more effectively. That said, the current approach is also quite useful, as we can include the interwiki links as document tags and train the document embeddings with that information.

@menshikh-iv
Contributor Author

@steremma is it possible to do what @napsternxg suggested?

@steremma
Contributor

steremma commented Jan 20, 2018

I am manually checking sample wiki pages in our test set and it appears that in most cases the text link is exactly the same as the title it points to. There are a few cases where the text is altered a little bit.

So adding this map would show an output mostly like this:
"computer science": "computer science", "mathematics": "mathematics" ...
but with occasional differences like "Android": "[Android (operating system)"

Doing this would make a difference in my implementation, because I am now using the filtered text to find the interlinks and, as @napsternxg mentioned, the exact article title is lost. We would need to instead duplicate the filter_wiki logic with a small change in one of the regular expressions used.

EDIT: It can be easily done by adding another boolean argument to filter_wiki. This will have a default value to make sure existing calls get the same results but when called with a False value will not modify the interlinks.
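
The idea in the EDIT can be sketched as follows. This is only an illustration of the default-argument pattern, not the actual `filter_wiki` code: `strip_markup`, `simplify_links` and the `LINK` pattern are hypothetical names, and the filtering is reduced to two toy steps.

```python
import re

# Assumption: a simplified pattern for [[Title]] and [[Title|text]] links.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def strip_markup(raw, simplify_links=True):
    """Toy stand-in for a wiki markup filter.

    With the default simplify_links=True, links are replaced by their visible
    text, so existing callers see unchanged behaviour. With
    simplify_links=False, links are left intact so that a later step can
    still recover the target titles.
    """
    text = re.sub(r"'{2,}", "", raw)  # drop bold/italic quote markup
    if simplify_links:
        text = LINK.sub(lambda m: m.group(2) or m.group(1), text)
    return text


raw = "'''Maps''' runs on [[Android (operating system)|Android]]."
simplified = strip_markup(raw)                        # old behaviour
preserved = strip_markup(raw, simplify_links=False)   # links kept for extraction
```

Because the new argument defaults to the old behaviour, no existing call site needs to change.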

@steremma
Contributor

Done, please check updated PR

sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
…Fix piskvorky#1712 (piskvorky#1839)

* promoting the markup gives up information needed to find the interlinks

* Add interlinks to the output of `segment_wiki`

* New output format is (str, list of (str, str), list of str), reflecting
structure (title, [(section_heading, section_content), ...], [interlink, ...])

* `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text
as this will give up valuable information for the interlink discovery

* Fixed PEP 8

* Refactoring indentation and variable names

* Removed debugging code from script

* Fixed a bug where interlinks with a description or multiple names were disregarded

* Due to preprocessing in `filter_wiki` interlinks containing alternative names had
one of the 2 `[` and `]` characters removed. The regex now takes that into account.

* Now stripping whitespace off section titles

* Unit test `gensim.scripts.segment_wiki`

* Initiate unit testing for all scripts.

* Check for expected len given article filtering (namespace, size in characters and redirections).

* Check for yielded title, section headings and texts as well as interlinks yielded from generator.

* Check that the same is correctly persisted in JSON.

* Fix PEP 8

* Fix Python 3.5 compatibility

* Section text now completely clean from wiki markup

* Refactored filtering functions in `wikicorpus.py` so that
uncaught markup can be optionally promoted to plain text

* Interlink extraction logic moved to `wikicorpus.py`

* Unit tests modified accordingly

* Added extra logging info to troubleshoot weird Travis behavior

* Fix PEP 8

* pin workers for segment_and_write_all_articles

* Get rid of debugging stuff

* Get rid of global logger

* Interlinks are now mapping from the linked article's title to the actual interlink text

* Used a boolean argument with a default value in `filter_wiki`. The default value keeps the old functionality
so that existing code does not break

* Overriding the default argument causes interlinks to not be simplified and lets `find_interlinks` create the mappings

* Moved regex outside function

* Interlink extraction is now optional and controlled with the `-i` command line argument

* PEP 8 long lines

* made scripts tests aware of the optional interlinks argument

* Updated script help output for interlinks
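
For the graph use case from the original request, a rough sketch of consuming such output could look like the following. It assumes one JSON object per line with a `"title"` key and an `"interlinks"` mapping from linked article title to link text; these key names and the sample records are assumptions for illustration, so check the real output of the script before relying on them. `build_link_graph` is a hypothetical helper.

```python
import json
from collections import defaultdict

# Assumption: made-up records standing in for lines of the script's output.
sample_lines = [
    json.dumps({
        "title": "Topic model",
        "interlinks": {
            "Latent Dirichlet allocation": "LDA",
            "Machine learning": "machine learning",
        },
    }),
    json.dumps({
        "title": "Latent Dirichlet allocation",
        "interlinks": {"Topic model": "topic model"},
    }),
]


def build_link_graph(lines):
    """Map each article title to the set of article titles it links to."""
    graph = defaultdict(set)
    for line in lines:
        article = json.loads(line)
        # Articles written without interlink extraction may lack the key.
        interlinks = article.get("interlinks", {})
        graph[article["title"]].update(interlinks)  # dict keys are target titles
    return dict(graph)


graph = build_link_graph(sample_lines)
```

The same loop works on an open file handle, since it only needs an iterable of lines.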