Is there any mapping between different English wordnet? #176

Closed
rudaoshi opened this issue Oct 18, 2022 · 4 comments
Labels
question Further information is requested

Comments

@rudaoshi

There have been many English wordnets, and I wonder whether there is any mapping between the synset ids in these wordnets, for example oewn/ewn <-> omw.

If there is, please tell me how to get the mapping.

Thank you ~

@fcbond
Collaborator

fcbond commented Oct 18, 2022 via email

@goodmami goodmami added the question Further information is requested label Oct 20, 2022
@goodmami
Owner

@rudaoshi, to add to what @fcbond said, in Wn you can use the ili member of a synset to see equivalent synsets across versions or even across lexicons for another language:

>>> import wn
>>> oewn = wn.Wordnet('oewn')
>>> wn30 = wn.Wordnet('omw-en')
>>> oewn.synsets('penumbra')[0].ili
ILI('i110430')
>>> wn30.synsets('penumbra')[0].ili
ILI('i110430')
>>> wn30.synsets(ili='i110430')[0].lemmas()
['penumbra']
>>> wnja = wn.Wordnet('omw-ja')
>>> wnja.synsets(ili='i110430')[0].lemmas()
['半影']
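
The per-synset lookup above can be turned into an explicit id-to-id table. The following is only a sketch: `map_by_ili` and `_ili_key` are hypothetical helpers, not part of Wn, and they assume that each synset has an `.id` and an `.ili` attribute, where `.ili` is `None`, a plain string, or an object with an `.id` attribute.

```python
from collections import defaultdict


def _ili_key(synset):
    """Return the ILI id of a synset, or None if it has no ILI."""
    ili = synset.ili
    # Assumption: `ili` is None, a plain string, or an object with `.id`.
    return getattr(ili, "id", ili)


def map_by_ili(source_synsets, target_synsets):
    """Map each source synset id to the target synset ids sharing its ILI."""
    # Index the target synsets by their ILI id.
    by_ili = defaultdict(list)
    for ss in target_synsets:
        key = _ili_key(ss)
        if key is not None:
            by_ili[key].append(ss.id)

    # Keep only source synsets whose ILI also occurs in the target.
    mapping = {}
    for ss in source_synsets:
        key = _ili_key(ss)
        if key is not None and key in by_ili:
            mapping[ss.id] = list(by_ili[key])
    return mapping


# Possible usage (requires Wn with both lexicons downloaded):
#   import wn
#   mapping = map_by_ili(wn.Wordnet('omw-en').synsets(),
#                        wn.Wordnet('oewn').synsets())
```

Synsets without an ILI, or whose ILI is absent from the target lexicon, are simply left out of the result.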

For the omw-en lexicons (which are directly converted from the Princeton WordNet with very few changes), the sensekeys are available as the identifier metadata of senses, but these are not available for other lexicons:

>>> wn30.senses('penumbra')[0].metadata()
{'identifier': 'penumbra%1:26:00::'}
>>> oewn.senses('penumbra')[0].metadata()
{}
>>> wnja.senses('半影')[0].metadata()
{}
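
For lexicons that do carry sensekeys, the metadata can be collected into a lookup table. This is a sketch under stated assumptions: `sensekey_index` is a hypothetical helper, and it assumes that each sense has an `.id` attribute and that `metadata()` returns a dict which may contain an `'identifier'` entry, as in the output above.

```python
def sensekey_index(senses):
    """Map sensekeys (the 'identifier' metadata) to sense ids."""
    index = {}
    for sense in senses:
        # Senses without sensekey metadata return a dict lacking the key.
        key = sense.metadata().get("identifier")
        if key:
            index[key] = sense.id
    return index


# Possible usage (requires Wn with omw-en downloaded):
#   import wn
#   index = sensekey_index(wn.Wordnet('omw-en').senses())
```

For lexicons such as oewn or omw-ja, where the metadata is empty, the result would be an empty dict.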

@ekaf

ekaf commented Oct 23, 2022

Thanks @goodmami and @fcbond. I did not understand this correctly before, but I think I am starting to get a more accurate picture of the implicit "mapping" in Wn. It seems that Wn does no mapping by itself, but loads resources that were previously mapped to ILI.
This mapping was done by external projects: OMW mapped the multilingual wordnets using the ili-map-pwn30.tab file from CILI-1.0, while OEWN used the corresponding pwn31 mapping.
Joining these two mappings gives an intersection of 117583 identifiers, while only 117441 of them are found in OEWN 2021.

import wn

def ili_loss(wnstring1, wnstring2):
    # WN 1: load the first wordnet and count its ILIs
    wn1 = wn.Wordnet(wnstring1)
    v1 = wn1.lexicons()
    i1 = wn1.ilis()
    n1 = len(i1)
    print(f"{v1}: {n1} synsets")
    # WN 2: load the second wordnet and count its ILIs
    wn2 = wn.Wordnet(wnstring2)
    v2 = wn2.lexicons()
    i2 = wn2.ilis()
    n2 = len(i2)
    print(f"{v2}: {n2} synsets")
    # Intersection: ILIs present in both wordnets
    ii = set(i1).intersection(i2)
    ni = len(ii)
    print(f"Intersection: {ni} synsets")
    # Loss: ILIs of the first wordnet that are missing from the second
    loss = n1 - ni
    pct = 100 * loss/n1
    print(f"Loss: {loss} synsets ({round(pct,2)})%")

ili_loss('omw-en', 'oewn')

[<Lexicon omw-en:1.4 [en]>]: 117659 synsets
[<Lexicon oewn:2021 [en]>]: 120039 synsets
Intersection: 117441 synsets
Loss: 218 synsets (0.19)%

ili_loss('omw-ja', 'oewn')

[<Lexicon omw-ja:1.4 [ja]>]: 57184 synsets
[<Lexicon oewn:2021 [en]>]: 120039 synsets
Intersection: 57076 synsets
Loss: 108 synsets (0.19)%

ili_loss('omw-arb', 'oewn')

[<Lexicon omw-arb:1.4 [arb]>]: 9916 synsets
[<Lexicon oewn:2021 [en]>]: 120039 synsets
Intersection: 9887 synsets
Loss: 29 synsets (0.29)%

I suppose that a part (though not all) of this difference can be attributed to #179.
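
To see which identifiers are lost rather than only how many, the set difference can be listed directly. This is a sketch: `ili_difference` is a hypothetical helper that accepts either plain ILI id strings or objects with an `.id` attribute, which is an assumption about what the `ilis()` call above returns.

```python
def ili_difference(ilis1, ilis2):
    """Return the sorted ILI ids present in ilis1 but absent from ilis2."""
    # Assumption: each item is a plain string or an object with `.id`.
    ids1 = {getattr(i, "id", i) for i in ilis1}
    ids2 = {getattr(i, "id", i) for i in ilis2}
    # Sorting is lexicographic on the id strings, only to make output stable.
    return sorted(ids1 - ids2)


# Possible usage (requires Wn with both lexicons downloaded):
#   import wn
#   lost = ili_difference(wn.Wordnet('omw-en').ilis(),
#                         wn.Wordnet('oewn').ilis())
```

Inspecting the returned ids should help separate the losses caused by #179 from those that have another cause.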

@goodmami
Owner

It seems like the original question has been answered.
