Addition to NLTK migration guide w.r.t. offsets #183

BramVanroy · 2023-03-24T14:55:00Z

Is your feature request related to a problem? Please describe.
Hello

I have access to WordNet synset offset IDs that I retrieve from an API (key: wnSynsetOffset). They look like this: wn:00981304a. It is relatively straightforward to look up the corresponding synsets through NLTK:

from nltk.corpus import wordnet as nltk_wn

offset = "wn:00981304a"
offset_id = int(offset.split(":")[-1][:-1])
pos = offset[-1]
syns = nltk_wn.synset_from_pos_and_offset(pos, offset_id)

However, it is not clear to me how to port this approach to wn. I prefer wn's API and specifically want to use its translate feature, which is why I want to make the transition.

Describe the solution you'd like
Perhaps a description in the documentation? I think that this section is relevant, but it is not clear to me how to apply it to a concrete use case, so a real-world example would be helpful.

Describe alternatives you've considered

I have tried the following manipulations but none of them work (yielding empty synset lists):

wn.synsets("wn:00981304a")
wn.synsets("00981304a")
wn.synsets("981304a")
wn.synsets("981304", pos="a")


fcbond · 2023-03-24T15:44:10Z

Hi,

If you have a wordnet derived from PWN 3.0 with the same offsets, then it can be done as follows:

>>> import wn
>>> ewn = wn.Wordnet('omw-en:1.4')
>>> ewn.synset('omw-en-00981304-s')
Synset('omw-en-00981304-s')

Many people (including omw 1.0) treat all satellite adjectives (pos 's') as adjectives (pos 'a').
wn does not, so if you look up something with pos 'a' and it doesn't work, then it is worth also looking up 's'. So something like the following should get you what you want.

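def offset2synset(wordnet, offset):
    # offset looks like 'wn:00981304a': drop the 'wn:' prefix, split off the POS letter
    number, pos = offset[3:-1], offset[-1]
    try:
        return wordnet.synset(f'omw-en-{number}-{pos}')
    except wn.Error:
        if pos != 'a':
            return None
    # pos 'a' may really be a satellite adjective, which is stored with pos 's'
    try:
        return wordnet.synset(f'omw-en-{number}-s')
    except wn.Error:
        return None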

>>> print(offset2synset(ewn, 'wn:00981304a'))
Synset('omw-en-00981304-s')
>>> print(offset2synset(ewn, 'wn:02001858v'))
Synset('omw-en-02001858-v')

goodmami · 2023-03-29T06:31:33Z

@BramVanroy thanks for the good questions (here and on the https://github.com/goodmami/penman project, too 👋). I agree that the documentation could be improved in this area, possibly in the NLTK migration guide.

And thanks, @fcbond, for the good description and solution.

The basic problem is that synset offsets (which are specific to each wordnet version) are not an inherent part of the WN-LMF formatted lexicons that Wn uses. For some lexicons (mainly the omw- ones), however, the WordNet 3.0 offsets are conventionally used in the synset identifiers, so you just need to reformat the identifier appropriately, as @fcbond demonstrated.
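As a rough illustration of that reformatting step (not Wn's own implementation), the sketch below splits an API key such as wn:00981304a into its numeric offset and part-of-speech letter and rebuilds an identifier from them. The helper names parse_offset_key and to_synset_id are hypothetical, and the identifier pattern is assumed from the omw-en-00981304-s style IDs shown in this thread.

```python
from typing import Tuple


def parse_offset_key(key: str) -> Tuple[int, str]:
    """Split a key such as 'wn:00981304a' into (offset, pos)."""
    body = key.split(":")[-1]  # drop the 'wn:' prefix if present
    return int(body[:-1]), body[-1]


def to_synset_id(offset: int, pos: str, prefix: str = "omw-en") -> str:
    """Build an ID shaped like 'omw-en-00981304-s' (assumed pattern)."""
    # The offset is zero-padded to 8 digits, matching the IDs in this thread.
    return f"{prefix}-{offset:08}-{pos}"


offset, pos = parse_offset_key("wn:00981304a")
print(to_synset_id(offset, pos))  # omw-en-00981304-a
```

The resulting string is only a candidate identifier: whether a synset with that ID exists depends on the lexicon being queried.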

Note that I also have an unmerged nltk branch that tries to implement NLTK's API as a shim on top of Wn. Its of2ss() function is implemented using the same wn.util.synset_id_formatter() function you linked to above:

wn/wn/nltk_api.py

Lines 329 to 342 in 5092e62

    
_ssid_from_pos_and_offset = _synset_id_formatter(prefix='omw-en')

def of2ss(of: str) -> Synset:
    pos = of[-1]
    offset = int(of[:8])
    ssid = _ssid_from_pos_and_offset(pos=pos, offset=offset)
    try:
        synset = Synset(_wn30.synset(ssid))
    except _wn.Error:
        raise _wn.Error(
            f'No WordNet synset found for pos={pos} at offset={offset}.'
        )
    return synset

@fcbond said:

Many people (including omw 1.0) treat all satellite adjectives (pos 's') as adjectives (pos 'a').
wn does not

This is not entirely true. Wn does conflate s and a in the wn.ic, wn.morphy, wn.similarity, and wn.taxonomy modules, but it's true that it does not do so on the standard synset-lookup functions.
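Since the standard lookup functions keep s and a apart, a caller has to try both tags when the source data only says a. A minimal sketch of that fallback, written without importing Wn so it can be run anywhere: candidate_ids and first_match are hypothetical helper names, the ID pattern is assumed from the identifiers in this thread, and a plain dict stands in for a lexicon.

```python
from typing import Callable, List, Optional, Type, TypeVar

T = TypeVar("T")


def candidate_ids(key: str, prefix: str = "omw-en") -> List[str]:
    """List the synset IDs worth trying for a key such as 'wn:00981304a'."""
    body = key.split(":")[-1]
    number, pos = body[:-1].zfill(8), body[-1]
    # An 'a' in the source data may really be a satellite adjective ('s').
    tags = [pos, "s"] if pos == "a" else [pos]
    return [f"{prefix}-{number}-{tag}" for tag in tags]


def first_match(ids: List[str], lookup: Callable[[str], T],
                not_found: Type[Exception] = KeyError) -> Optional[T]:
    """Return the first successful lookup, or None if every ID is missing."""
    for ssid in ids:
        try:
            return lookup(ssid)
        except not_found:
            continue
    return None


# Stand-in for a lexicon: a dict whose lookup raises KeyError on a miss.
fake_lexicon = {"omw-en-00981304-s": "satellite adjective synset"}
print(first_match(candidate_ids("wn:00981304a"), fake_lexicon.__getitem__))
```

With Wn itself, the lookup argument would presumably be the synset method of a wn.Wordnet object and not_found would be wn.Error, as in the functions posted in this thread.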

BramVanroy · 2023-04-04T08:15:01Z

Hello @fcbond and @goodmami

First, thanks for the help! I settled for this:

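import logging
from typing import Optional

import wn


def offset2omw_synset(wnet: wn.Wordnet, offset: str) -> Optional[wn.Synset]:
    # Strip the "wn:" prefix and left-pad to 8 digits plus the POS letter.
    offset = offset.replace("wn:", "").zfill(9)
    wnid = f"omw-en-{offset[:-1]}-{offset[-1]}"
    wnid_s = None

    try:
        return wnet.synset(wnid)
    except wn.Error:
        if wnid[-1] == "a":
            # Satellite adjectives are stored with pos "s", so retry with that tag.
            # wnid already carries the "omw-en-" prefix; only the final tag changes.
            wnid_s = f"{wnid[:-1]}s"
            try:
                return wnet.synset(wnid_s)
            except wn.Error:
                pass

    logging.warning(f"Could not find offset {offset} ({wnid}{' or ' + wnid_s if wnid_s else ''}) in {wnet.lexicons()}")
    return None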

@goodmami, I looked at the NLTK branch, and while I think it would be very useful, I just needed a quick function that I could easily plug into my code (without having to install from GitHub). I do think it would be a useful API to have, although I can imagine it is a lot of work!

And thank you for your work. It seems a coincidence that you are providing exactly the tools that I need for my work. I am very thankful and motivated that you created these libraries - and that they work so well and are well-documented! I've also peeked at the internals/API and documentation to inspire my own work, so a big thank you!

goodmami · 2023-04-08T22:38:14Z

Thanks for the kind words, @BramVanroy! And I'm glad you were able to find a solution. I'm going to keep the issue open because, as the issue title states, I think this sort of information would be useful in the documentation, so the issue should be closed when that happens.

BramVanroy added the enhancement New feature or request label Mar 24, 2023

goodmami added the documentation Improvements or additions to documentation label Mar 29, 2023

BramVanroy closed this as completed Apr 4, 2023

goodmami reopened this Apr 8, 2023
