What is the best way to add custom attributes to Tokens? #860

Closed
acowlikeobject opened this issue Feb 26, 2017 · 6 comments
Labels
usage General spaCy usage

Comments

@acowlikeobject

We need to add domain-specific annotations to Token instances after spaCy's parsing. Being able to add attributes to the Token class so that we can continue to use spaCy's Doc/Token/Span/etc. constructs would be very clean.

Is this feasible to do from within Python given that Tokens are cdefs in cython? If not, is there another way to achieve something similar?

  • Operating System: Debian 8
  • Python Version Used: 3.5
  • spaCy Version Used: 1.6
@honnibal
Member

Hey,

The API for this is relatively recent. You want the doc.user_data dictionary, which can be keyed however you like. An example:

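import spacy

def get_user_id(token):
    # Look up the annotation on the parent Doc; returns None if unset.
    return token.doc.user_data.get((token.i, u'user_id'))

def set_user_id(token, value):
    # Store the annotation on the parent Doc, keyed by token index.
    token.doc.user_data[(token.i, u'user_id')] = value

nlp = spacy.load('en')
doc = nlp(u'I like billy90210')

# Annotate the third token ("billy90210") with a user ID.
set_user_id(doc[2], u'e7f67231')

for token in doc:
    print(token.text, get_user_id(token))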

To make the functionality feel more "native", we'd like to add a property to the token. Unfortunately there's no generic support for this in the code atm. This should probably change --- we should probably allow you to use the Python descriptor protocol, so you can write a custom getter/setter. The simple case of this sort of key association should also be covered.
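To give a rough idea of what such a descriptor-based getter/setter could look like, here is a minimal pure-Python sketch. `SketchDoc`, `SketchToken` and `UserDataAttr` are hypothetical stand-ins invented for this illustration, not spaCy classes; as noted above, the real Token has no generic support for this. The only thing borrowed from the example above is the idea of keying `user_data` by `(token.i, name)`.

```python
class UserDataAttr(object):
    """Descriptor that keeps a per-token value in the parent doc's user_data."""

    def __init__(self, name):
        self.name = name

    def __get__(self, token, owner=None):
        if token is None:
            return self  # accessed on the class rather than an instance
        return token.doc.user_data.get((token.i, self.name))

    def __set__(self, token, value):
        token.doc.user_data[(token.i, self.name)] = value


class SketchToken(object):
    """Hypothetical token view: holds only a reference to the doc and an index."""

    user_id = UserDataAttr('user_id')

    def __init__(self, doc, i):
        self.doc = doc
        self.i = i

    @property
    def text(self):
        return self.doc.words[self.i]


class SketchDoc(object):
    """Hypothetical doc: the single place where annotations are stored."""

    def __init__(self, words):
        self.words = list(words)
        self.user_data = {}

    def __getitem__(self, i):
        # A fresh view is built on every access; nothing lives on the view itself.
        return SketchToken(self, i)

    def __iter__(self):
        return (SketchToken(self, i) for i in range(len(self.words)))


doc = SketchDoc(['I', 'like', 'billy90210'])
doc[2].user_id = 'e7f67231'  # written through the descriptor into doc.user_data

for token in doc:
    print(token.text, token.user_id)
```

Because the descriptor reads and writes `doc.user_data`, two separate views of the same position always agree, which is the "single source of truth" property the views are meant to preserve.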

One solution would be for you to just compile a fork, with the attribute hard-coded onto the Token. This isn't a bad solution, as your changes will almost surely merge cleanly each time. Another solution would be to subclass the Doc object, and customize the make_doc method on Language to return your Doc object. Your custom Doc object could then return a custom Token object. This is rather ugly, which is why the user_data attribute was added.

Finally, a thing to note: don't write directly onto the Token objects. These should be views that refer to the parent Doc. This way we maintain a single source of truth.

@honnibal honnibal added the usage General spaCy usage label Feb 27, 2017
@acowlikeobject
Author

Thanks, Matt. So, till this becomes more native, doc.user_data is a "free-form" dict - the rest of spaCy has no idea what's in the dict, and doesn't touch it?

E.g., it doesn't look from the source that anything special is done with user_data on Span.merge(), etc. It's up to me to maintain this dict, correct?

And thanks for the tip on views to the Doc.
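Since the dict is keyed by token index, merging a span shifts the indices of every later token, so the keys need to be rewritten by hand. A minimal sketch of one way to do that follows. `rekey_after_merge` is a hypothetical helper, not part of spaCy, and it assumes every key is a `(token_index, name)` tuple as in the example above.

```python
def rekey_after_merge(user_data, start, end):
    """Return a new dict with keys adjusted for merging tokens [start, end) into one.

    Assumes every key is a (token_index, name) tuple.
    """
    shift = (end - start) - 1  # number of token positions that disappear
    rekeyed = {}
    # Sort by key so that, on a clash, the lowest original index wins.
    for (i, name), value in sorted(user_data.items(), key=lambda item: item[0]):
        if i < start:
            new_i = i            # before the merged span: unchanged
        elif i < end:
            new_i = start        # inside the span: collapses onto the merged token
        else:
            new_i = i - shift    # after the span: shifted left
        rekeyed.setdefault((new_i, name), value)
    return rekeyed


old = {(0, 'tag'): 'a', (1, 'tag'): 'b', (2, 'tag'): 'c', (3, 'tag'): 'd'}
new = rekey_after_merge(old, 1, 3)  # tokens 1 and 2 become a single token
print(new)
```

Keeping the lowest-index value on a clash is an arbitrary choice for this sketch; a real application might prefer to combine the values instead.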

@honnibal
Member

Correct, it's a free-form dict. It's also not serialised at the moment, unfortunately.
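Until the dict is serialised along with the doc, one workaround is to save it separately and reattach it after reloading. The sketch below uses the standard-library `pickle` module; `dump_user_data` and `load_user_data` are hypothetical helper names, and it assumes all keys and values in the dict are picklable.

```python
import pickle


def dump_user_data(user_data):
    """Serialise a user_data-style dict to bytes (tuple keys are fine for pickle)."""
    return pickle.dumps(dict(user_data), protocol=2)


def load_user_data(payload):
    """Restore the dict; only unpickle data from a source you trust."""
    return pickle.loads(payload)


user_data = {(2, 'user_id'): 'e7f67231'}
payload = dump_user_data(user_data)
restored = load_user_data(payload)
print(restored)

# Later, after re-creating the doc from the same text:
#     doc.user_data.update(restored)
```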

@acowlikeobject
Author

Thank you. Will close this for now and keep an eye out for tighter integration into tokens.

@BenjaminBossan

BenjaminBossan commented May 4, 2017

Is there another way to do this now? Say I want to add my own word embeddings to a token, preferably as a property that makes a lookup in my matrix. I would like to have something like:

for token in doc:
    token.myvector = property(get_vector(token.text))

...

doc = nlp('...')
print(doc[0].myvector)  # -> array([1, 2, 3])

This use case does not seem to fit the user_data dict well. What would be the best way to do this? Or should it be done completely differently?

Edit:
Okay, the snippet above wouldn't work anyway, but I think it's clear what I meant.
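One possible workaround, offered only as a sketch and not confirmed anywhere in this thread, is to wrap each token in a small Python object that adds the computed attribute and forwards everything else. `VectorToken` and `PlainToken` are hypothetical names made up for this example; `PlainToken` merely stands in for a real token so the snippet runs without spaCy, and the vectors are plain lists rather than rows of a matrix.

```python
from collections import namedtuple

# Hypothetical stand-in for a real token, exposing only `i` and `text`.
PlainToken = namedtuple('PlainToken', ['i', 'text'])


class VectorToken(object):
    """Hypothetical wrapper that adds a computed `myvector` attribute to a token."""

    def __init__(self, token, vectors):
        self._token = token
        self._vectors = vectors  # mapping from word text to vector

    @property
    def myvector(self):
        # Looked up on every access; None for words without a vector.
        return self._vectors.get(self._token.text)

    def __getattr__(self, name):
        # Only called when normal lookup fails: forward to the wrapped token.
        return getattr(self._token, name)


vectors = {'like': [1.0, 2.0, 3.0]}
tokens = [PlainToken(i, w) for i, w in enumerate(['I', 'like', 'spaCy'])]
wrapped = [VectorToken(t, vectors) for t in tokens]

print(wrapped[1].text, wrapped[1].myvector)
```

The wrapper never writes onto the token itself, so the underlying objects stay untouched; the cost is that code downstream has to work with the wrapped objects rather than the originals.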

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018