Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set vector of merged entity #5085

Merged
merged 5 commits into from
Mar 6, 2020

Conversation

svlandeg
Copy link
Member

@svlandeg svlandeg commented Mar 2, 2020

Closes #5082

Description

When using the merge_entities pipe, the vector of the merged token is now appropriately set to the vector of the original span (i.e. the average of the original tokens). Before it was just a zero vector.

Types of change

enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added feat / doc Feature: Doc, Span and Token objects feat / vectors Feature: Word vectors and similarity labels Mar 2, 2020
@svlandeg svlandeg added the enhancement Feature requests and improvements label Mar 2, 2020
@adrianeboyd
Copy link
Contributor

@svlandeg: do you think this is something the retokenizer should handle more generally?

@svlandeg
Copy link
Member Author

svlandeg commented Mar 3, 2020

Yea, perhaps? I was doubting where to put it. You think it would be better moved to the _merge function?

@adrianeboyd
Copy link
Contributor

I think it might make more sense to have it near .tensor here:

# Resize the doc.tensor, if it's set. Let the last row for each token stand
# for the merged region. To do this, we create a boolean array indicating
# whether the row is to be deleted, then use numpy.delete
if doc.tensor is not None and doc.tensor.size != 0:
doc.tensor = _resize_tensor(doc.tensor,
[(m[0].start, m[0].end) for m in merges])

I was a little worried initially that it would be tricky to add new vectors if the table is full, but I see that it gets resized automatically.

One downside if it's not optional is that the size of the vectors could keep growing and growing, e.g., in a situation like #5083? I'm not sure this is a major concern, but I was just looking at all the places that can grow...

@svlandeg
Copy link
Member Author

svlandeg commented Mar 3, 2020

Ok I moved it. The issue of the growing Vocab probably needs a more generic solution that is out-of-scope for this PR...

@honnibal honnibal merged commit 1a2b8fc into explosion:master Mar 6, 2020
@svlandeg svlandeg deleted the feature/merged-entity-vectors branch March 6, 2020 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects feat / vectors Feature: Word vectors and similarity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Option to average vectors when using "merge_entity" pipeline.
3 participants