
Sentence index when splitting long sentences into non-overlapping chunks #98

Open
nikarjunagi opened this issue Feb 23, 2022 · 0 comments


Hi @mandarjoshi90, thanks much for this awesome library.

Quick question: I am attempting coreference resolution on a corpus where the token count of many sentences exceeds max_segment_len (say, 384 for spanbert_base). I am handling this by splitting such sentences into multiple non-overlapping segments.

My questions:

  1. Is this a valid approach? (in line with your response to another question here: Suggestion for doing core for longer sequences? #33)
  2. Let’s say the sentence index of a sample long sentence is X. When the tokens of this sentence are chunked between 2 segments (S1 and S2), will the sentence index for tokens in both S1 and S2 be X? Or does this need to be handled differently?
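To make question 2 concrete, here is a minimal sketch of the chunking I have in mind (chunk_sentence and the field names are my own illustration, not this repo's API). Each token would keep the sentence index X of the sentence it came from, even when the tokens end up in different segments:

```python
def chunk_sentence(tokens, sentence_index, max_segment_len):
    """Split one long tokenized sentence into non-overlapping segments,
    keeping the original sentence index for every token."""
    segments = []
    for start in range(0, len(tokens), max_segment_len):
        segment_tokens = tokens[start:start + max_segment_len]
        # Every token keeps the index of its original sentence (X),
        # regardless of which segment it lands in.
        segment_indices = [sentence_index] * len(segment_tokens)
        segments.append((segment_tokens, segment_indices))
    return segments

# Example: a 10-token sentence (index X = 7) split with max_segment_len = 4
tokens = [f"tok{i}" for i in range(10)]
segments = chunk_sentence(tokens, sentence_index=7, max_segment_len=4)
```

Is keeping the same index across S1 and S2, as above, the right thing to do?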

Thank you.
