
Sentence index when splitting long sentences into non-overlapping chunks #98

Open
nikarjunagi opened this issue Feb 23, 2022 · 0 comments


Hi @mandarjoshi90, thanks much for this awesome library.

Quick question: I am attempting coreference resolution on a corpus where the token count of many sentences exceeds max_segment_len (say, 384 for spanbert_base). I am handling this by splitting such sentences into multiple non-overlapping segments.

My questions:

  1. Is this a valid approach? (in line with your response to another question here: Suggestion for doing core for longer sequences? #33)
  2. Let’s say the sentence index of a sample long sentence is X. When the tokens of this sentence are chunked between 2 segments (S1 and S2), will the sentence index for tokens in both S1 and S2 be X? Or does this need to be handled differently?
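To make question 2 concrete, here is a minimal sketch of the chunking I have in mind (chunk_sentence and the field names are my own illustration, not this repo's API). Each token would keep the sentence index X of the sentence it came from, even when the tokens end up in different segments:

```python
def chunk_sentence(tokens, sentence_index, max_segment_len):
    """Split one long tokenized sentence into non-overlapping segments,
    keeping the original sentence index for every token."""
    segments = []
    for start in range(0, len(tokens), max_segment_len):
        segment_tokens = tokens[start:start + max_segment_len]
        # Every token keeps the index of its original sentence (X),
        # regardless of which segment it lands in.
        segment_indices = [sentence_index] * len(segment_tokens)
        segments.append((segment_tokens, segment_indices))
    return segments

# Example: a 10-token sentence (index X = 7) split with max_segment_len = 4
tokens = [f"tok{i}" for i in range(10)]
segments = chunk_sentence(tokens, sentence_index=7, max_segment_len=4)
```

Is keeping the same index across S1 and S2, as above, the right thing to do?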

Thank you.
