Sentence Splitters: no sentence break in between two words with no punctuation #62

Open
wants to merge 2 commits into base: master


dhruvil410

Fix #60
Another way to fix this issue is to replace \n with a space as soon as the input arrives, i.e. to add the line sentences=replace(sentences, r"\n" => Base.SubstitutionString(" ")) at the start of the function rulebased_split_sentences(sentences). The committed code could also be extended to handle characters other than alphanumerics.
Which is the better way to fix this issue? Any other suggestions are welcome.
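
A minimal sketch of the newline-replacement idea described above, for illustration only. The helper name normalize_newlines is hypothetical and is not part of the package; it uses a plain string pattern instead of the regex and Base.SubstitutionString from the comment, which should be equivalent for a literal newline.

```julia
# Hypothetical helper (not part of the package): turn every newline into a
# space before the text is handed to the sentence splitter.
normalize_newlines(text::AbstractString) = replace(text, "\n" => " ")

text = "This is a sentence\nthat continues here. And another one."
println(normalize_newlines(text))
```

With this approach a newline between two words can no longer be mistaken for a sentence boundary, but paragraph breaks are flattened as well, which is the behavior change discussed in the reply.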


triztian commented Apr 5, 2021

I think adding tests would help make this fix more robust. Also, since it changes the output of the function, maybe make it an optional keyword argument, so that those who need the new behavior enable it explicitly rather than having it change all of a sudden.

For example, updating rulebased_split_sentences:

function rulebased_split_sentences(sentences)

So that it can be called like this:

rulebased_split_sentences(sentence, collapse_newlines=true)

With the option enabled, runs of multiple newlines are reduced to a single newline and single newlines are removed.
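
A rough sketch of how such an opt-in keyword argument might behave, written as a standalone preprocessing step rather than as a change to the package. The names preprocess and collapse_run are hypothetical, and replacing a single newline with a space (rather than deleting it outright) is an assumption made here so that adjacent words do not get glued together.

```julia
# Hypothetical helper: decide what a run of consecutive newlines becomes.
# A run of two or more collapses to one newline; a lone newline becomes a
# space (an assumption, so that "word\nword" does not turn into "wordword").
collapse_run(run::AbstractString) = length(run) > 1 ? "\n" : " "

# Hypothetical preprocessing step with an opt-in keyword argument.
# The default (false) leaves the text untouched, so existing callers
# would see no change in behavior.
function preprocess(text::AbstractString; collapse_newlines::Bool=false)
    collapse_newlines || return text
    # When the replacement is a function, it is called with each match.
    return replace(text, r"\n+" => collapse_run)
end

text = "First line\nsame sentence.\n\n\nNew paragraph."
println(preprocess(text))                          # unchanged
println(preprocess(text; collapse_newlines=true))  # newlines collapsed
```

Keeping the default at false means the function's current output stays the same unless a caller asks for the new behavior, which is the point of making it a keyword argument.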

dhruvil410 (Author)

I'm not familiar with the checks. Why didn't the code pass them?


Successfully merging this pull request may close these issues.

Sentence tokenization must ignore newline as whitespace in the default mode.
2 participants