
Platform specific split string? #10

Closed · jeroen opened this issue Apr 9, 2018 · 4 comments

jeroen commented Apr 9, 2018

If I run the example from the readme I see different output:

[Screenshot from Apr 9, 2018 showing the differing output]

Is this expected? I noticed you split by `\r\n`, which I think is Windows-specific. Did you test the package on non-Windows machines?
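
A minimal reprex of the failure mode I mean (the input string here is made up, not taken from the package):

```r
# On Unix-like systems, text usually has "\n" line endings.
x <- "line one\nline two\nline three"

# Splitting on the Windows-only "\r\n" leaves the string intact:
strsplit(x, "\r\n")[[1]]
#> [1] "line one\nline two\nline three"

# A pattern covering all three endings works on every platform:
strsplit(x, "\r\n|\r|\n")[[1]]
#> [1] "line one" "line two" "line three"
```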

lebebr01 (Owner) commented Apr 9, 2018

This is not expected and not an issue I was aware of. I don't currently have a non-Windows machine to run local tests like this on.

When I wrote this portion of the code I was not aware of the tokenizers package. It would likely be more robust to use its tokenize_lines() function for this; a sketch follows below.

I will also brainstorm a unit test that would catch this behavior. I'm open to ideas for a good way to test it.
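
For instance, a rough sketch of what that swap might look like (the input string is made up):

```r
library(tokenizers)

# Mixed line endings, as text copied between platforms might have:
x <- "line one\r\nline two\nline three"

# tokenize_lines() splits on "\n", "\r\n", and "\r" alike,
# so the output is identical across platforms:
tokenize_lines(x)[[1]]
#> [1] "line one" "line two" "line three"
```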

lebebr01 added the bug label Apr 9, 2018
lebebr01 self-assigned this Apr 9, 2018
jeroen (Author) commented Apr 9, 2018

Yes, you should add a unit test for this. You can automatically run checks on Linux and macOS using Travis CI.

lebebr01 (Owner) commented Apr 9, 2018

e8d8e90 should fix the issue.

Added a unit test that checks for literal `"\n"` characters in the result text. Open to other ways to test this behavior.
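
Roughly along these lines (a sketch; `result_text` stands in for the package's actual split output):

```r
library(testthat)

test_that("split text contains no literal line-ending characters", {
  # Placeholder for the package's actual output after splitting:
  result_text <- c("line one", "line two", "line three")

  # No element should still contain a carriage return or newline:
  expect_false(any(grepl("[\r\n]", result_text)))
})
```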

lebebr01 closed this as completed Apr 9, 2018
jeroen (Author) commented Apr 10, 2018

OK, thanks. BTW, you could reduce your dependency weight by calling stringi::stri_split_lines() and stringi::stri_split_boundaries(x, type = "word") directly rather than via tokenizers.
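
A quick sketch of the direct calls (throwaway input string):

```r
library(stringi)

x <- "first line\r\nsecond line\nthird line"

# Splits on any Unicode line boundary, "\r\n" included:
stri_split_lines(x)[[1]]
#> [1] "first line"  "second line" "third line"

# ICU word boundaries; skip_word_none = TRUE drops the whitespace segments:
stri_split_boundaries("first line", type = "word", skip_word_none = TRUE)[[1]]
#> [1] "first" "line"
```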
