Problem with very large files #8
Update: It took 593 mins on my M3 and created 130324 chunks. 😄
Hey @do-me, it is surprising that […]. My assumption is that there are some unique characteristics of this particular piece of text that make it difficult to chunk quickly, and that […]. In particular, I note that:

- Splitting the text using […]

At the risk of oversimplifying, the problem as I see it is that, because your text does not have any newlines nor very many sequences of whitespace characters, what ends up happening is that […]. I will investigate potential solutions, but I cannot make any promises on how long it might take, as I need to be careful not to reduce performance for more typical use cases. For now, my tips to boost performance are:

[…]
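To make the "no newlines" failure mode above concrete, here is a deliberately simplified toy splitter. It is purely illustrative and not this library's actual algorithm; the names `chunk`, `naive_merge`, and `count_tokens` are hypothetical. When the input has no newlines, the separator hierarchy falls straight through to whitespace, producing a huge flat list of tiny pieces, and the merge step then calls the token counter on the growing candidate chunk once per piece, which is where the time goes once a real tokenizer replaces the cheap word count used here.

```python
# Toy illustration (not this library's implementation) of why a document with
# no newlines can make hierarchical chunking slow.

def naive_merge(pieces, max_tokens, count_tokens):
    """Greedily merge small pieces into chunks of at most `max_tokens` tokens."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + " " + piece).strip()
        # The token counter is called on the whole candidate for every piece.
        # With hundreds of thousands of word-level pieces and a real tokenizer,
        # these repeated counting calls dominate the runtime.
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def chunk(text, max_tokens, count_tokens):
    # Prefer coarse separators first. A file with no newlines skips straight
    # to whitespace splitting, yielding one enormous flat list of tiny pieces.
    for sep in ("\n\n", "\n", " "):
        pieces = text.split(sep)
        if len(pieces) > 1:
            return naive_merge(pieces, max_tokens, count_tokens)
    return [text]

if __name__ == "__main__":
    word_counter = lambda s: len(s.split())   # stand-in for a real tokenizer
    doc = "lorem ipsum dolor " * 50_000       # one huge line, no newlines
    print(len(chunk(doc, max_tokens=512, count_tokens=word_counter)), "chunks")
```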
No pressure from my side, thank you very much for investigating and thanks for the hints! Indeed, the missing newlines are what cause problems with many splitting algorithms, as everything is on the same "splitting level". It's a flaw in the dataset I am working with that unfortunately (for the moment) I cannot change. I guess it would also be reasonable to say those kinds of files are out of scope. In the end, I cannot think of any real-world documents where this might actually be required...
I am trying to chunk a huge document but it runs forever. Did I miss something in my code?
File here
Referencing benbrandt/text-splitter#184 in semantic-text-splitter where I can now chunk the same document in ~2s.
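For reference, a minimal sketch of the semantic-text-splitter usage behind that ~2s figure. The file name and the 1000-character capacity are placeholders, and the constructor signature has changed between releases (older versions pass the capacity to `chunks()` instead), so treat this as an assumption to check against the installed version:

```python
# Minimal sketch, assuming a recent semantic-text-splitter release where
# TextSplitter takes a capacity and .chunks(text) returns the chunk strings.
# "huge_document.txt" and the 1000-character capacity are placeholders.
from semantic_text_splitter import TextSplitter

with open("huge_document.txt", encoding="utf-8") as f:
    text = f.read()

splitter = TextSplitter(1000)      # max characters per chunk
chunks = splitter.chunks(text)
print(f"Produced {len(chunks)} chunks")
```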