Problem with very large files #8

Open
do-me opened this issue Jun 28, 2024 · 3 comments

Comments

@do-me

do-me commented Jun 28, 2024

I am trying to chunk a huge document but it runs forever. Did I miss something in my code?

File here

import semchunk                  
import pandas as pd 

df = pd.read_parquet("free_trade_agreement.parquet")
chunk_size = 100 # A low chunk size is used here for demonstration purposes.

text = df.text.item()

# As you can see below, `semchunk.chunkerify` will accept the names of all OpenAI models, OpenAI
# `tiktoken` encodings and Hugging Face models (in that order of precedence), along with custom
# tokenizers that have an `encode()` method (such as `tiktoken`, `transformers` and `tokenizers`
# tokenizers) and finally any function that can take a text and return the number of tokens in it.
chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size)

a = chunker(text)

Referencing benbrandt/text-splitter#184 in semantic-text-splitter, where I can now chunk the same document in ~2s.

@do-me
Author

do-me commented Jun 30, 2024

Update: It took 593 mins on my M3 and created 130324 chunks. 😄

@umarbutler
Owner

Hey @do-me,
Thanks for creating this issue. I certainly haven't seen anything like this before! I can confirm that it also takes an awfully long time on my PC.

It is surprising that semantic-text-splitter is able to chunk the text in 2 seconds, as my benchmarks show semchunk to be 90% faster than semantic-text-splitter. Those benchmarks are calculated over the entire Gutenberg corpus with a chunk size of 512.

My assumption is that there are some unique characteristics of this particular piece of text that make it difficult to chunk quickly and that semantic-text-splitter has implemented some specific heuristics to make it work.

In particular, I note that:

  1. The text is extremely long, with ~86,608,271 characters and 13,399,471 words.
  2. The text has no newlines.
  3. The text does not have many long sequences of whitespace that can be used as split points.

Splitting the text using str.split() takes 1.04 seconds, and that is just a single call. semchunk makes repeated calls to token counters as it merges sequences of tokens back together. We already use some heuristics to avoid unnecessary calls, but it is difficult to add heuristics that do not sacrifice certain user guarantees or otherwise reduce performance for other use cases.

At the risk of oversimplifying, the problem as I see it is that, because your text has no newlines and very few sequences of whitespace characters, semchunk ends up splitting at individual spaces (much like text.split()) and then has to merge all the words back together until they hit the chunk size. Whenever semchunk tries to merge words together, it has to count how many tokens the potential merge would contain to make sure it won't exceed the chunk size. Those repeated calls to the token counter end up taking a very long time. I have considered multiprocessing, but it is difficult to implement for single inputs (it is already available for multiple inputs).
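To make the cost concrete, here is a deliberately naive sketch of that merge pattern. This is not semchunk's actual implementation, only an illustration of why the token counter gets called so many times when the only split points are single spaces:

# Deliberately naive illustration of the merge pattern described above.
# This is NOT semchunk's actual code; it only shows why the token counter
# ends up being called so often.
def naive_merge(words, chunk_size, token_counter):
    chunks, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        # Every candidate merge re-counts the tokens of the whole chunk so far.
        if current and token_counter(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

# With ~13.4 million words and no newlines to split on first, that is millions
# of token-counter calls, each over a steadily growing candidate string.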

I will investigate potential solutions but I cannot make any promises on how long it might take as I need to be careful to not reduce performance for more typical use cases.

For now, my tips to boost performance are:

  1. Don't use lambda functions as token counters.
  2. Provide max_token_chars as an argument to chunkerify, where max_token_chars is the maximum number of characters that can be expected to appear in a token. This can significantly speed up token counting, particularly for very long texts (see the sketch after this list).
  3. Use multiprocessing when chunking multiple texts.
  4. If you can, avoid providing very long texts without any newlines. If you have to, break them up yourself into smaller but still large chunks to be further chunked.
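Put together, tips 1 to 3 might look something like the sketch below. The tokenizer name, the max_token_chars value and the processes argument are illustrative assumptions rather than a verified recipe, so check the semchunk README for the exact signatures:

import semchunk

chunk_size = 512

# Tip 1: prefer a named tokenizer over a lambda. "gpt-4" is only an example;
# any OpenAI model, tiktoken encoding or Hugging Face model name should work.
# Tip 2: max_token_chars caps how many characters a single token can be
# expected to span, which lets token counting short-circuit on huge inputs
# (the value 6 here is an illustrative guess, not a recommendation).
chunker = semchunk.chunkerify("gpt-4", chunk_size, max_token_chars=6)

# Tip 3: pass multiple texts as a list so they can be chunked in parallel.
# The processes argument is assumed here to control the number of workers.
chunks_per_text = chunker(["first document...", "second document..."], processes=4)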

@do-me
Copy link
Author

do-me commented Jun 30, 2024

No pressure from my side, thanks very much for investigating and for the hints!

Indeed, the lack of newlines is what causes problems for many splitting algorithms, as everything sits on the same "splitting level". It's a flaw in the dataset I am working with that, unfortunately (for the moment), I cannot change.

I guess it would also be reasonable to say these kinds of files are out of scope. In the end, I cannot think of any real-world documents where this would actually be required...
