
If distances is empty in the 'gradient' option of SemanticChunker, it causes an IndexError. #26221

Closed · Aryazaky opened this issue Sep 9, 2024 · 7 comments · Fixed by #26629



Aryazaky commented Sep 9, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

embedding_model_name = "LazarusNLP/all-indo-e5-small-v4"

embedding_model = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    encode_kwargs={"normalize_embeddings": True},
)

text_splitter = SemanticChunker(
    embedding_model,
    breakpoint_threshold_type="gradient",
)
chunks = text_splitter.split_documents(docs)  # docs are pages of a pdf loaded using pymupdf

Error Message and Stack Trace (if applicable)

IndexError                                Traceback (most recent call last)
<ipython-input-9-dd360b9e5d4f> in <cell line: 31>()
     29     return chunks
     30 
---> 31 chunks = text_splitter.split_documents(docs)
     32 chunks = calculate_chunk_ids(chunks)
     33 print(f"Split into {len(chunks)} chunks")

4 frames
/usr/local/lib/python3.10/dist-packages/langchain_experimental/text_splitter.py in split_documents(self, documents)
    279             texts.append(doc.page_content)
    280             metadatas.append(doc.metadata)
--> 281         return self.create_documents(texts, metadatas=metadatas)
    282 
    283     def transform_documents(

/usr/local/lib/python3.10/dist-packages/langchain_experimental/text_splitter.py in create_documents(self, texts, metadatas)
    264         for i, text in enumerate(texts):
    265             start_index = 0
--> 266             for chunk in self.split_text(text):
    267                 metadata = copy.deepcopy(_metadatas[i])
    268                 if self._add_start_index:

/usr/local/lib/python3.10/dist-packages/langchain_experimental/text_splitter.py in split_text(self, text)
    226                 breakpoint_distance_threshold,
    227                 breakpoint_array,
--> 228             ) = self._calculate_breakpoint_threshold(distances)
    229 
    230         indices_above_thresh = [

/usr/local/lib/python3.10/dist-packages/langchain_experimental/text_splitter.py in _calculate_breakpoint_threshold(self, distances)
    155         elif self.breakpoint_threshold_type == "gradient":
    156             # Calculate the threshold based on the distribution of gradient of distance array. # noqa: E501
--> 157             distance_gradient = np.gradient(distances, range(0, len(distances)))
    158             return cast(
    159                 float,

/usr/local/lib/python3.10/dist-packages/numpy/lib/function_base.py in gradient(f, axis, edge_order, *varargs)
   1180             # if distances are constant reduce to the scalar case
   1181             # since it brings a consistent speedup
-> 1182             if (diffx == diffx[0]).all():
   1183                 diffx = diffx[0]
   1184             dx[i] = diffx

IndexError: index 0 is out of bounds for axis 0 with size 0

Description

I'm trying to use SemanticChunker's gradient option to split text. It works great for one pdf but fails on another, and I don't know why. The percentile option works for both pdfs. I think this is a bug either in LangChain or in the embedding model that I use. For now, I'm submitting a bug report in LangChain first.

System Info

langchain==0.2.16
langchain-chroma==0.1.3
langchain-community==0.2.16
langchain-core==0.2.38
langchain-experimental==0.0.65
langchain-huggingface==0.0.3
langchain-text-splitters==0.2.4

Python 3 Google Compute Engine backend

@luizguilhermedev

Facing the same error here

@RowenTey

Me too

@tibor-reiss (Contributor)

@Aryazaky Could you please share the full code, including how you loaded the pdf, so I can debug? Have you tried using a different breakpoint_threshold_amount instead of the default?

@Aryazaky (Author)

@tibor-reiss Unfortunately, my Colab notebook has been modified many times since then. But I think this was how I did it:

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

loader = DirectoryLoader(docs_path, glob="**/*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

Have you tried to use a different breakpoint_threshold_amount instead of the default?

No, I haven't. I don't know what the value range is for that. What value do you suggest?

@luizguilhermedev

I have tried different breakpoint_threshold_amount values, but I haven't got it working.

@tibor-reiss (Contributor)

@Aryazaky The range is 0.0..100.0.
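
For illustration, a minimal sketch of passing a non-default value (the 90.0 here is an arbitrary example, not a recommendation; the threshold is a percentile, so lower values should produce more breakpoints):

text_splitter = SemanticChunker(
    embedding_model,
    breakpoint_threshold_type="gradient",
    breakpoint_threshold_amount=90.0,  # assumed example value within 0.0..100.0
)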

@tibor-reiss (Contributor)

@Aryazaky The second pdf fails because the 3rd page splits into only 2 sentences with the default regex. This results in len(distances) == 1, for which np.gradient does not make sense.
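
For illustration, a standalone reproduction of the numpy failure from the traceback above (the distance value is made up; only the array length matters):

import numpy as np

# A page with 2 sentences yields a single pairwise distance, so the
# coordinate array passed to np.gradient has length 1; np.diff() on it
# is empty, and indexing diffx[0] raises the IndexError.
distances = [0.42]  # len(distances) == 1
np.gradient(distances, range(0, len(distances)))
# IndexError: index 0 is out of bounds for axis 0 with size 0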

Options (see the sketch after this list for the first two):

  • adjust the regex
  • specify number_of_chunks in the constructor
  • wait for the PR which fixes this (will open asap)
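
A sketch of the first two options, assuming the sentence_split_regex and number_of_chunks constructor parameters available in langchain_experimental at the time (verify against your installed version):

from langchain_experimental.text_splitter import SemanticChunker

text_splitter = SemanticChunker(
    embedding_model,
    breakpoint_threshold_type="gradient",
    # Option 1: split on more punctuation than the default r"(?<=[.?!])\s+"
    # so that short pages still yield more than 2 sentences (assumed variant)
    sentence_split_regex=r"(?<=[.?!;:])\s+",
    # Option 2 (alternative): derive the threshold from a target chunk count
    # number_of_chunks=10,
)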

@efriis closed this as completed in a8b2413 on Sep 20, 2024