
If distances is empty in the 'gradient' option of SemanticChunker, it causes an IndexError. #26221

Closed · Aryazaky opened this issue Sep 9, 2024 · 7 comments · Fixed by #26629



Aryazaky commented Sep 9, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

embedding_model_name = "LazarusNLP/all-indo-e5-small-v4"

embedding_model = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    encode_kwargs={"normalize_embeddings": True},
)

text_splitter = SemanticChunker(
    embedding_model,
    breakpoint_threshold_type="gradient",
)
chunks = text_splitter.split_documents(docs)  # docs are pages of a pdf loaded using pymupdf

Error Message and Stack Trace (if applicable)

IndexError                                Traceback (most recent call last)
<ipython-input-9-dd360b9e5d4f> in <cell line: 31>()
     29     return chunks
     30 
---> 31 chunks = text_splitter.split_documents(docs)
     32 chunks = calculate_chunk_ids(chunks)
     33 print(f"Split into {len(chunks)} chunks")

4 frames
/usr/local/lib/python3.10/dist-packages/langchain_experimental/text_splitter.py in split_documents(self, documents)
    279             texts.append(doc.page_content)
    280             metadatas.append(doc.metadata)
--> 281         return self.create_documents(texts, metadatas=metadatas)
    282 
    283     def transform_documents(

/usr/local/lib/python3.10/dist-packages/langchain_experimental/text_splitter.py in create_documents(self, texts, metadatas)
    264         for i, text in enumerate(texts):
    265             start_index = 0
--> 266             for chunk in self.split_text(text):
    267                 metadata = copy.deepcopy(_metadatas[i])
    268                 if self._add_start_index:

/usr/local/lib/python3.10/dist-packages/langchain_experimental/text_splitter.py in split_text(self, text)
    226                 breakpoint_distance_threshold,
    227                 breakpoint_array,
--> 228             ) = self._calculate_breakpoint_threshold(distances)
    229 
    230         indices_above_thresh = [

/usr/local/lib/python3.10/dist-packages/langchain_experimental/text_splitter.py in _calculate_breakpoint_threshold(self, distances)
    155         elif self.breakpoint_threshold_type == "gradient":
    156             # Calculate the threshold based on the distribution of gradient of distance array. # noqa: E501
--> 157             distance_gradient = np.gradient(distances, range(0, len(distances)))
    158             return cast(
    159                 float,

/usr/local/lib/python3.10/dist-packages/numpy/lib/function_base.py in gradient(f, axis, edge_order, *varargs)
   1180             # if distances are constant reduce to the scalar case
   1181             # since it brings a consistent speedup
-> 1182             if (diffx == diffx[0]).all():
   1183                 diffx = diffx[0]
   1184             dx[i] = diffx

IndexError: index 0 is out of bounds for axis 0 with size 0

Description

I'm trying to use SemanticChunker's gradient option to split text. It works great for one pdf but fails on another, and I don't know why. The percentile option works for both pdfs. I think this is a bug either in LangChain or in the embedding model that I use. For now, I'm submitting a bug report in LangChain first.

System Info

langchain==0.2.16
langchain-chroma==0.1.3
langchain-community==0.2.16
langchain-core==0.2.38
langchain-experimental==0.0.65
langchain-huggingface==0.0.3
langchain-text-splitters==0.2.4

Python 3 Google Compute Engine backend

@luizguilhermedev

Facing the same error here

@RowenTey

Me too

@tibor-reiss (Contributor)

@Aryazaky Could you please share the full code, including how you loaded the pdf, so I can debug? Have you tried using a different breakpoint_threshold_amount instead of the default?

@Aryazaky (Author)

@tibor-reiss Unfortunately, my Colab notebook has been modified many times since then. But I think this was how I did it:

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

loader = DirectoryLoader(docs_path, glob="**/*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

Have you tried to use a different breakpoint_threshold_amount instead of the default?

No, I haven't. I don't know what the value range is for that. What value do you suggest?

@luizguilhermedev

I have tried different breakpoint_threshold_amount values, but I haven't got it working.

@tibor-reiss (Contributor)

@Aryazaky The range is 0.0..100.0.
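
For illustration, a minimal sketch of passing a non-default value (the 90.0 here is an arbitrary example, not a recommendation; the threshold is a percentile, so lower values should produce more breakpoints):

text_splitter = SemanticChunker(
    embedding_model,
    breakpoint_threshold_type="gradient",
    breakpoint_threshold_amount=90.0,  # assumed example value within 0.0..100.0
)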

@tibor-reiss (Contributor)

@Aryazaky The second pdf fails because the 3rd page splits into only 2 sentences with the default regex. This results in len(distances) == 1, for which np.gradient does not make sense.
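
For illustration, a standalone reproduction of the numpy failure from the traceback above (the distance value is made up; only the array length matters):

import numpy as np

# A page with 2 sentences yields a single pairwise distance, so the
# coordinate array passed to np.gradient has length 1; np.diff() on it
# is empty, and indexing diffx[0] raises the IndexError.
distances = [0.42]  # len(distances) == 1
np.gradient(distances, range(0, len(distances)))
# IndexError: index 0 is out of bounds for axis 0 with size 0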

Options (see the sketch after this list for the first two):

  • adjust the regex
  • specify number_of_chunks in the constructor
  • wait for the PR which fixes this (will open asap)
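
A sketch of the first two options, assuming the sentence_split_regex and number_of_chunks constructor parameters available in langchain_experimental at the time (verify against your installed version):

from langchain_experimental.text_splitter import SemanticChunker

text_splitter = SemanticChunker(
    embedding_model,
    breakpoint_threshold_type="gradient",
    # Option 1: split on more punctuation than the default r"(?<=[.?!])\s+"
    # so that short pages still yield more than 2 sentences (assumed variant)
    sentence_split_regex=r"(?<=[.?!;:])\s+",
    # Option 2 (alternative): derive the threshold from a target chunk count
    # number_of_chunks=10,
)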

@efriis closed this as completed in a8b2413 on Sep 20, 2024