Fix: partition pdf overflow error (#2054)

Closes #2050.

### Summary

- set zoom to `1` if zoom is less than or equal to `0` when parsing Tesseract OCR data
- update `_determine_pdf_auto_strategy` to return the `hi_res` strategy if either `infer_table_structure` or `extract_images_in_pdf` is true

### Testing

PDF: [getty_62-62.pdf](https://github.com/Unstructured-IO/unstructured/files/13322169/getty_62-62.pdf)

Run the following code in both the `main` branch and the `current` branch.

```
from unstructured.partition.pdf import partition_pdf

# Example value only: any writable directory for the extracted images works.
path = "figures"

elements = partition_pdf(
    filename="getty_62-62.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
```
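The auto-strategy change in the summary can be sketched in isolation. This is a minimal stand-in, not the library function: `pick_auto_strategy` is a hypothetical name, and the `fast` / `ocr_only` fallbacks are an assumption about the part of `_determine_pdf_auto_strategy` that the diff does not show.

```python
def pick_auto_strategy(
    pdf_text_extractable: bool = True,
    infer_table_structure: bool = False,
    extract_images_in_pdf: bool = False,
) -> str:
    """Hypothetical stand-in for the "auto" PDF strategy choice."""
    # Behaviour from this commit: either flag is enough to require "hi_res".
    # Before the fix, only infer_table_structure was checked here.
    if infer_table_structure or extract_images_in_pdf:
        return "hi_res"
    # Assumed fallbacks (not visible in the diff): use the fast text path
    # when the PDF has extractable text, otherwise fall back to OCR.
    return "fast" if pdf_text_extractable else "ocr_only"


if __name__ == "__main__":
    print(pick_auto_strategy(extract_images_in_pdf=True))
```

With only `extract_images_in_pdf=True`, the sketch selects `hi_res`, which is the case the old check missed.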
Unstructured-IO · Nov 10, 2023 · b11c546
1 parent f8c180a
commit b11c546

Showing 5 changed files with 13 additions and 3 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.10.30-dev5
+## 0.10.30
 
 ### Enhancements
 
@@ -12,6 +12,8 @@
 
 ### Fixes
 
+* **Fix logic that determines pdf auto strategy.** Previously, `_determine_pdf_auto_strategy` returned `hi_res` strategy only if `infer_table_structure` was true. It now returns the `hi_res` strategy if either `infer_table_structure` or `extract_images_in_pdf` is true.   
+* **Fix invalid coordinates when parsing tesseract ocr data.** Previously, when parsing tesseract ocr data, the ocr data had invalid bboxes if zoom was set to `0`. A logical check is now added to avoid such error. 
 * **Fix ingest partition parameters not being passed to the api.** When using the --partition-by-api flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.
 * **Support tables in section-less DOCX.** Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
 * **Support tables that contain only numbers when partitioning via `ocr_only`** Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from `.image_to_data()`. An AttributeError was raised downstream when trying to `.strip()` the floats.

diff --git a/unstructured/__version__.py b/unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.10.30-dev5"  # pragma: no cover
+__version__ = "0.10.30"  # pragma: no cover
diff --git a/unstructured/partition/ocr.py b/unstructured/partition/ocr.py
@@ -528,6 +528,9 @@ def parse_ocr_data_tesseract(ocr_data: pd.DataFrame, zoom: float = 1) -> List[Te
       data frame will result in its associated bounding box being ignored.
     """
 
+    if zoom <= 0:
+        zoom = 1
+
     text_regions = []
     for idtx in ocr_data.itertuples():
         text = idtx.text
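
Why the guard matters can be shown with a small standalone sketch. It assumes the parser maps OCR pixel coordinates back to the original image by dividing by `zoom`; that step is not visible in this hunk, and `scale_bbox` is a hypothetical helper, not part of `unstructured`.

```python
from typing import Tuple


def scale_bbox(
    left: float, top: float, width: float, height: float, zoom: float = 1
) -> Tuple[float, float, float, float]:
    """Return (x1, y1, x2, y2) in original-image coordinates."""
    # Same guard as the patch: a non-positive zoom falls back to 1.
    if zoom <= 0:
        zoom = 1
    # Assumption: coordinates are divided by zoom. Without the guard,
    # zoom == 0 would raise ZeroDivisionError for plain Python numbers,
    # and a negative zoom would flip the box to negative coordinates.
    x1 = left / zoom
    y1 = top / zoom
    x2 = (left + width) / zoom
    y2 = (top + height) / zoom
    return (x1, y1, x2, y2)


if __name__ == "__main__":
    print(scale_bbox(10, 20, 30, 40, zoom=2))
    print(scale_bbox(10, 20, 30, 40, zoom=0))
```

Falling back to `1` treats an invalid zoom as "no scaling", so the box is returned in the OCR image's own coordinates instead of failing.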

diff --git a/unstructured/partition/pdf.py b/unstructured/partition/pdf.py
@@ -281,6 +281,7 @@ def partition_pdf_or_image(
             file=file,
             is_image=is_image,
             infer_table_structure=infer_table_structure,
+            extract_images_in_pdf=extract_images_in_pdf,
         )
         != "ocr_only"
     ):
@@ -304,6 +305,7 @@ def partition_pdf_or_image(
         is_image=is_image,
         infer_table_structure=infer_table_structure,
         pdf_text_extractable=pdf_text_extractable,
+        extract_images_in_pdf=extract_images_in_pdf,
     )
 
     if strategy == "hi_res":

diff --git a/unstructured/partition/strategies.py b/unstructured/partition/strategies.py
@@ -39,6 +39,7 @@ def determine_pdf_or_image_strategy(
     is_image: bool = False,
     infer_table_structure: bool = False,
     pdf_text_extractable: bool = True,
+    extract_images_in_pdf: bool = False,
 ):
     """Determines what strategy to use for processing PDFs or images, accounting for fallback
     logic if some dependencies are not available."""
@@ -62,6 +63,7 @@ def determine_pdf_or_image_strategy(
             strategy = _determine_pdf_auto_strategy(
                 pdf_text_extractable=pdf_text_extractable,
                 infer_table_structure=infer_table_structure,
+                extract_images_in_pdf=extract_images_in_pdf,
             )
 
     if file is not None:
@@ -124,12 +126,13 @@ def _determine_image_auto_strategy():
 def _determine_pdf_auto_strategy(
     pdf_text_extractable: bool = True,
     infer_table_structure: bool = False,
+    extract_images_in_pdf: bool = False,
 ):
     """If "auto" is passed in as the strategy, determines what strategy to use
     for PDFs."""
     # NOTE(robinson) - Currrently "hi_res" is the only stategy where
     # infer_table_structure is used.
-    if infer_table_structure:
+    if infer_table_structure or extract_images_in_pdf:
         return "hi_res"
 
     if pdf_text_extractable: