Releases: Unstructured-IO/unstructured
0.15.13
Enhancements
- Improve pdfminer image cleanup process. Optimized the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This reduces execution time when processing PDF documents.
Features
Fixes
- Fixes high memory overhead for intersection area computation. Uses numpy.float32 for coordinates and removes intermediate variables to reduce memory usage when computing intersection areas.
- Fixes the arm64 image build. arm64 builds are now fixed and will be available again starting with the 0.15.13 release.
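The memory fix above can be illustrated with a minimal sketch. It is not the project's implementation: the function name `intersection_areas` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example. It shows the two ideas named in the note, float32 coordinates and in-place operations that avoid extra intermediate arrays.

```python
import numpy as np


def intersection_areas(boxes_a, boxes_b):
    """Pairwise intersection areas for boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration only.
    """
    # float32 uses half the memory of the default float64 coordinates.
    a = np.asarray(boxes_a, dtype=np.float32)
    b = np.asarray(boxes_b, dtype=np.float32)

    # Broadcast (n, 1) against (1, m) to get (n, m) overlap widths and heights.
    width = np.minimum(a[:, None, 2], b[None, :, 2]) - np.maximum(a[:, None, 0], b[None, :, 0])
    height = np.minimum(a[:, None, 3], b[None, :, 3]) - np.maximum(a[:, None, 1], b[None, :, 1])

    # Clip negative values (no overlap) to zero in place, then multiply in
    # place so that no additional (n, m) array is allocated for the result.
    np.clip(width, 0, None, out=width)
    np.clip(height, 0, None, out=height)
    width *= height
    return width


areas = intersection_areas([[0, 0, 2, 2]], [[1, 1, 3, 3], [5, 5, 6, 6]])
```

The trade-off is reduced precision: float32 carries about seven significant decimal digits, which is assumed here to be sufficient for page coordinates.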
0.15.12
Enhancements
- Improve pdfminer element processing. Implemented splitting of pdfminer elements (groups of text chunks) into smaller bounding boxes (text lines). This prevents loss of information from the object detection model and facilitates more effective removal of duplicated pdfminer text.
0.15.10
Enhancements
- Enhance pdfminer element cleanup. Expand removal of pdfminer elements to include those inside all non-pdfminer elements, not just tables.
- Modified analysis drawing tools to dump to files and draw from dumps. If the analysis parameter of the partition_pdf function is set to True, the layouts for Object Detection, Pdfminer Extraction, OCR, and the final layout will be dumped as JSON files. The drawers now accept dict (dump) objects instead of instances of internal classes.
- Vectorize pdfminer elements deduplication computation. Use numpy operations to compute IOU and sub-region membership instead of a simple loop. This speeds up deduplication on pages with many elements.
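As a rough illustration of the vectorization described above, the sketch below computes pairwise IOU and sub-region membership with numpy broadcasting and compares the result against a plain loop. The function names, the `(x1, y1, x2, y2)` box layout, and the 0.9 membership threshold are assumptions for this example, not the library's actual code or defaults.

```python
import numpy as np


def iou_loop(boxes):
    """Reference version: plain Python loop over every pair of boxes."""
    n = len(boxes)
    out = np.zeros((n, n), dtype=np.float64)
    for i in range(n):
        for j in range(n):
            ax1, ay1, ax2, ay2 = boxes[i]
            bx1, by1, bx2, by2 = boxes[j]
            w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
            h = max(0.0, min(ay2, by2) - max(ay1, by1))
            inter = w * h
            union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
            out[i, j] = inter / union if union > 0 else 0.0
    return out


def iou_and_subregion(boxes, threshold=0.9):
    """Vectorized version: all pairs at once via broadcasting."""
    boxes = np.asarray(boxes, dtype=np.float64)
    x1, y1, x2, y2 = boxes.T
    area = (x2 - x1) * (y2 - y1)

    # (n, 1) against (1, n) broadcasts to an (n, n) grid of pairwise overlaps.
    w = np.clip(np.minimum(x2[:, None], x2[None, :]) - np.maximum(x1[:, None], x1[None, :]), 0, None)
    h = np.clip(np.minimum(y2[:, None], y2[None, :]) - np.maximum(y1[:, None], y1[None, :]), 0, None)
    inter = w * h
    union = area[:, None] + area[None, :] - inter
    iou = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

    # is_sub[i, j] is True when at least `threshold` of box i lies inside box j.
    frac = np.divide(inter, area[:, None], out=np.zeros_like(inter), where=area[:, None] > 0)
    return iou, frac >= threshold


boxes = [[0, 0, 10, 10], [1, 1, 5, 5], [20, 20, 30, 30]]
iou, is_sub = iou_and_subregion(boxes)
```

Both versions do the same O(n²) arithmetic; the vectorized one moves the pairwise loop out of the Python interpreter and into numpy, at the cost of holding several (n, n) arrays in memory.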
Features
Fixes
0.15.9
Enhancements
Features
- Add support for encoding parameter in partition_csv
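To show why an explicit encoding matters for CSV input, here is a small standard-library sketch; it does not call unstructured itself, and `read_rows` is a hypothetical helper. With the library, the corresponding call would presumably look like `partition_csv(filename="data.csv", encoding="latin-1")`, where the file name is a placeholder.

```python
import csv
import io

# "caf\u00e9" encoded as Latin-1 contains the byte 0xE9, which is not valid
# UTF-8 when followed by an ASCII comma.
raw = "name,count\ncaf\u00e9,1\n".encode("latin-1")


def read_rows(data: bytes, encoding: str):
    """Decode CSV bytes with the given encoding and return the parsed rows."""
    text = data.decode(encoding)
    return list(csv.reader(io.StringIO(text)))


# Decoding with the wrong encoding fails outright.
try:
    read_rows(raw, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the file was written in recovers the text.
rows = read_rows(raw, "latin-1")
```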
0.15.8
Enhancements
- Bump unstructured.paddleocr to 2.8.1.0.
Features
- Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.
Fixes
- Replace pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
- Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
- Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
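The OLE fallback described above might be sketched as follows. This is an illustration under assumptions, not the library's detection code: `detect_ole_type`, the `sniffed` argument (standing in for whatever a content-based detector returned), and the extension map are all hypothetical. The eight-byte signature is the standard OLE compound-file magic number.

```python
from pathlib import Path
from typing import Optional

# Signature shared by all OLE compound files (legacy DOC, XLS, PPT, MSG, ...).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension map, for this sketch only.
OLE_EXTENSIONS = {".doc": "DOC", ".xls": "XLS", ".ppt": "PPT", ".msg": "MSG"}


def detect_ole_type(head: bytes, filename: str, sniffed: Optional[str]) -> Optional[str]:
    """Return a file type for an OLE container, or None if it cannot be determined.

    `head` holds the first bytes of the file; `sniffed` is the result of
    content-based detection, or None when that detection was inconclusive.
    """
    if not head.startswith(OLE_MAGIC):
        return None  # not an OLE file at all
    if sniffed is not None:
        return sniffed  # content-based detection succeeded
    # Content-based detection failed: trust the extension instead of
    # guessing a default such as MSG.
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())


header = OLE_MAGIC + b"\x00" * 8
```

For example, `detect_ole_type(header, "report.DOC", None)` resolves to the DOC entry through the extension map rather than defaulting to another OLE type.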
0.15.7
Enhancements
Features
Fixes
- Fix NLTK data download path to prevent nested directories. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
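The nesting problem can be pictured with a short sketch. The helper name `nltk_download_dir` is hypothetical and this is not the project's actual fix; it only shows the idea of not appending "nltk_data" to a path that already ends with it.

```python
from pathlib import Path


def nltk_download_dir(base: str) -> Path:
    """Return the directory to download NLTK data into (hypothetical helper)."""
    path = Path(base)
    # Appending unconditionally would produce ".../nltk_data/nltk_data"
    # whenever the caller already passed the data directory itself.
    return path if path.name == "nltk_data" else path / "nltk_data"


already_data_dir = nltk_download_dir("/home/user/nltk_data")
parent_dir = nltk_download_dir("/home/user")
```

Both calls resolve to the same directory, so a later check for existing downloads looks in the place where the data was actually stored.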
0.15.6
Enhancements
Features
Fixes
- Bump to NLTK 3.9.x. Bumps to the latest nltk version to resolve a CVE.
- Update CI for ingest-test-fixture-update-pr to resolve NLTK model download errors.
- Synchronized text and html on TableChunk splits. When a Table element is divided during chunking to fit the chunking window, TableChunk.text corresponds exactly with the table text in TableChunk.metadata.text_as_html, .text_as_html is always parseable HTML, and the table is split on even row boundaries whenever possible.
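The TableChunk behavior described above can be approximated with a minimal sketch: split a table only between rows, and derive both the text and the HTML of each chunk from the same rows so the two cannot drift apart. The function names, the row representation, and the character-based window are assumptions for illustration, not the library's chunking code.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]


def rows_to_text(rows: List[Row]) -> str:
    """Plain-text form: cells joined by spaces, rows joined by newlines."""
    return "\n".join(" ".join(row) for row in rows)


def rows_to_html(rows: List[Row]) -> str:
    """Minimal, always well-formed HTML table for the same rows."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def split_table(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Yield (text, html) chunks, breaking only on row boundaries.

    A single row longer than `max_chars` is kept whole in this sketch.
    """
    chunk: List[Row] = []
    for row in rows:
        candidate = chunk + [row]
        if chunk and len(rows_to_text(candidate)) > max_chars:
            # The next row would overflow the window: emit what we have.
            yield rows_to_text(chunk), rows_to_html(chunk)
            chunk = [row]
        else:
            chunk = candidate
    if chunk:
        yield rows_to_text(chunk), rows_to_html(chunk)


chunks = list(split_table([["a", "1"], ["b", "2"], ["c", "3"]], max_chars=7))
```

Because each chunk's HTML is rebuilt from complete rows, no chunk ever contains an unclosed tag, which is the property the note calls "always parseable HTML".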
0.15.5
Enhancements
Features
Fixes
- Revert to using unstructured.pytesseract fork. Due to the unavailability of some recent release versions of pytesseract on PyPI, the project now uses the unstructured.pytesseract fork to ensure stability and continued support.
- Bump libreoffice version in image. Bumps the libreoffice version to 25.2.5.2 to address CVEs.
- Downgrade NLTK dependency version for compatibility. Due to the unavailability of nltk==3.8.2 on PyPI, the NLTK dependency has been downgraded to <3.8.2. This change ensures continued functionality and compatibility.
0.15.4
Enhancements
Features
Fixes
- Resolve an installation error with pytesseract>=0.3.12 that occurred during pip install unstructured[pdf]==0.15.3.
0.15.3
Enhancements
Features
Fixes
- Remove the custom index URL from extra-paddleocr.in to resolve the error in the setup.py configuration.