0.10.29

yuming-long released this 07 Nov 01:34

Enhancements

Add include_header argument for partition_csv and partition_tsv: Header rows can now be retained when partitioning CSV and TSV documents into elements.
Add retry logic for all source connectors: All HTTP calls made by the ingest source connectors have been isolated and wrapped in the SourceConnectionNetworkError custom error, which triggers the retry logic in the ingest pipeline if it is enabled.
Google Drive source connector supports credentials from memory: Originally, the connector expected a file path from which to read the credentials when creating the client. It now also accepts the same information from memory as a dict, for cases where the file system might not be accessible.
Add support for generic partition configs in ingest cli: Along with the explicit partition options supported by the CLI, an additional_partition_args argument was added so that users can pass in any other arguments to be forwarded when calling partition(). This keeps changes to the input parameters of partition() accessible from the CLI.
Map full output schema for table-based destination connectors: A full schema was introduced to map the types of all output content from the JSON partition output onto a flattened table structure that table-based destination connectors can use. So far, the delta table destination connector has been updated to take advantage of this.
Incorporate multiple embedding model options into ingest, add diff test embeddings: Problem: The ingest pipeline already supported embedding, but users might want to use different embedding providers. Enhancement: Extend the ingest pipeline so that users can choose a particular embedding provider from a range of options and embed with it. Also adds a diff test that compares the output of an embedding module with the expected output.
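
The retry enhancement in this list can be pictured with a small standard-library sketch. It is not the connector code from this release: the SourceConnectionNetworkError class below is a local stand-in that only reuses the name from the notes, and retry_on_network_error, fetch, max_attempts and base_delay are hypothetical names chosen for illustration.

```python
import time
from functools import wraps


class SourceConnectionNetworkError(Exception):
    """Local stand-in for the custom error named in the release notes."""


def retry_on_network_error(max_attempts=3, base_delay=0.0):
    """Hypothetical decorator: retry only when the custom error is raised."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except SourceConnectionNetworkError:
                    if attempt == max_attempts:
                        raise  # attempts exhausted, surface the error
                    # Exponential backoff between attempts.
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


def fetch(http_call):
    """Isolate a network call and re-raise low-level failures as the custom error."""
    try:
        return http_call()
    except (ConnectionError, TimeoutError) as exc:
        raise SourceConnectionNetworkError(str(exc)) from exc
```

Wrapping each network call in one well-known error type lets the retry layer stay generic: it does not need to know which client library a given connector uses, only that a retryable failure surfaces as that error.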

Features

Allow setting table crop parameter: In certain circumstances, adjusting the table crop padding may improve table extraction.

Fixes

Fixes partition_text to prevent empty elements: Adds a check to filter out empty bullets.
Handle empty string for ocr_languages with values for languages: Some API users ran into an issue when sending the languages parameter because the API also defaulted ocr_languages to an empty string. This update handles the case where languages is defined and ocr_languages is an empty string.
Fix PDF tried to loop through None: Previously, the PDF annotation extraction tried to loop through annots that resolved to None. A check was added to avoid this error.
Ingest session handler not being shared correctly: All ingest docs that leverage the session handler should only need to set it once per process. It was being recreated each time because the right values were neither set nor available, given how dataclasses work in Python.
Ingest download-only fix: Previously, the download-only flag was checked after the doc factory pipeline step, which occurs before the files are actually downloaded by the source node. The check was moved to after the source node so that the files are downloaded before the pipeline exits.
Fix flaky chunk-metadata: The prior implementation was sensitive to element order in the section, so metadata values were sometimes dropped. Also, not all metadata items can be consolidated across multiple elements (e.g. coordinates), so these are now dropped from the consolidated metadata.
Fix tesseract error "Estimating resolution as X" caused by invalid language parameter input: Proceed with the default language eng when lang.py fails to find a valid language code for tesseract, so that an empty string is not passed to the tesseract CLI, where it would raise an exception downstream.
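
The fallback described in the tesseract fix can be sketched in a few lines. This is an illustration under stated assumptions rather than the code in lang.py: tesseract_lang_arg, requested, supported and default are hypothetical names, and the only tesseract behavior relied on is that multiple languages are joined with "+" in the language argument.

```python
def tesseract_lang_arg(requested, supported, default="eng"):
    """Build a tesseract language argument, never returning an empty string.

    Hypothetical helper: keeps only the requested codes that are supported
    and falls back to the default code when none remain.
    """
    valid = [code for code in requested if code in supported]
    # Tesseract accepts several languages joined with "+", e.g. "eng+deu".
    return "+".join(valid) if valid else default
```

For example, with a supported set of eng and deu, a request containing only an unknown code resolves to the default eng instead of an empty string.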
