feat: Add metadata to ingestion routes #2008

Open. NathanLenas wants to merge 5 commits into main.

Description

This PR adds the ability to specify additional metadata in text and file ingestion routes.

This feature might be useful for people building custom UIs on top of PrivateGPT who want to attach additional information to document sources, or for people using a custom ingestion pipeline whose readers make use of the metadata. It could also be extended to bulk ingestion, but this is a good start.

This is an important part of the project, so I wanted to make sure nothing breaks.
I added two unit tests, one for the text ingestion route and one for the file ingestion route.
I also tested all of the ingestion modes to make sure nothing breaks.

Type of Change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Added new unit/integration tests
  • I stared at the code and made sure it makes sense

Test Configuration:

  • Firmware version: 0.5

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • [] Any dependent changes have been merged and published in downstream modules
  • I ran make check; make test to ensure mypy and tests pass

self,
file_name: str,
file_data: Path,
file_metadata: dict[str, str] | None = None,
Collaborator

I think it would be better to type this as dict[str, Any] | None.

Author

This is something I considered because of LlamaIndex's restriction on documents: metadata values must be flat (string, int, or float; see https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_documents/#metadata). Typing the values as str makes it "foolproof". Let me know if you would prefer Any and I will change it.
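
For illustration, the flat-value restriction described in this comment could be checked with a small helper. This is only a sketch: `is_flat_metadata` is a hypothetical name and is not part of this PR, PrivateGPT, or LlamaIndex.

```python
from __future__ import annotations

from typing import Any


def is_flat_metadata(metadata: dict[str, Any]) -> bool:
    """Return True if every key is a string and every value is a str, int, or float.

    Hypothetical helper for illustration only. Note that bool values also
    pass, because bool is a subclass of int in Python.
    """
    return all(
        isinstance(key, str) and isinstance(value, (str, int, float))
        for key, value in metadata.items()
    )
```

With this check, `{"author": "x", "page": 3}` would be accepted, while a nested value such as `{"tags": ["a", "b"]}` would be rejected.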

Collaborator

I believe it is better to convert the data when we store it in the vector store than to constrain the whole application with this decision. What do you think, @imartinez?
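
One way to read this suggestion is to accept dict[str, Any] at the API boundary and coerce values only at storage time. The sketch below assumes non-flat values are serialized to JSON text; `to_flat_metadata` is a hypothetical helper and not code from this PR.

```python
from __future__ import annotations

import json
from typing import Any, Union

# Value types treated as "flat" in this sketch.
FlatValue = Union[str, int, float]


def to_flat_metadata(metadata: dict[str, Any] | None) -> dict[str, FlatValue]:
    """Coerce arbitrary metadata values to flat ones just before storage.

    Hypothetical helper: str/int/float values are kept as-is, and anything
    else (lists, dicts, None, ...) is serialized to a JSON string.
    """
    flat: dict[str, FlatValue] = {}
    for key, value in (metadata or {}).items():
        if isinstance(value, (str, int, float)):
            flat[str(key)] = value
        else:
            # default=str keeps the sketch from failing on objects json cannot encode.
            flat[str(key)] = json.dumps(value, default=str)
    return flat
```

The trade-off is that callers keep a permissive type, while the stored representation of nested values becomes a string that readers have to parse back.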

private_gpt/server/ingest/ingest_router.py (outdated, resolved)
files = {
"file": (path.name, path.open("rb")),
"metadata": (None, json.dumps(metadata)),
}
Collaborator

Same as below. Ideally, it should look something like this:

files = {
    "file": (path.name, path.open("rb")),
    "metadata": metadata,
}
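
For context on the two variants in this thread: a multipart form field carries text, so the test under review serializes the dict with json.dumps. Assuming the route receives that field as a JSON string, the round trip could be sketched as follows. Both function names are hypothetical and are not taken from this PR.

```python
from __future__ import annotations

import json
from typing import Any


def encode_metadata_field(metadata: dict[str, Any] | None) -> str | None:
    """Client side: serialize metadata to the JSON text sent in the form field."""
    return None if metadata is None else json.dumps(metadata)


def decode_metadata_field(raw: str | None) -> dict[str, Any] | None:
    """Server side: parse the form field back into a dict (hypothetical helper)."""
    if raw is None or raw == "":
        return None
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("metadata must be a JSON object")
    return parsed
```

Whether the dict can be passed without explicit serialization, as suggested above, depends on how the HTTP client and the route handle the field; this sketch does not settle that.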

2 participants