community: Enhance MongoDBLoader with flexible metadata and optimized field extraction #23376

comsa33 · 2024-06-25T01:22:52Z

Description:

This pull request significantly enhances the MongodbLoader class in the LangChain community package by adding robust metadata customization and improved field extraction capabilities. The updated class now allows users to specify additional metadata fields through the metadata_names parameter, enabling the extraction of both top-level and deeply nested document attributes as metadata. This flexibility is crucial for users who need to include detailed contextual information without altering the database schema.

Moreover, the include_db_collection_in_metadata flag offers optional inclusion of database and collection names in the metadata, allowing for even greater customization depending on the user's needs.

The loader's field extraction logic has been refined to handle missing or nested fields more gracefully. It now employs a safe access mechanism that avoids the KeyError previously encountered when a specified nested field was absent in a document. This update ensures that the loader can handle diverse and complex data structures without failure, making it more resilient and user-friendly.

Issue:

This pull request addresses a critical issue where the MongodbLoader class in the LangChain community package could throw a KeyError when attempting to access nested fields that may not exist in some documents. The previous implementation did not handle the absence of specified nested fields gracefully, leading to runtime errors and interruptions in data processing workflows.

This enhancement ensures robust error handling by safely accessing nested document fields, using default values for missing data, thus preventing KeyError and ensuring smoother operation across various data structures in MongoDB. This improvement is crucial for users working with diverse and complex data sets, ensuring the loader can adapt to documents with varying structures without failing.
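
The safe-access behaviour described above can be sketched as follows. This is a minimal illustration, not the loader's actual code: the helper name `get_nested` and its `default` parameter are hypothetical, and dotted paths are assumed to address nested dictionaries only.

```python
from typing import Any, Dict


def get_nested(doc: Dict[str, Any], dotted_path: str, default: Any = "") -> Any:
    """Walk a dotted path such as "author.name"; return `default` if a step is missing.

    Hypothetical helper for illustration only.
    """
    current: Any = doc
    for key in dotted_path.split("."):
        # Stop as soon as the path leaves a dict or the key is absent,
        # instead of letting current[key] raise KeyError or TypeError.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


doc = {"title": "AI in Medicine", "author": {"name": "John Doe"}}
print(get_nested(doc, "author.name"))               # John Doe
print(repr(get_nested(doc, "author.email")))        # ''
print(get_nested(doc, "title.text", default=None))  # None
```

A plain `doc["author"]["email"]` on the same document would raise KeyError, which is the failure mode this pull request describes.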

Dependencies:

Requires motor for asynchronous MongoDB interaction.

Twitter handle:

N/A

Add tests and docs

Tests: Unit tests have been added to verify that the metadata inclusion toggle works as expected and that the field extraction correctly handles nested fields.
Docs: An example notebook demonstrating the use of the enhanced MongodbLoader is included in the docs/docs/integrations directory. This notebook includes setup instructions, example usage, and outputs. (Notebook link: https://colab.research.google.com/drive/1tp7nyUnzZa3dxEFF4Kc3KS7ACuNF6jzH?usp=sharing)

Lint and test

Before submitting, I ran make format, make lint, and make test as per the contribution guidelines. All tests pass, and the code style adheres to the LangChain standards.

import unittest
from unittest.mock import patch, MagicMock
import asyncio
from langchain_community.document_loaders.mongodb import MongodbLoader

class TestMongodbLoader(unittest.TestCase):
    def setUp(self):
        """Setup the MongodbLoader test environment by mocking the motor client 
        and database collection interactions."""
        # Mocking the AsyncIOMotorClient
        self.mock_client = MagicMock()
        self.mock_db = MagicMock()
        self.mock_collection = MagicMock()

        self.mock_client.get_database.return_value = self.mock_db
        self.mock_db.get_collection.return_value = self.mock_collection

        # Initialize the MongodbLoader with test data
        self.loader = MongodbLoader(
            connection_string="mongodb://localhost:27017",
            db_name="testdb",
            collection_name="testcol"
        )

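    # Patch the client where it is defined. This assumes the loader imports motor
    # inside __init__, so the name is not an attribute of the loader module.
    @patch("motor.motor_asyncio.AsyncIOMotorClient", return_value=MagicMock())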
    def test_constructor(self, mock_motor_client):
        """Test if the constructor properly initializes with the correct database and collection names."""
        loader = MongodbLoader(
            connection_string="mongodb://localhost:27017",
            db_name="testdb",
            collection_name="testcol"
        )
        self.assertEqual(loader.db_name, "testdb")
        self.assertEqual(loader.collection_name, "testcol")

    def test_aload(self):
        """Test the aload method to ensure it correctly queries and processes documents."""
        # Assumption: the loader keeps its collection handle in `self.collection`,
        # awaits count_documents(), and iterates find() with `async for`.
        async def fake_count_documents(*args, **kwargs):
            return 1

        self.mock_collection.count_documents = fake_count_documents
        # MagicMock supports async iteration through __aiter__.return_value.
        self.mock_collection.find.return_value.__aiter__.return_value = [
            {"_id": "1", "content": "Test document content"}
        ]
        self.loader.collection = self.mock_collection
        self.loader.field_names = ["content"]

        # asyncio.run creates and closes a fresh event loop for the coroutine.
        results = asyncio.run(self.loader.aload())
        self.assertEqual(len(results), 1)
        self.assertEqual(results[0].page_content, "Test document content")

    def test_construct_projection(self):
        """Verify that the projection dictionary is constructed correctly based on field names."""
        self.loader.field_names = ['content', 'author']
        self.loader.metadata_names = ['timestamp']
        expected_projection = {'content': 1, 'author': 1, 'timestamp': 1}
        projection = self.loader._construct_projection()
        self.assertEqual(projection, expected_projection)

if __name__ == '__main__':
    unittest.main()
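
For readers unfamiliar with MongoDB projections, the behaviour that test_construct_projection checks can be sketched in isolation. The function `construct_projection` below is a hypothetical stand-in written for illustration, not the loader's `_construct_projection`; it assumes that content fields and metadata fields are simply merged into one inclusion projection.

```python
from typing import Dict, Optional, Sequence


def construct_projection(
    field_names: Sequence[str], metadata_names: Sequence[str]
) -> Optional[Dict[str, int]]:
    """Build a MongoDB inclusion projection ({field: 1, ...}) from both name lists."""
    all_fields = list(field_names) + list(metadata_names)
    # None stands for "no projection", i.e. the query returns whole documents.
    return {name: 1 for name in all_fields} if all_fields else None


print(construct_projection(["content", "author"], ["timestamp"]))
# {'content': 1, 'author': 1, 'timestamp': 1}
print(construct_projection([], []))
# None
```

Requesting only the listed fields keeps the query from transferring parts of each document that the loader would discard anyway.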

Additional Example for Documentation

Sample Data:

[
    {
        "_id": "1",
        "title": "Artificial Intelligence in Medicine",
        "content": "AI is transforming the medical industry by providing personalized medicine solutions.",
        "author": {
            "name": "John Doe",
            "email": "[email protected]"
        },
        "tags": ["AI", "Healthcare", "Innovation"]
    },
    {
        "_id": "2",
        "title": "Data Science in Sports",
        "content": "Data science provides insights into player performance and strategic planning in sports.",
        "author": {
            "name": "Jane Smith",
            "email": "[email protected]"
        },
        "tags": ["Data Science", "Sports", "Analytics"]
    }
]

Example Code:

from langchain_community.document_loaders.mongodb import MongodbLoader

loader = MongodbLoader(
    connection_string="mongodb://localhost:27017",
    db_name="example_db",
    collection_name="articles",
    filter_criteria={"tags": "AI"},
    field_names=["title", "content"],
    metadata_names=["author.name", "author.email"],
    include_db_collection_in_metadata=True
)

documents = loader.load()

for doc in documents:
    print("Page Content:", doc.page_content)
    print("Metadata:", doc.metadata)

Expected Output:

Page Content: Artificial Intelligence in Medicine AI is transforming the medical industry by providing personalized medicine solutions.
Metadata: {'author_name': 'John Doe', 'author_email': '[email protected]', 'database': 'example_db', 'collection': 'articles'}
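
The mapping from the sample data to this output can be reproduced without a database. The sketch below is an illustration under stated assumptions, not the loader's implementation: `extract` and `to_page_and_metadata` are hypothetical names, dots in metadata keys are assumed to be replaced with underscores (as in the output above), and field values are assumed to be joined with single spaces. The `author.affiliation` path is deliberately absent from the document to show the default value.

```python
from typing import Any, Dict, Sequence, Tuple


def extract(doc: Dict[str, Any], dotted_path: str, default: Any = "") -> Any:
    """Return the value at a dotted path, or `default` when any step is missing."""
    current: Any = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def to_page_and_metadata(
    doc: Dict[str, Any],
    field_names: Sequence[str],
    metadata_names: Sequence[str],
    db_name: str,
    collection_name: str,
    include_db_collection: bool = True,
) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: build page content and metadata from one document."""
    page_content = " ".join(str(extract(doc, name)) for name in field_names)
    # "author.name" becomes "author_name" in the metadata keys.
    metadata = {name.replace(".", "_"): extract(doc, name) for name in metadata_names}
    if include_db_collection:
        metadata.update({"database": db_name, "collection": collection_name})
    return page_content, metadata


sample = {
    "_id": "1",
    "title": "Artificial Intelligence in Medicine",
    "content": "AI is transforming the medical industry by providing "
    "personalized medicine solutions.",
    "author": {"name": "John Doe"},
    "tags": ["AI", "Healthcare", "Innovation"],
}

page, meta = to_page_and_metadata(
    sample,
    field_names=["title", "content"],
    metadata_names=["author.name", "author.affiliation"],
    db_name="example_db",
    collection_name="articles",
)
print("Page Content:", page)
print("Metadata:", meta)
```

Passing `include_db_collection=False` in this sketch leaves only the extracted fields in the metadata, which mirrors the purpose of the include_db_collection_in_metadata flag.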

Thank you.

Additional guidelines:

Make sure optional dependencies are imported within a function.
Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests.
Most PRs should not touch more than one package.
Changes should be backwards compatible.
If you are adding something to community, do not re-import it in langchain.

If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.

Resolved an issue in the PDFMinerParser where the PDFObjRef objects were not being correctly handled as iterable, causing a TypeError. This was occurring because the `get_pages` function in the PDFMiner library returns page objects that might include PDFObjRef types, which are not directly iterable.

Modifications include:
- Using `resolve1` from pdfminer.pdfinterp to properly interpret PDFObjRef objects before attempting to iterate over them.
- Added checks to ensure that the mediabox attributes, when present, are correctly processed as lists of resolved values.

These changes ensure that the PDFMinerParser can handle PDF documents more robustly, preventing runtime errors when processing PDF files with complex structures or unusual attributes.

comsa33 · 2024-07-08T06:29:00Z

I've added a unit test for MongodbLoader, but there's an error related to "libs/community/langchain_community/document_loaders/parsers/pdf.py" and I don't know why.
@ccurme

libs/community/langchain_community/document_loaders/mongodb.py

Jibola

LGTM!

Look out for: #25908
Depending on whose change comes in first, there will be some merge conflicts.

comsa33 added 2 commits June 25, 2024 10:17

Merge pull request #2 from comsa33/comsa33-patch-2

d04e505

community: Enhance MongoDBLoader with flexible metadata and optimized…

comsa33 added 7 commits June 25, 2024 10:30

update: remove blanks, add comma for lint

23ea2f6

update: correction for lint

9a73742

update: line too long 92 > 88

9e9a3ac

Update: indentation

1f767aa

Update: fix lint style

e45202c

Update: remove space

9a368fd

Update: Sequence to List object(immutable to mutable)

bbecaeb

vercel bot deployed to Preview June 25, 2024 02:21 View deployment

comsa33 added 2 commits June 25, 2024 11:42

Update mongodb.py

667cef3

Update mongodb.py

cc28f09

vercel bot deployed to Preview June 25, 2024 03:11 View deployment

comsa33 added 2 commits June 25, 2024 14:00

Update: metadata field name format

4715493

Return metadata field names in the following format:
- existing: data.job.detail
- change: data_job_detail

update: change single quote to double

0d474df

vercel bot deployed to Preview June 25, 2024 05:34 View deployment

comsa33 added 2 commits June 26, 2024 09:50

Merge branch 'master' into master

ffc24e3

Update mongodb.py

aa52985

comsa33 mentioned this pull request Jun 26, 2024

community: Enhance MongoDBLoader with flexible metadata and optimized field extraction #23073

Closed

comsa33 added 3 commits June 26, 2024 10:04

Update mongodb.py

df90c9c

Update mongodb.py

8ccbfce

Update mongodb.py

90f2eb0

vercel bot deployed to Preview June 26, 2024 01:29 View deployment

Merge branch 'master' into master

4c00fe6

Update test_mongodb.py

abdd67b

dosubot bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) and removed the size:M label (this PR changes 30-99 lines, ignoring generated files) Jul 8, 2024

Update test_mongodb.py

d25e186

vercel bot deployed to Preview July 8, 2024 02:10 View deployment

Update test_mongodb.py

7571242

vercel bot deployed to Preview July 8, 2024 04:44 View deployment

comsa33 added 9 commits July 8, 2024 13:54

Update test_mongodb.py

583a186

Update test_mongodb.py

3267641

Update test_mongodb.py

a57ba13

Update test_mongodb.py

6bdef29

Update test_mongodb.py

8ddbc78

Update test_mongodb.py

d0017d8

Update test_mongodb.py

f15fb42

Update test_mongodb.py

35d00fc

Update test_mongodb.py

cf9047b

vercel bot deployed to Preview July 8, 2024 06:03 View deployment

comsa33 added 2 commits July 8, 2024 15:20

Update pdf.py

ce4f061

vercel bot deployed to Preview July 8, 2024 06:54 View deployment

Merge branch 'master' into master

62bb390

vercel bot deployed to Preview July 11, 2024 14:16 View deployment

efriis assigned ccurme Sep 3, 2024

Jibola approved these changes Sep 4, 2024

libs/community/langchain_community/document_loaders/mongodb.py

Jibola reviewed Sep 4, 2024

ccurme approved these changes Sep 17, 2024

dosubot bot added the lgtm label (PR looks good; used to confirm that a PR is ready for merging) Sep 17, 2024

ccurme merged commit 0a177ec into langchain-ai:master Sep 17, 2024
42 checks passed
