feat: Add ByteStream metadata and other metadata to Documents created by HTMLToDocument #6304

Merged
merged 8 commits on Nov 21, 2023

Conversation

awinml
Contributor

@awinml awinml commented Nov 14, 2023

Related Issues

Proposed Changes:

  • Adds ByteStream metadata to Document metadata on creation of documents.
  • Adds an optional meta parameter to the run() method that lets users pass additional metadata to the Documents.
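
The two changes above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Haystack implementation: the ByteStream and Document classes are hypothetical stand-ins, convert is a made-up function name, and letting caller-supplied meta win over the source's own metadata on key collisions is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ByteStream:
    """Hypothetical stand-in: raw bytes plus metadata describing their origin."""
    data: bytes
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    """Hypothetical stand-in for the document a converter produces."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def convert(
    sources: List[ByteStream],
    meta: Optional[List[Dict[str, Any]]] = None,
) -> List[Document]:
    """Create one Document per source, carrying over and extending metadata."""
    if not meta:
        # No extra metadata supplied: use an empty dict for every source.
        meta = [{} for _ in sources]
    elif len(meta) != len(sources):
        raise ValueError("meta must contain one entry per source")

    documents = []
    for source, extra in zip(sources, meta):
        # Assumption: caller-supplied keys override the source's metadata.
        merged = {**source.metadata, **extra}
        documents.append(Document(content=source.data.decode("utf-8"), meta=merged))
    return documents
```

Calling convert with a source whose metadata is {"url": "https://example.com"} and meta=[{"lang": "en"}] yields a single Document whose meta contains both keys.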

How did you test it?

Unit tests, Integration tests and manual verification.

Notes for the reviewer

Usage examples showcasing the various methods for adding metadata to the documents can be viewed in this Colab Notebook. It also has an example utilizing HTMLToDocument with LinkContentFetcher.

Checklist

@awinml awinml requested a review from a team as a code owner November 14, 2023 13:53
@awinml awinml requested review from vblagoje and removed request for a team November 14, 2023 13:53
@github-actions github-actions bot added the topic:tests, 2.x (Related to Haystack v2.0), and type:documentation (Improvements on the docs) labels Nov 14, 2023
@awinml awinml requested a review from a team as a code owner November 14, 2023 13:57
@TuanaCelik
Member

Hey @awinml thanks so much for the contribution 🚀
I'll be back in 1.5 weeks, but in the meantime let's see if someone from our team can review this PR.

@vblagoje
Member

Hey @awinml, this looks like a good contribution; thank you! I'm wondering if you spoke with someone who asked you to add progress bars. As far as I know, we don't add them (at least for now), and having only this component with progress bars would introduce inconsistencies. If you added progress bars on your own initiative, would you please update this PR to remove them?

@awinml
Contributor Author

awinml commented Nov 15, 2023

Thanks! @vblagoje

I saw that TextFileToDocument has a similar implementation for progress bars. I had also added one to MarkdownToDocument (#6159), so I thought this would be a good addition to HTMLToDocument.

I understand how this would introduce inconsistencies with the other file_converters, so I'll update the PR to remove it for now.

Contributor

@agnieszka-m agnieszka-m left a comment

Documentation-wise, looks good!

Member

@vblagoje vblagoje left a comment

Is there a reason these unit tests are marked as integration? If not, would you please convert them to unit tests by removing the existing markers? In the future, tests will be classified as unit tests by default unless marked as integration tests.

Additionally, consider using truthiness checks (e.g., if some_variable:) instead of is not None checks, if that's applicable. It seems like the code would work with both, but I wanted to double-check with you. Truthiness checks are also more compact and easier to read.
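
Whether the two styles are interchangeable depends on how empty values should be treated: an empty dict or list is not None, but it is falsy. A small sketch of the difference, where describe is a hypothetical helper written only for this illustration:

```python
from typing import Any, Dict, Optional


def describe(meta: Optional[Dict[str, Any]]) -> Dict[str, bool]:
    """Report how each style of check evaluates the same value."""
    return {
        "truthy": bool(meta),        # what `if meta:` tests
        "not_none": meta is not None,  # what `if meta is not None:` tests
    }


# The two checks agree for None and for a non-empty dict,
# and disagree only for an empty dict.
for value in (None, {}, {"lang": "en"}):
    print(value, describe(value))
```

If an empty metadata dict should be handled the same way as a missing one, the truthiness check is sufficient; if the two cases must be distinguished, is not None is still needed.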

@awinml awinml requested a review from vblagoje November 21, 2023 15:30
@awinml
Contributor Author

awinml commented Nov 21, 2023

@vblagoje, you're right! I've made the updates you suggested: I have marked the tests as unit tests and replaced the is not None checks with truthiness checks, as you recommended.

Member

@vblagoje vblagoje left a comment

Ok, looks good, let's 🚢 this one @awinml

@vblagoje vblagoje merged commit e6c8374 into deepset-ai:main Nov 21, 2023
22 checks passed
vblagoje pushed a commit that referenced this pull request Nov 22, 2023
…ated by `HTMLToDocument` (#6304)

* Refactor HTMLToDocument

* Add release notes

* Add additional tests

* remove progress bar

* Add additional test for metadata

* remove progress bar from release notes

* Update tests

* Use truthiness checks instead of is not None
@awinml awinml deleted the add_bytestream_meta_to_HTMLToDoc branch November 29, 2023 17:15
Successfully merging this pull request may close these issues.

HTMLToDocument to add ByteStream metadata to Document
4 participants