Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): Added DeltaTableDataSet #243

Merged
merged 21 commits into from
Jul 21, 2023

Conversation

afaqueahmad7117
Copy link
Contributor

@afaqueahmad7117 afaqueahmad7117 commented Jun 19, 2023

Description

This PR adds DeltaTableDataSet to the Kedro datasets plugin. Issue 226.

Development notes

  1. DeltaTableDataSet here kedro-datasets/kedro_datasets/pandas/deltatable_dataset.py
  2. Unit Tests kedro-datasets/tests/pandas/test_deltatable_dataset.py

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: k9 <[email protected]>
@everdark
Copy link
Contributor

everdark commented Jun 19, 2023

Just a side note, there is a bug for the delta-rs python binding for using catalog: delta-io/delta-rs#1466; so right now I'm not able to smoke test the from_data_catalog method to see if it actually works.

@noklam
Copy link
Contributor

noklam commented Jun 19, 2023

@everdark Is this ready to be review now? I saw you have closed the old PR but this one is still in draft.

@everdark
Copy link
Contributor

@everdark Is this ready to be review now? I saw you have closed the old PR but this one is still in draft.

Depending on whether @afaqueahmad7117 is doing some more works. I don't have anything to add from my side as of now. We close the old PR cause there are some mess-up in the commits.

@afaqueahmad7117 afaqueahmad7117 marked this pull request as ready for review June 20, 2023 05:44
@afaqueahmad7117
Copy link
Contributor Author

@noklam @everdark I'm done with my changes. Marked PR ready for review.

@noklam noklam added the Community Issue/PR opened by the open-source community label Jun 20, 2023
@noklam noklam self-requested a review June 20, 2023 10:26
kedro-datasets/.pylintrc Outdated Show resolved Hide resolved
kedro-datasets/setup.py Outdated Show resolved Hide resolved
kedro-datasets/tests/pandas/test_deltatable_dataset.py Outdated Show resolved Hide resolved
@noklam
Copy link
Contributor

noklam commented Jun 20, 2023

Noted that the CI is failing for some other reason, kedro-org/kedro#2673 would fix this, so ignore the irrelevant error or try to install a slightly older kedro version to test it locally.

@astrojuanlu
Copy link
Member

The new dataset does not yet have 100 % coverage, some lines are missing:

kedro_datasets/pandas/deltatable_dataset.py                66      3    95%   153, 211, 216

@everdark
Copy link
Contributor

The new dataset does not yet have 100 % coverage, some lines are missing:

kedro_datasets/pandas/deltatable_dataset.py                66      3    95%   153, 211, 216

Coverage increased to 100% in 2b1c5e8.

Signed-off-by: Kyle Chung <[email protected]>
@astrojuanlu
Copy link
Member

Thanks a lot folks, this is looking really good!

Just to help me contextualize and sorry for the basic question: but could you summarize what is the difference between this DeltaTableDataSet and the existing ManagedTableDataSet? I see the latter can be used with Delta tables too https://kedro.org/blog/managed-delta-tables-kedro-dataset

@everdark
Copy link
Contributor

everdark commented Jul 13, 2023

Thanks a lot folks, this is looking really good!

Just to help me contextualize and sorry for the basic question: but could you summarize what is the difference between this DeltaTableDataSet and the existing ManagedTableDataSet? I see the latter can be used with Delta tables too https://kedro.org/blog/managed-delta-tables-kedro-dataset

The ManagedTableDataSet is for delta table managed by Databricks Unity Catalog. It is a very specific (proprietary) implementation on top of the open source delta table. Our dataset is a generic non-spark solution to handle the open source version of the delta table.

By adding this into kedro-datasets, there will be 3 possible ways of handling delta table:

  1. Apache Spark
  2. delta-rs, a non-Spark approach (this PR)
  3. Databricks Unity Catalog

The user should decide which one best suits their need.

@noklam
Copy link
Contributor

noklam commented Jul 18, 2023

By adding this into kedro-datasets, there will be 3 possible ways of handling delta table:
Apache Spark
delta-rs, a non-Spark approach (this PR)
Databricks Unity Catalog

Agree, although DeltaTable is often associated with Spark, but it's actually just a file format and you can read it via Pandas or maybe other libraries later.

I think the current space is still dominated by Parquet for data processing, Delta files are usually larger due to the compression and version history. IMO the versioning features are quite important and it deserves wider adoption outside of the Spark ecosystem.

I have no idea how the compacting and re-partitioning works with non-spark implementation? This feels like the responsibility of some kind of DB or data processing engine, it's probably too much for the Dataset abstraction. WDYT?

@everdark
Copy link
Contributor

everdark commented Jul 18, 2023

By adding this into kedro-datasets, there will be 3 possible ways of handling delta table:
Apache Spark
delta-rs, a non-Spark approach (this PR)
Databricks Unity Catalog

Agree, although DeltaTable is often associated with Spark, but it's actually just a file format and you can read it via Pandas or maybe other libraries later.

I think the current space is still dominated by Parquet for data processing, Delta files are usually larger due to the compression and version history. IMO the versioning features are quite important and it deserves wider adoption outside of the Spark ecosystem.

I have no idea how the compacting and re-partitioning works with non-spark implementation? This feels like the responsibility of some kind of DB or data processing engine, it's probably too much for the Dataset abstraction. WDYT?

It will be mostly based on Apache Arrow. It will make sure we don't need to load all files into memory.
Advanced use case can be one leveraging DuckDB as the in-memory query engine, and with Arrow as the backend.

Some examples (in terms of analytical query) can be found here:
https://delta-io.github.io/delta-rs/python/usage.html#querying-delta-tables

Copy link
Contributor

@SajidAlamQB SajidAlamQB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @afaqueahmad7117! Your DeltaTableDataSet class looks great.

@afaqueahmad7117
Copy link
Contributor Author

afaqueahmad7117 commented Jul 21, 2023

Thanks for your contribution @afaqueahmad7117! Your DeltaTableDataSet class looks great.

Thank you! @everdark too!

@noklam
Copy link
Contributor

noklam commented Jul 21, 2023

Thank you both! @afaqueahmad7117 @everdark, I will take over from now to fix the conflict and merge.

@noklam
Copy link
Contributor

noklam commented Jul 21, 2023

Moving requirements around as we moved from setup.pu -> pyproject.toml

@astrojuanlu
Copy link
Member

Let's get this merged 🚀 thanks @afaqueahmad7117 @everdark and team!

@noklam noklam merged commit 8fb01ef into kedro-org:main Jul 21, 2023
PtrBld pushed a commit to PtrBld/kedro-plugins that referenced this pull request Aug 27, 2023
* feat: added delta table dataset

Signed-off-by: Afaque Ahmad <[email protected]>

* test: lint

Signed-off-by: k9 <[email protected]>

* chore: adjusted docstring line length

Signed-off-by: Afaque Ahmad <[email protected]>

* chore: fix requirements order

Signed-off-by: k9 <[email protected]>

* chore: add .pylintrc to ignore line too long for url

Signed-off-by: k9 <[email protected]>

* chore: remove invalid noqa comment

Signed-off-by: k9 <[email protected]>

* fix: do not import TableNotFoundError from private module to avoid pylint complains

Signed-off-by: k9 <[email protected]>

* fix: fixed linting issues

Signed-off-by: Afaque Ahmad <[email protected]>

* Move pylintrc to pyproject.toml

Signed-off-by: Nok <[email protected]>

* Fix pylint config

Signed-off-by: Nok <[email protected]>

* test: use mocker fixture to replace unittest.mock

Signed-off-by: k9 <[email protected]>

* chore: lint for line too long

Signed-off-by: k9 <[email protected]>

* test: increase coverage for pandas delta table dataset

Signed-off-by: Kyle Chung <[email protected]>

* chore: lint

Signed-off-by: Kyle Chung <[email protected]>

---------

Signed-off-by: Afaque Ahmad <[email protected]>
Signed-off-by: k9 <[email protected]>
Signed-off-by: Afaque Ahmad <[email protected]>
Signed-off-by: Nok <[email protected]>
Signed-off-by: Kyle Chung <[email protected]>
Co-authored-by: k9 <[email protected]>
Co-authored-by: Nok <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants