feat(datasets): Added `DeltaTableDataSet` #243

afaqueahmad7117 · 2023-06-19T07:14:50Z

Description

This PR adds DeltaTableDataSet to the Kedro datasets plugin. Issue 226.

Development notes

DeltaTableDataSet: kedro-datasets/kedro_datasets/pandas/deltatable_dataset.py
Unit tests: kedro-datasets/tests/pandas/test_deltatable_dataset.py

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes

everdark · 2023-06-19T07:35:12Z

Just a side note: there is a bug in the delta-rs Python binding when using a data catalog (delta-io/delta-rs#1466), so right now I'm not able to smoke-test the from_data_catalog method to see if it actually works.

noklam · 2023-06-19T09:48:47Z

@everdark Is this ready to be reviewed now? I saw you have closed the old PR but this one is still in draft.

everdark · 2023-06-19T09:51:29Z

> @everdark Is this ready to be reviewed now? I saw you have closed the old PR but this one is still in draft.

That depends on whether @afaqueahmad7117 is doing more work. I don't have anything to add from my side as of now. We closed the old PR because the commits were messed up.

afaqueahmad7117 · 2023-06-20T07:25:16Z

@noklam @everdark I'm done with my changes. Marked PR ready for review.

noklam · 2023-06-20T13:07:25Z

Note that the CI is failing for an unrelated reason; kedro-org/kedro#2673 would fix this, so ignore the irrelevant error or install a slightly older Kedro version to test locally.

astrojuanlu · 2023-07-13T08:06:29Z

The new dataset does not yet have 100% coverage; some lines are missing:

kedro_datasets/pandas/deltatable_dataset.py                66      3    95%   153, 211, 216

everdark · 2023-07-13T09:16:15Z

> The new dataset does not yet have 100% coverage; some lines are missing:
> kedro_datasets/pandas/deltatable_dataset.py                66      3    95%   153, 211, 216

Coverage increased to 100% in 2b1c5e8.

astrojuanlu · 2023-07-13T10:12:58Z

Thanks a lot folks, this is looking really good!

Just to help me contextualize, and sorry for the basic question: could you summarize the difference between this DeltaTableDataSet and the existing ManagedTableDataSet? I see the latter can be used with Delta tables too: https://kedro.org/blog/managed-delta-tables-kedro-dataset

everdark · 2023-07-13T10:16:50Z

> Thanks a lot folks, this is looking really good!
>
> Just to help me contextualize, and sorry for the basic question: could you summarize the difference between this DeltaTableDataSet and the existing ManagedTableDataSet? I see the latter can be used with Delta tables too: https://kedro.org/blog/managed-delta-tables-kedro-dataset

The ManagedTableDataSet is for Delta tables managed by Databricks Unity Catalog. It is a very specific (proprietary) implementation on top of the open-source Delta table format. Our dataset is a generic, non-Spark solution for handling open-source Delta tables.

By adding this to kedro-datasets, there will be 3 possible ways of handling Delta tables:

1. Apache Spark
2. delta-rs, a non-Spark approach (this PR)
3. Databricks Unity Catalog

Users should decide which one best suits their needs.

noklam · 2023-07-18T10:51:28Z

> By adding this to kedro-datasets, there will be 3 possible ways of handling Delta tables:
> 1. Apache Spark
> 2. delta-rs, a non-Spark approach (this PR)
> 3. Databricks Unity Catalog

Agreed. Although Delta tables are often associated with Spark, Delta is actually just a file format, and you can read it via pandas or maybe other libraries later.

I think the current space is still dominated by Parquet for data processing; Delta tables are usually larger due to the compression and version history. IMO the versioning features are quite important and deserve wider adoption outside of the Spark ecosystem.

I have no idea how compaction and re-partitioning work with a non-Spark implementation. This feels like the responsibility of some kind of database or data processing engine; it's probably too much for the Dataset abstraction. WDYT?
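
To make the "just a file format with version history" point concrete: a Delta table is a directory of Parquet data files plus a `_delta_log` folder of numbered JSON commits, and each commit lists `add` and `remove` actions. The sketch below replays such a log with the standard library only to find which data files belong to a given version. It is a simplification for illustration (it ignores checkpoints and every action field except `path`), and the helper names `active_files` and `write_commit` are hypothetical, not part of delta-rs or Kedro.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_dir, version=None):
    """Replay JSON commits in _delta_log and return the data files of a version."""
    files = set()
    # Commit file names are zero-padded, so lexical order equals version order.
    for commit in sorted((Path(table_dir) / "_delta_log").glob("*.json")):
        if version is not None and int(commit.stem) > version:
            break
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


def write_commit(table_dir, version, actions):
    """Hypothetical helper: write one simplified commit file for the demo."""
    log_dir = Path(table_dir) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    body = "\n".join(json.dumps(action) for action in actions)
    (log_dir / f"{version:020d}.json").write_text(body)


with tempfile.TemporaryDirectory() as tmp:
    write_commit(tmp, 0, [{"add": {"path": "part-0.parquet"}}])
    write_commit(tmp, 1, [{"add": {"path": "part-1.parquet"}}])
    # An overwrite-style commit: one file is logically removed, one is added.
    write_commit(tmp, 2, [
        {"remove": {"path": "part-0.parquet"}},
        {"add": {"path": "part-2.parquet"}},
    ])
    files_at_v1 = active_files(tmp, version=1)
    files_latest = active_files(tmp)
```

Removed files stay on disk until they are cleaned up, which is what makes older versions readable and also why a versioned table tends to take more space than a single Parquet snapshot.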

everdark · 2023-07-18T11:31:36Z

> > By adding this to kedro-datasets, there will be 3 possible ways of handling Delta tables:
> > 1. Apache Spark
> > 2. delta-rs, a non-Spark approach (this PR)
> > 3. Databricks Unity Catalog
>
> Agreed. Although Delta tables are often associated with Spark, Delta is actually just a file format, and you can read it via pandas or maybe other libraries later.
>
> I think the current space is still dominated by Parquet for data processing; Delta tables are usually larger due to the compression and version history. IMO the versioning features are quite important and deserve wider adoption outside of the Spark ecosystem.
>
> I have no idea how compaction and re-partitioning work with a non-Spark implementation. This feels like the responsibility of some kind of database or data processing engine; it's probably too much for the Dataset abstraction. WDYT?

It will be mostly based on Apache Arrow, which ensures we don't need to load all files into memory.
A more advanced use case is to use DuckDB as the in-memory query engine, with Arrow as the backend.

Some examples (in terms of analytical queries) can be found here:
https://delta-io.github.io/delta-rs/python/usage.html#querying-delta-tables

SajidAlamQB

Thanks for your contribution @afaqueahmad7117! Your DeltaTableDataSet class looks great.

afaqueahmad7117 · 2023-07-21T10:57:06Z

> Thanks for your contribution @afaqueahmad7117! Your DeltaTableDataSet class looks great.

Thank you! @everdark too!

noklam · 2023-07-21T11:10:18Z

Thank you both, @afaqueahmad7117 @everdark! I will take over from here to fix the conflict and merge.

noklam · 2023-07-21T11:18:49Z

Moving requirements around as we moved from setup.py to pyproject.toml.

astrojuanlu · 2023-07-21T12:05:51Z

Let's get this merged 🚀 thanks @afaqueahmad7117 @everdark and team!

* feat: added delta table dataset
  Signed-off-by: Afaque Ahmad <[email protected]>
* test: lint
  Signed-off-by: k9 <[email protected]>
* chore: adjusted docstring line length
  Signed-off-by: Afaque Ahmad <[email protected]>
* chore: fix requirements order
  Signed-off-by: k9 <[email protected]>
* chore: add .pylintrc to ignore line too long for url
  Signed-off-by: k9 <[email protected]>
* chore: remove invalid noqa comment
  Signed-off-by: k9 <[email protected]>
* fix: do not import TableNotFoundError from private module to avoid pylint complains
  Signed-off-by: k9 <[email protected]>
* fix: fixed linting issues
  Signed-off-by: Afaque Ahmad <[email protected]>
* Move pylintrc to pyproject.toml
  Signed-off-by: Nok <[email protected]>
* Fix pylint config
  Signed-off-by: Nok <[email protected]>
* test: use mocker fixture to replace unittest.mock
  Signed-off-by: k9 <[email protected]>
* chore: lint for line too long
  Signed-off-by: k9 <[email protected]>
* test: increase coverage for pandas delta table dataset
  Signed-off-by: Kyle Chung <[email protected]>
* chore: lint
  Signed-off-by: Kyle Chung <[email protected]>

---------

Signed-off-by: Afaque Ahmad <[email protected]>
Signed-off-by: k9 <[email protected]>
Signed-off-by: Afaque Ahmad <[email protected]>
Signed-off-by: Nok <[email protected]>
Signed-off-by: Kyle Chung <[email protected]>
Co-authored-by: k9 <[email protected]>
Co-authored-by: Nok <[email protected]>

feat: added delta table dataset

4f216e5

Signed-off-by: Afaque Ahmad <[email protected]>

afaqueahmad7117 mentioned this pull request Jun 19, 2023

feat(datasets): Add DeltaTableDataSet #242

Closed

4 tasks

test: lint

2cb5255

Signed-off-by: k9 <[email protected]>

afaqueahmad7117 added 2 commits June 19, 2023 15:47

chore: adjusted docstring line length

e14f2c7

Signed-off-by: Afaque Ahmad <[email protected]>

Merge branch 'feat/non-spark-delta-dataset' of https://github.com/afa…

96f3c28

…queahmad7117/kedro-plugins into feat/non-spark-delta-dataset Signed-off-by: Afaque Ahmad <[email protected]>

afaqueahmad7117 marked this pull request as ready for review June 20, 2023 05:44

everdark and others added 5 commits June 20, 2023 16:23

chore: fix requirements order

4913f51

Signed-off-by: k9 <[email protected]>

chore: add .pylintrc to ignore line too long for url

72b9f11

Signed-off-by: k9 <[email protected]>

chore: remove invalid noqa comment

c768e3a

Signed-off-by: k9 <[email protected]>

fix: do not import TableNotFoundError from private module to avoid py…

57bf450

…lint complains Signed-off-by: k9 <[email protected]>

fix: fixed linting issues

82ad7e5

Signed-off-by: Afaque Ahmad <[email protected]>

noklam added the Community Issue/PR opened by the open-source community label Jun 20, 2023

noklam self-requested a review June 20, 2023 10:26

noklam added 2 commits June 20, 2023 10:40

Move pylintrc to pyproject.toml

2129eca

Signed-off-by: Nok <[email protected]>

Fix pylint config

f85449c

Signed-off-by: Nok <[email protected]>

noklam reviewed Jun 20, 2023

everdark added 2 commits June 21, 2023 10:35

test: use mocker fixture to replace unittest.mock

137fa5a

Signed-off-by: k9 <[email protected]>

chore: lint for line too long

0ddd544

Signed-off-by: k9 <[email protected]>

afaqueahmad7117 requested a review from noklam June 28, 2023 02:12

Merge branch 'main' into feat/non-spark-delta-dataset

bff8fdd

noklam requested review from astrojuanlu and datajoely June 28, 2023 10:30

datajoely reviewed Jun 28, 2023

noklam approved these changes Jun 28, 2023

Merge branch 'main' into feat/non-spark-delta-dataset

f626e83

Signed-off-by: Afaque Ahmad <[email protected]>

Merge branch 'main' into feat/non-spark-delta-dataset

7e2f50a

test: increase coverage for pandas delta table dataset

2b1c5e8

Signed-off-by: Kyle Chung <[email protected]>

chore: lint

021e995

Signed-off-by: Kyle Chung <[email protected]>

afaqueahmad7117 added 2 commits July 14, 2023 20:45

Merge branch 'main' into feat/non-spark-delta-dataset

c7c74f3

Merge branch 'main' into feat/non-spark-delta-dataset

1d449d7

deepyaman mentioned this pull request Jul 19, 2023

[DRAFT] Separate file format from processing engine in datasets #273

Open

SajidAlamQB approved these changes Jul 21, 2023

Merge branch 'main' into feat/non-spark-delta-dataset

85dd6c7

Signed-off-by: Nok <[email protected]>

noklam merged commit 8fb01ef into kedro-org:main Jul 21, 2023

astrojuanlu mentioned this pull request Feb 7, 2024

[spike] Clarify status of various Delta Table datasets #542

Open
