feat(datasets): Added DeltaTableDataSet
#243
Signed-off-by: Afaque Ahmad <[email protected]>
Signed-off-by: k9 <[email protected]>
Just a side note, there is a bug for the delta-rs python binding for using catalog: delta-io/delta-rs#1466; so right now I'm not able to smoke test the |
Signed-off-by: Afaque Ahmad <[email protected]>
…queahmad7117/kedro-plugins into feat/non-spark-delta-dataset Signed-off-by: Afaque Ahmad <[email protected]>
@everdark Is this ready to be reviewed now? I saw you have closed the old PR, but this one is still in draft.
That depends on whether @afaqueahmad7117 is doing any more work. I don't have anything to add from my side as of now. We closed the old PR because the commits got messed up.
Signed-off-by: k9 <[email protected]>
Signed-off-by: k9 <[email protected]>
Signed-off-by: k9 <[email protected]>
…lint complains Signed-off-by: k9 <[email protected]>
Signed-off-by: Afaque Ahmad <[email protected]>
Signed-off-by: Nok <[email protected]>
Signed-off-by: Nok <[email protected]>
Note that the CI is failing for an unrelated reason; kedro-org/kedro#2673 should fix this, so ignore the irrelevant error or install a slightly older kedro version to test locally.
Signed-off-by: k9 <[email protected]>
Signed-off-by: k9 <[email protected]>
Signed-off-by: Afaque Ahmad <[email protected]>
The new dataset does not yet have 100 % coverage, some lines are missing:
|
Signed-off-by: Kyle Chung <[email protected]>
Coverage increased to 100% in 2b1c5e8. |
Signed-off-by: Kyle Chung <[email protected]>
Thanks a lot folks, this is looking really good! Just to help me contextualize and sorry for the basic question: but could you summarize what is the difference between this |
The By adding this into
Users should decide which one best suits their needs.
Agreed, although I think Parquet still dominates the data processing space, and Delta files are usually larger due to the compression and version history. IMO the versioning features are quite important and deserve wider adoption outside of the Spark ecosystem. I don't know how compaction and re-partitioning work with a non-Spark implementation. That feels like the responsibility of some kind of DB or data processing engine, and it's probably too much for the Dataset abstraction. WDYT?
It will be mostly based on Apache Arrow. It will make sure we don't need to load all files into memory. Some examples (in terms of analytical query) can be found here: |
Thanks for your contribution @afaqueahmad7117! Your DeltaTableDataSet class looks great.
Thank you! @everdark too! |
Thank you both, @afaqueahmad7117 @everdark! I will take over from here to fix the conflict and merge.
Signed-off-by: Nok <[email protected]>
Moving requirements around as we moved from setup.py -> pyproject.toml
Let's get this merged 🚀 thanks @afaqueahmad7117 @everdark and team! |
* feat: added delta table dataset Signed-off-by: Afaque Ahmad <[email protected]>
* test: lint Signed-off-by: k9 <[email protected]>
* chore: adjusted docstring line length Signed-off-by: Afaque Ahmad <[email protected]>
* chore: fix requirements order Signed-off-by: k9 <[email protected]>
* chore: add .pylintrc to ignore line too long for url Signed-off-by: k9 <[email protected]>
* chore: remove invalid noqa comment Signed-off-by: k9 <[email protected]>
* fix: do not import TableNotFoundError from private module to avoid pylint complains Signed-off-by: k9 <[email protected]>
* fix: fixed linting issues Signed-off-by: Afaque Ahmad <[email protected]>
* Move pylintrc to pyproject.toml Signed-off-by: Nok <[email protected]>
* Fix pylint config Signed-off-by: Nok <[email protected]>
* test: use mocker fixture to replace unittest.mock Signed-off-by: k9 <[email protected]>
* chore: lint for line too long Signed-off-by: k9 <[email protected]>
* test: increase coverage for pandas delta table dataset Signed-off-by: Kyle Chung <[email protected]>
* chore: lint Signed-off-by: Kyle Chung <[email protected]>
---------
Signed-off-by: Afaque Ahmad <[email protected]>
Signed-off-by: k9 <[email protected]>
Signed-off-by: Afaque Ahmad <[email protected]>
Signed-off-by: Nok <[email protected]>
Signed-off-by: Kyle Chung <[email protected]>
Co-authored-by: k9 <[email protected]>
Co-authored-by: Nok <[email protected]>
Description
This PR adds DeltaTableDataSet to the Kedro datasets plugin. Issue 226.
Development notes
DeltaTableDataSet here
kedro-datasets/kedro_datasets/pandas/deltatable_dataset.py
kedro-datasets/tests/pandas/test_deltatable_dataset.py
Checklist
RELEASE.md file