feat(datasets): Add NetCDFDataSet class #360
Conversation
log.info("Syncing remote NetCDF file to local storage.")

# `get_filepath_str` drops remote protocol prefix.
load_path = self._protocol + "://" + load_path
Tested with AWS and GCS, but I'm not sure if this is generalized enough. `get_filepath_str` consistently drops the protocol prefix from object-storage URIs, which is problematic for the subsequent `fs.get`.
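To make the issue concrete, here is a small sketch of the behaviour being described, with placeholder bucket and file names; `get_filepath_str` and `get_protocol_and_path` are the helpers from `kedro.io.core`, and the re-prefixing mirrors the workaround in the diff above rather than the PR's exact code:

```python
from pathlib import PurePosixPath

import fsspec
from kedro.io.core import get_filepath_str, get_protocol_and_path

filepath = "s3://my-bucket/data/file.nc"  # placeholder object-storage URI

protocol, path = get_protocol_and_path(filepath)             # ("s3", "my-bucket/data/file.nc")
load_path = get_filepath_str(PurePosixPath(path), protocol)  # -> "my-bucket/data/file.nc" (prefix dropped)

# Workaround from the diff above: re-attach the protocol before calling fs.get().
load_path = protocol + "://" + load_path
fs = fsspec.filesystem(protocol)
# fs.get(load_path, "local_copy.nc")  # would sync the remote file to local storage
```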
Yeah, filepath mangling in fsspec is tricky... we have a related open issue about this: kedro-org/kedro#3196
Thank you for opening this PR as a contribution @riley-brady! I had a quick look and left some minor comments. The main one is that we've now renamed all our datasets to end in `Dataset` instead of `DataSet`, so that's something that needs to be changed here.
Would it be helpful if I posted in our Slack channel to see if there's more interest in this dataset from the community?
Thanks so much @merelcht! There's a lot of interest at #165, and I've already slacked with @astrojuanlu. It would be great if you posted on Slack as well. I have incorporated your feedback and have finished and tested the PR locally. I'm going to add unit tests now, and then it should be ready for a thorough review. I'll update that issue thread when the tests are added.
@merelcht the PR is fully implemented with testing and is ready for final review.
* feat(datasets): create custom `DeprecationWarning`
* feat(datasets): use the custom deprecation warning
* chore(datasets): show Kedro's deprecation warnings
* fix(datasets): remove unused imports in test files

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Riley Brady <[email protected]>
I spent a decent chunk of time today trying to understand what's breaking in the credentials test, and I still don't know what's going on. That test is literally copy-pasted from … and it's not clear at all what the test is doing. The "Failed while loading data from data set NetCDFDataset" message is hiding a quite puzzling error, which can be seen by enabling the logs:

At what point in the stack is the … ?

Other datasets take a different approach and instead create a fake bucket, for example the Spark and Video datasets:
import boto3
import pytest
from moto import mock_s3

# AWS_CREDENTIALS and S3_BUCKET_NAME are module-level constants in the test file.


@pytest.fixture
def mocked_s3_bucket():
    """Create a bucket for testing using moto."""
    with mock_s3():
        conn = boto3.client(
            "s3",
            region_name="us-east-1",
            aws_access_key_id=AWS_CREDENTIALS["key"],
            aws_secret_access_key=AWS_CREDENTIALS["secret"],
        )
        conn.create_bucket(Bucket=S3_BUCKET_NAME)
        yield conn

This PR has been around for 4 months and is almost there. Is there any reasonable way we can take over from @riley-brady and get it merged?
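For illustration, here is a minimal sketch of how a test could consume a moto-backed fixture like the one quoted above; the test name, file names and assertions are placeholders rather than the PR's actual tests, and `S3_BUCKET_NAME` is assumed to be the constant referenced in the fixture:

```python
def test_upload_visible_in_mocked_bucket(mocked_s3_bucket, tmp_path):
    # Write a small placeholder file locally (not a real NetCDF payload).
    local_file = tmp_path / "test.nc"
    local_file.write_bytes(b"placeholder bytes")

    # Upload against the in-memory moto bucket; no real AWS calls are made.
    mocked_s3_bucket.upload_file(str(local_file), S3_BUCKET_NAME, "test.nc")

    response = mocked_s3_bucket.list_objects_v2(Bucket=S3_BUCKET_NAME)
    assert response["KeyCount"] == 1
    assert response["Contents"][0]["Key"] == "test.nc"
```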
Well, and now I realized that it was marked as … . @ankatiyar has done a fair share of debugging last week as well. I'm marking the one on NetCDF as xfail too.
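For reference, a minimal illustration of the pytest markers being discussed here; the test names and reasons are placeholders, not the ones used in the PR:

```python
import pytest


@pytest.mark.xfail(reason="S3 credentials test fails in a puzzling way; see discussion above")
def test_load_with_credentials():
    ...


# Later in the thread the problematic tests are skipped outright instead,
# so the hanging test body never runs at all:
@pytest.mark.skip(reason="Loading the dataset from S3 gets stuck")
def test_load_from_s3():
    ...
```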
Yeah, the tests are broken. I'm taking this to another branch to do some debugging.
Thanks for all the work on this, everyone.
I'd like to echo that sentiment. As a PyMC user I often produce NetCDF files from xarray objects. I really appreciate the efforts that have gone into this so far, and I am looking forward to PyMC integrating better with Kedro. :)
@galenseilis what in particular would you like to see integrated better with Kedro?
I just meant the existence of …
For transparency on why this PR isn't merged yet: there are some issues with the tests; they seem to just hang and it's not clear why. @ankatiyar is currently investigating the issue and finding a way to possibly re-write the tests. Thank you so much for your work and patience @riley-brady, the Kedro team will make sure this gets merged ASAP.
I've made some changes and skipped the tests that load the dataset from S3, because those are getting stuck.
- # `get_filepath_str` drops remote protocol prefix.
- save_path = self._protocol + "://" + save_path
+ save_path = self._filepath
I love that this logic is simpler now 💯
Left only one question @ankatiyar; if it's okay, proceed and merge.
Thanks everyone!!
* initialize template and early additions
* add placeholder for remote file system load
* switch to versioned dataset
* add initial remote -> local get for S3
* further generalize remote retrieval
* add in credentials
* make temppath optional for remote datasets
* add initial idea for multifile glob
* style: Introduce `ruff` for linting in all plugins. (kedro-org#354)
* add suggested style changes
* add temppath to attributes
* more temppath fixes
* more temppath updates
* add better tempfile deletion and work on saving files
* make __del__ flexible
* formatting
* feat(datasets): create custom `DeprecationWarning` (kedro-org#356)
  * feat(datasets): create custom `DeprecationWarning`
  * feat(datasets): use the custom deprecation warning
  * chore(datasets): show Kedro's deprecation warnings
  * fix(datasets): remove unused imports in test files
* docs(datasets): add note about DataSet deprecation (kedro-org#357)
* test(datasets): skip `tensorflow` tests on Windows (kedro-org#363)
* ci: Pin `tables` version (kedro-org#370)
  * Pin tables version
  * Also fix kedro-airflow
  * Revert trying to fix airflow
* build(datasets): Release `1.7.1` (kedro-org#378)
* docs: Update CONTRIBUTING.md and add one for `kedro-datasets` (kedro-org#379)
* ci(datasets): Run tensorflow tests separately from other dataset tests (kedro-org#377)
* feat: Kedro-Airflow convert all pipelines option (kedro-org#335)
  * feat: kedro airflow convert --all option
  * docs: release docs
* docs(datasets): blacken code in rst literal blocks (kedro-org#362)
* docs: cloudpickle is an interesting extension of the pickle functionality (kedro-org#361)
* fix(datasets): Fix secret scan entropy error (kedro-org#383)
* style: Rename mentions of `DataSet` to `Dataset` in `kedro-airflow` and `kedro-telemetry` (kedro-org#384)
* feat(datasets): Migrated `PartitionedDataSet` and `IncrementalDataSet` from main repository to kedro-datasets (kedro-org#253)
* fix: backwards compatibility for `kedro-airflow` (kedro-org#381)
* fix(datasets): Don't warn for SparkDataset on Databricks when using s3 (kedro-org#341)
* update docs API and release notes
* add netcdf requirements to setup
* lint
* add initial tests
* update dataset exists for multifile
* Add full test suite for NetCDFDataSet
* Add docstring examples
* change xarray version req
* update dask req
* rename DataSet -> Dataset
* Update xarray reqs for earlier python versions
* fix setup
* update test coverage
* exclude init from test coverage
* Sub in pathlib for os.remove
* add metadata to dataset
* add doctest for the new datasets
* add patch for supporting http/https
* Small fixes post-merge
* Lint
* Fix import
* Un-ignore NetCDF doctest
* Add fixture
* Mark problematic test as xfail
* Skip problematic test instead of making it fail
* Skip problematic tests and fix failing tests
* Remove comment

---------

Signed-off-by: Riley Brady <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Simon Brugman <[email protected]>
Signed-off-by: H. Felix Wittmann <[email protected]>
Signed-off-by: Peter Bludau <[email protected]>
Signed-off-by: Alistair McKelvie <[email protected]>
Signed-off-by: Nok Lam Chan <[email protected]>
Signed-off-by: Nok <[email protected]>
Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: tgoelles <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Co-authored-by: Simon Brugman <[email protected]>
Co-authored-by: Felix Wittmann <[email protected]>
Co-authored-by: PtrBld <[email protected]>
Co-authored-by: Alistair McKelvie <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Description
There's a large community in geoscience, astrophysics, and beyond that leverages NetCDF for datasets stored with structured coordinates and metadata. A massive trove of existing climate and weather data, for example, exists as NetCDF (.nc) files.
See #165.
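For context, a minimal usage sketch of the dataset this PR adds, assuming it is exposed as `kedro_datasets.netcdf.NetCDFDataset` (the renamed class) with `filepath` as its main argument; this mirrors typical kedro-datasets usage rather than quoting the PR's docstrings:

```python
import xarray as xr
from kedro_datasets.netcdf import NetCDFDataset

# Build a tiny xarray.Dataset to round-trip through the dataset class.
ds = xr.Dataset({"temperature": ("time", [280.1, 281.3, 279.8])})

dataset = NetCDFDataset(filepath="data/01_raw/temperature.nc")
dataset.save(ds)           # written via xarray's NetCDF writer
reloaded = dataset.load()  # returns an xarray.Dataset
```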
Development notes
I'd like to get this implemented with file syncing for load-from-remote, since it's the most straightforward approach. A future PR could use kerchunk to allow direct loading from remote storage. kerchunk is a really nice toolkit, but it requires managing a lot of generated JSON metadata files, which can sometimes be quite slow to produce. It will take a bit of tweaking to implement this nicely, since the first run would need to generate and cache/store all of the reference JSONs to make subsequent loads much faster.
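A rough sketch of the "sync remote to local, then open" approach described above; the bucket path, protocol, and temp-directory handling are placeholders and not the dataset's actual implementation:

```python
import tempfile
from pathlib import Path

import fsspec
import xarray as xr

remote_path = "s3://my-bucket/climate/temperature.nc"  # placeholder URI
protocol = "s3"

# Sync the remote NetCDF file to a local temporary directory first...
tmp_dir = Path(tempfile.mkdtemp())
local_path = tmp_dir / Path(remote_path).name
fs = fsspec.filesystem(protocol)
fs.get(remote_path, str(local_path))

# ...then open the local copy with xarray as usual.
ds = xr.open_dataset(local_path)
```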
Checklist
… RELEASE.md file