Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): Add NetCDFDataSet class #360

Merged
merged 68 commits into from
Feb 28, 2024

Conversation

riley-brady
Copy link
Contributor

@riley-brady riley-brady commented Sep 29, 2023

Description

There's a large geoscience/astrophysics and beyond community that leverage NetCDF for datasets that are stored with structured coordinates and metadata. A massive trove of the existing climate and weather data exists as NetCDF (.nc) files, for example.

See #165.

Development notes

I'd like to get this implemented with file syncing for load-from-remote, since it's most straight-forward. A future PR could work with kerchunk to allow direct loading from remote storage. This is a really nice toolkit, but requires management of a lot of JSON metadata files that are generated, and which sometimes can be quite slow to generate. It will take a little bit of tweaking to implement this nicely, since the first run would need to generate and cache/store all of the reference JSONs to make future loads much much faster.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@riley-brady riley-brady reopened this Sep 29, 2023
@riley-brady riley-brady marked this pull request as draft September 29, 2023 19:06
@riley-brady riley-brady changed the title Add NetCDF and Zarr datasets Add NetCDF dataset Oct 2, 2023
@riley-brady riley-brady marked this pull request as ready for review October 2, 2023 21:13
log.info("Syncing remote NetCDF file to local storage.")

# `get_filepath_str` drops remote protocol prefix.
load_path = self._protocol + "://" + load_path
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with AWS and GCS, but not sure if this is generalized enough. get_filepath_str consistently drops the prefix of URI's from object storage, which is problematic for the subsequent fs.get

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah filepath mangling in fsspec is tricky... we have a related open issue about this kedro-org/kedro#3196

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for opening this PR as contribution @riley-brady ! I had a quick look and left some minor comments. The main being that we've now renamed all our datasets to end in Dataset instead of DataSet, so that's something that needs to be changed here.

Would it be helpful if I post in our Slack channel to see if there's more interest from the community for this dataset?

kedro-datasets/kedro_datasets/netcdf/netcdf_dataset.py Outdated Show resolved Hide resolved
kedro-datasets/kedro_datasets/netcdf/netcdf_dataset.py Outdated Show resolved Hide resolved
kedro-datasets/kedro_datasets/netcdf/netcdf_dataset.py Outdated Show resolved Hide resolved
@merelcht merelcht mentioned this pull request Oct 11, 2023
3 tasks
@riley-brady
Copy link
Contributor Author

Would it be helpful if I post in our Slack channel to see if there's more interest from the community for this dataset?

Thanks so much @merelcht! There's a lot of interest at #165 and I slacked with @astrojuanlu. It would be great if you posted on slack as well.

I have added your feedback and finished out and tested locally the PR. Going to add unit testing now and it should be ready for a thorough review. I'll update on that issue thread when the tests are added.

@riley-brady riley-brady mentioned this pull request Oct 13, 2023
@riley-brady
Copy link
Contributor Author

@merelcht the PR is fully implemented with testing and is ready for final review.

riley-brady and others added 17 commits October 15, 2023 12:21
Signed-off-by: Riley Brady <[email protected]>
Signed-off-by: Riley Brady <[email protected]>
Signed-off-by: Riley Brady <[email protected]>
Signed-off-by: Riley Brady <[email protected]>
Signed-off-by: Riley Brady <[email protected]>
Signed-off-by: Riley Brady <[email protected]>
Signed-off-by: Riley Brady <[email protected]>
Signed-off-by: Riley Brady <[email protected]>
* feat(datasets): create custom `DeprecationWarning`

Signed-off-by: Deepyaman Datta <[email protected]>

* feat(datasets): use the custom deprecation warning

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): show Kedro's deprecation warnings

Signed-off-by: Deepyaman Datta <[email protected]>

* fix(datasets): remove unused imports in test files

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Riley Brady <[email protected]>
@astrojuanlu
Copy link
Member

I spent a decent chunk of time today trying to understand what's breaking in the credentials test, and I still don't know what's going on. That test is literally copy-pasted from test_parquet_dataset.py, and yet it fails.

It's not clear at all what the test is doing though. The "Failed while loading data from data set NetCDFDataset" is hiding a quite puzzling error, that can be seen by enabling the logs:

$ pushd kedro-datasets && pytest -s --log-cli-level=DEBUG -vvv -k "pass_credentials and netcdf" --pdb; popd
...
DEBUG    s3fs:core.py:337 CALL: get_object - () - {'Bucket': 'test_bucket', 'Key': 'test.nc'}
                    DEBUG    Event before-parameter-build.s3.GetObject: calling handler <function sse_md5 at 0x136c36670>                                                                                                                                                                                    hooks.py:63
...
[02/15/24 14:42:42] DEBUG    Response headers: HTTPHeaderDict({'x-amz-request-id': 'C36M0ESFJX55TN4N', 'x-amz-id-2': 'b4c9yMZKWWSis5iwwE8suoOo3o5jO6LF/xxDnZqGFZwOinzNj+N+TIBHQQ3DaAc1Z6YkwPOo83I=', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'date': 'Thu, 15 Feb 2024         parsers.py:239
                             13:42:42 GMT', 'server': 'AmazonS3'})                                                                                                                                                                                                                                                      
DEBUG    botocore.parsers:parsers.py:239 Response headers: HTTPHeaderDict({'x-amz-request-id': 'C36M0ESFJX55TN4N', 'x-amz-id-2': 'b4c9yMZKWWSis5iwwE8suoOo3o5jO6LF/xxDnZqGFZwOinzNj+N+TIBHQQ3DaAc1Z6YkwPOo83I=', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'date': 'Thu, 15 Feb 2024 13:42:42 GMT', 'server': 'AmazonS3'})
                    DEBUG    Response body:                                                                                                                                                                                                                                                               parsers.py:240
                             <coroutine object AioAWSResponse._content_prop at 0x2bdbd1f40>                                                                                                                                                                                                                             
DEBUG    botocore.parsers:parsers.py:240 Response body:
<coroutine object AioAWSResponse._content_prop at 0x2bdbd1f40>
                    DEBUG    Nonretryable error: a bytes-like object is required, not 'coroutine'                                                                                                                                                                                                            core.py:125
DEBUG    s3fs:core.py:125 Nonretryable error: a bytes-like object is required, not 'coroutine'
FAILED

At what point in the stack is the create_client method supposed to be called, I have no idea.

Other datasets take a different approach and instead create a fake bucket, for example Spark and Video datasets.

@pytest.fixture
def mocked_s3_bucket():
    """Create a bucket for testing using moto."""
    with mock_s3():
        conn = boto3.client(
            "s3",
            region_name="us-east-1",
            aws_access_key_id=AWS_CREDENTIALS["key"],
            aws_secret_access_key=AWS_CREDENTIALS["secret"],
        )
        conn.create_bucket(Bucket=S3_BUCKET_NAME)
        yield conn

This PR has been around for 4 months and is almost there. Is there any reasonable way we can take over from @riley-brady and get it merged?

Signed-off-by: Ankita Katiyar <[email protected]>
@astrojuanlu
Copy link
Member

That test is literally copy-pasted from test_parquet_dataset.py, and yet it fails.

Well, and now I realized that it was marked as xfail in #463 and #459 after @merelcht spent one week trying to understand what to do with it. We just fell in the same trap again...

@ankatiyar has done a fair share of debugging as well last week.

I'm marking the one on NetCDF as xfail too.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
@astrojuanlu
Copy link
Member

yeah the tests are broken. I'm taking this to another branch to do some debugging.

@riley-brady
Copy link
Contributor Author

Thanks for all the work on this, everyone.

@galenseilis
Copy link
Contributor

Thanks for all the work on this, everyone.

I'd like to echo that sentiment. As I PyMC user I often produce NetCDF files from xarrays. I really appreciate the efforts that have gone into this so far, and I am looking forward to PyMC integrating better with Kedro. :)

@noklam
Copy link
Contributor

noklam commented Feb 26, 2024

@galenseilis what would you like to see integrated better with kedro particularly?

@galenseilis
Copy link
Contributor

@galenseilis what would you like to see integrated better with kedro particularly?

I just meant the existence of NetCDFDataSet. NetCDF is the conventional choice for saving PyMC modelling results. I'm excited about that! 🤩

@merelcht
Copy link
Member

For transparency on why this PR isn't merged yet: there's some issues with the tests, they seem to just hang and it's not clear why. @ankatiyar is currently investigating the issue and finding a way to maybe re-write the tests. Thank you so much for your work and patience @riley-brady, the Kedro team will make sure this gets merged in asap.

@ankatiyar
Copy link
Contributor

I've made some changes and skipped the tests that load the dataset from s3 because those are getting stuck.

# `get_filepath_str` drops remote protocol prefix.
save_path = self._protocol + "://" + save_path

save_path = self._filepath
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love that this logic is simpler now 💯

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left only 1 question @ankatiyar , if it's okay proceed and merge

Signed-off-by: Ankita Katiyar <[email protected]>
@astrojuanlu astrojuanlu enabled auto-merge (squash) February 28, 2024 16:22
@astrojuanlu astrojuanlu merged commit 16dfbdd into kedro-org:main Feb 28, 2024
12 checks passed
@riley-brady
Copy link
Contributor Author

Thanks everyone!!

tgoelles pushed a commit to tgoelles/kedro-plugins that referenced this pull request Jun 6, 2024
* initialize template and early additions

Signed-off-by: Riley Brady <[email protected]>

* add placeholder for remote file system load

Signed-off-by: Riley Brady <[email protected]>

* switch to versioned dataset

Signed-off-by: Riley Brady <[email protected]>

* add initial remote -> local get for S3

Signed-off-by: Riley Brady <[email protected]>

* further generalize remote retrieval

Signed-off-by: Riley Brady <[email protected]>

* add in credentials

Signed-off-by: Riley Brady <[email protected]>

* make temppath optional for remote datasets

Signed-off-by: Riley Brady <[email protected]>

* add initial idea for multifile glob

Signed-off-by: Riley Brady <[email protected]>

* style: Introduce `ruff` for linting in all plugins. (kedro-org#354)

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* add suggested style changes

Signed-off-by: Riley Brady <[email protected]>

* add temppath to attributes

Signed-off-by: Riley Brady <[email protected]>

* more temppath fixes

Signed-off-by: Riley Brady <[email protected]>

* more temppath updates

Signed-off-by: Riley Brady <[email protected]>

* add better tempfile deletion and work on saving files

Signed-off-by: Riley Brady <[email protected]>

* make __del__ flexible

Signed-off-by: Riley Brady <[email protected]>

* formatting

Signed-off-by: Riley Brady <[email protected]>

* feat(datasets): create custom `DeprecationWarning` (kedro-org#356)

* feat(datasets): create custom `DeprecationWarning`

Signed-off-by: Deepyaman Datta <[email protected]>

* feat(datasets): use the custom deprecation warning

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): show Kedro's deprecation warnings

Signed-off-by: Deepyaman Datta <[email protected]>

* fix(datasets): remove unused imports in test files

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* docs(datasets): add note about DataSet deprecation (kedro-org#357)

Signed-off-by: Riley Brady <[email protected]>

* test(datasets): skip `tensorflow` tests on Windows (kedro-org#363)

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* ci: Pin `tables` version (kedro-org#370)

* Pin tables version

Signed-off-by: Ankita Katiyar <[email protected]>

* Also fix kedro-airflow

Signed-off-by: Ankita Katiyar <[email protected]>

* Revert trying to fix airflow

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* build(datasets): Release `1.7.1` (kedro-org#378)

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* docs: Update CONTRIBUTING.md and add one for `kedro-datasets` (kedro-org#379)

Update CONTRIBUTING.md + add one for kedro-datasets

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* ci(datasets): Run tensorflow tests separately from other dataset tests (kedro-org#377)

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* feat: Kedro-Airflow convert all pipelines option (kedro-org#335)

* feat: kedro airflow convert --all option

Signed-off-by: Simon Brugman <[email protected]>

* docs: release docs

Signed-off-by: Simon Brugman <[email protected]>

---------

Signed-off-by: Simon Brugman <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* docs(datasets): blacken code in rst literal blocks (kedro-org#362)

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* docs: cloudpickle is an interesting extension of the pickle functionality (kedro-org#361)

Signed-off-by: H. Felix Wittmann <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* fix(datasets): Fix secret scan entropy error (kedro-org#383)

Fix secret scan entropy error

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* style: Rename mentions of `DataSet` to `Dataset` in `kedro-airflow` and `kedro-telemetry` (kedro-org#384)

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* feat(datasets): Migrated `PartitionedDataSet` and `IncrementalDataSet` from main repository to kedro-datasets (kedro-org#253)

Signed-off-by: Peter Bludau <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* fix: backwards compatibility for `kedro-airflow` (kedro-org#381)

Signed-off-by: Simon Brugman <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* fix(datasets): Don't warn for SparkDataset on Databricks when using s3 (kedro-org#341)

Signed-off-by: Alistair McKelvie <[email protected]>
Signed-off-by: Riley Brady <[email protected]>

* update docs API and release notes

Signed-off-by: Riley Brady <[email protected]>

* add netcdf requirements to setup

Signed-off-by: Riley Brady <[email protected]>

* lint

Signed-off-by: Riley Brady <[email protected]>

* add initial tests

Signed-off-by: Riley Brady <[email protected]>

* update dataset exists for multifile

Signed-off-by: Riley Brady <[email protected]>

* Add full test suite for NetCDFDataSet

Signed-off-by: Riley Brady <[email protected]>

* Add docstring examples

Signed-off-by: Riley Brady <[email protected]>

* change xarray version req

Signed-off-by: Riley Brady <[email protected]>

* update dask req

Signed-off-by: Riley Brady <[email protected]>

* rename DataSet -> Dataset

Signed-off-by: Riley Brady <[email protected]>

* Update xarray reqs for earlier python versions

Signed-off-by: Riley Brady <[email protected]>

* fix setup

Signed-off-by: Riley Brady <[email protected]>

* update test coverage

Signed-off-by: Riley Brady <[email protected]>

* exclude init from test coverage

Signed-off-by: Riley Brady <[email protected]>

* Sub in pathlib for os.remove

Signed-off-by: Riley Brady <[email protected]>

* add metadata to dataset

Signed-off-by: Riley Brady <[email protected]>

* add doctest for the new datasets

Signed-off-by: Nok <[email protected]>

* add patch for supporting http/https

Signed-off-by: Riley Brady <[email protected]>

* Small fixes post-merge

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Lint

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Fix import

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Un-ignore NetCDF doctest

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add fixture

Signed-off-by: Ankita Katiyar <[email protected]>

* Mark problematic test as xfail

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Skip problematic test instead of making it fail

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Skip problematic tests and fix failing tests

Signed-off-by: Ankita Katiyar <[email protected]>

* Remove comment

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Riley Brady <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Simon Brugman <[email protected]>
Signed-off-by: H. Felix Wittmann <[email protected]>
Signed-off-by: Peter Bludau <[email protected]>
Signed-off-by: Alistair McKelvie <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Nok Lam Chan <[email protected]>
Signed-off-by: Nok <[email protected]>
Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Co-authored-by: Simon Brugman <[email protected]>
Co-authored-by: Felix Wittmann <[email protected]>
Co-authored-by: PtrBld <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
Co-authored-by: Alistair McKelvie <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Signed-off-by: tgoelles <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.