feat(datasets): Add rioxarray and RasterDataset #355

Merged Jul 5, 2024 (238 commits)
Changes from 233 commits
Commits
0bffa11
refactor(datasets): deprecate "DataSet" type names (#328)
deepyaman Sep 20, 2023
cc13fd6
added basic code for geotiff
tgoelles Sep 26, 2023
0a83de4
renamed to xarray
tgoelles Sep 27, 2023
406dffd
renamed to xarray
tgoelles Sep 27, 2023
9b32d79
added load and self args
tgoelles Sep 27, 2023
b2fc87f
only local files
tgoelles Sep 27, 2023
661eebb
added empty test
tgoelles Sep 27, 2023
4f6df5c
added test data
tgoelles Sep 27, 2023
eea4690
added rioxarray requirements
tgoelles Sep 28, 2023
1fa31c7
reformat with black
tgoelles Sep 28, 2023
2049431
rioxarray 0.14
tgoelles Sep 28, 2023
a9bd48f
rioxarray 0.15
tgoelles Sep 28, 2023
9c150ea
rioxarray 0.12
tgoelles Sep 28, 2023
7ba1506
rioxarray 0.9
tgoelles Sep 28, 2023
ca43610
fixed dataset typo
tgoelles Sep 28, 2023
f322147
fixed docstring for sphinx
tgoelles Sep 28, 2023
b01a018
run black
tgoelles Sep 28, 2023
461fd02
sort imports
tgoelles Sep 28, 2023
c6d54ea
class docstring
tgoelles Sep 28, 2023
7308b59
black
tgoelles Sep 28, 2023
4373360
fixed pylint
tgoelles Sep 28, 2023
d0a2832
added release notes
tgoelles Sep 28, 2023
427ea75
added yaml example
tgoelles Sep 28, 2023
c63fc1b
improve testing WIP
tgoelles Oct 5, 2023
897a8f5
basic test success
tgoelles Oct 5, 2023
dc47f76
test reloaded
tgoelles Oct 5, 2023
9541e25
test exists
tgoelles Oct 5, 2023
2a3e3ee
added version
tgoelles Oct 5, 2023
3e3569b
basic test suite
tgoelles Oct 5, 2023
e4710ee
run black
tgoelles Oct 5, 2023
cdf8e96
added example and test it
tgoelles Oct 12, 2023
6cf15e2
deleted duplications
tgoelles Oct 12, 2023
0616cea
fixed position of example
tgoelles Oct 12, 2023
bef94bb
black
tgoelles Oct 13, 2023
ed26c52
style: Introduce `ruff` for linting in all plugins. (#354)
merelcht Oct 2, 2023
ae568cc
feat(datasets): create custom `DeprecationWarning` (#356)
deepyaman Oct 2, 2023
7ee7c6d
docs(datasets): add note about DataSet deprecation (#357)
deepyaman Oct 3, 2023
1587eee
test(datasets): skip `tensorflow` tests on Windows (#363)
deepyaman Oct 4, 2023
c2f52f5
ci: Pin `tables` version (#370)
ankatiyar Oct 5, 2023
02aa70d
build(datasets): Release `1.7.1` (#378)
merelcht Oct 6, 2023
82ac8c9
docs: Update CONTRIBUTING.md and add one for `kedro-datasets` (#379)
ankatiyar Oct 6, 2023
132bf37
ci(datasets): Run tensorflow tests separately from other dataset test…
merelcht Oct 6, 2023
53f0fe5
feat: Kedro-Airflow convert all pipelines option (#335)
sbrugman Oct 9, 2023
727ed81
docs(datasets): blacken code in rst literal blocks (#362)
deepyaman Oct 10, 2023
f241464
docs: cloudpickle is an interesting extension of the pickle functiona…
hfwittmann Oct 10, 2023
f26406f
fix(datasets): Fix secret scan entropy error (#383)
merelcht Oct 11, 2023
b60a5f4
style: Rename mentions of `DataSet` to `Dataset` in `kedro-airflow` a…
merelcht Oct 11, 2023
1d3094c
feat(datasets): Migrated `PartitionedDataSet` and `IncrementalDataSet…
PtrBld Oct 11, 2023
9419821
fix: backwards compatibility for `kedro-airflow` (#381)
sbrugman Oct 12, 2023
bbdc89a
added metadata
tgoelles Oct 19, 2023
68b731a
after linting
tgoelles Oct 19, 2023
a43ef5b
ignore ruff PLR0913
tgoelles Oct 19, 2023
c4d2c18
fix(datasets): Don't warn for SparkDataset on Databricks when using s…
alamastor Oct 12, 2023
b41cd20
chore: Hot fix for RTD due to bad pip version (#396)
noklam Oct 17, 2023
2cd79bb
chore: Pin pip version temporarily (#398)
ankatiyar Oct 18, 2023
4214432
perf(datasets): don't create connection until need (#281)
deepyaman Oct 19, 2023
ba64102
chore: Drop Python 3.7 support for kedro-plugins (#392)
lrcouto Oct 19, 2023
a216b21
feat(datasets): support Polars lazy evaluation (#350)
MatthiasRoels Oct 20, 2023
58af6eb
build(datasets): Release `1.8.0` (#406)
merelcht Oct 24, 2023
8948d4a
build(airflow): Release 0.7.0 (#407)
ankatiyar Oct 24, 2023
f26b950
build(telemetry): Release 0.3.0 (#408)
ankatiyar Oct 24, 2023
72abd95
build(docker): Release 0.4.0 (#409)
ankatiyar Oct 24, 2023
f5d7e71
style(airflow): blacken README.md of Kedro-Airflow (#418)
deepyaman Oct 25, 2023
2940376
fix(datasets): Fix missing jQuery (#414)
astrojuanlu Oct 25, 2023
158056f
fix(datasets): Fix Lazy Polars dataset to use the new-style base clas…
astrojuanlu Oct 25, 2023
469f201
chore(datasets): lazily load `partitions` classes (#411)
deepyaman Oct 25, 2023
e855ed9
docs(datasets): fix code blocks and `data_set` use (#417)
deepyaman Oct 25, 2023
388192e
fix: TF model load failure when model is saved as a TensorFlow Saved …
Edouard59 Oct 26, 2023
48557d2
chore: Drop support for Python 3.7 on kedro-datasets (#419)
lrcouto Oct 27, 2023
e8946f7
test(datasets): run doctests to check examples run (#416)
deepyaman Oct 27, 2023
9bdeefb
feat(datasets): Add support for `databricks-connect>=13.0` (#352)
MigQ2 Nov 1, 2023
ffeef40
fix(telemetry): remove double execution by moving to after catalog cr…
fdroessler Nov 1, 2023
41c9e7e
docs: Add python version support policy to plugin `README.md`s (#425)
merelcht Nov 2, 2023
5ea5594
docs(airflow): Use new docs link (#393)
astrojuanlu Nov 10, 2023
f6d4f79
style: Add shared CSS and meganav to datasets docs (#400)
stichbury Nov 10, 2023
ecad5d5
feat(datasets): Add Hugging Face datasets (#344)
astrojuanlu Nov 13, 2023
f25ebb6
test(datasets): fix `dask.ParquetDataset` doctests (#439)
deepyaman Nov 22, 2023
da3f8c2
refactor: Remove `DataSet` aliases and mentions (#440)
merelcht Nov 24, 2023
bc02b27
chore(datasets): replace "Pyspark" with "PySpark" (#423)
deepyaman Nov 25, 2023
abfa97f
test(datasets): make `api.APIDataset` doctests run (#448)
deepyaman Nov 27, 2023
677d0a2
chore(datasets): Fix `pandas.GenericDataset` doctest (#445)
merelcht Nov 27, 2023
79f58a9
feat(datasets): make datasets arguments keywords only (#358)
felixscherz Nov 27, 2023
d7b0395
chore: Drop support for python 3.8 on kedro-datasets (#442)
DimedS Nov 27, 2023
bf4e1b7
test(datasets): add outputs to matplotlib doctests (#449)
deepyaman Nov 28, 2023
000f715
chore(datasets): Fix more doctest issues (#451)
merelcht Nov 28, 2023
b089ff5
test(datasets): fix failing doctests in Windows CI (#457)
deepyaman Nov 30, 2023
ac53eb0
chore(datasets): fix accidental reference to NumPy (#450)
deepyaman Nov 30, 2023
e96f705
chore(datasets): don't pollute dev env in doctests (#452)
deepyaman Nov 30, 2023
339f7d5
feat: Add tools to heap event (#430)
lrcouto Nov 30, 2023
46e3b84
ci(datasets): install deps in single `pip install` (#454)
deepyaman Dec 5, 2023
272d0c8
build(datasets): Bump s3fs (#463)
merelcht Dec 7, 2023
a8aa3dd
test(datasets): make SQL dataset examples runnable (#455)
deepyaman Dec 7, 2023
92583c4
fix(datasets): correct pandas-gbq as py311 dependency (#460)
kuruonur1 Dec 7, 2023
63a8e67
docs(datasets): Document `IncrementalDataset` (#468)
astrojuanlu Dec 8, 2023
31c4b31
chore: Update datasets to be arguments keyword only (#466)
merelcht Dec 8, 2023
b07bf75
chore: Clean up code for old dataset syntax compatibility (#465)
merelcht Dec 8, 2023
2f69ced
chore: Update scikit-learn version (#469)
noklam Dec 8, 2023
3591ec2
feat(datasets): support versioning data partitions (#447)
deepyaman Dec 11, 2023
3c0788f
docs(datasets): Improve documentation index (#428)
astrojuanlu Dec 11, 2023
2624e4a
docs(datasets): update wrong docstring about `con` (#461)
deepyaman Dec 11, 2023
949ad5d
build(datasets): Release `2.0.0` (#472)
merelcht Dec 11, 2023
c75bd3d
ci(telemetry): Pin `PyYAML` (#474)
ankatiyar Dec 12, 2023
8546022
build(telemetry): Release 0.3.1 (#475)
SajidAlamQB Dec 12, 2023
b04daab
docs(datasets): Fix broken links in README (#477)
astrojuanlu Dec 12, 2023
abc549b
chore(datasets): replace more "data_set" instances (#476)
deepyaman Dec 12, 2023
f12fb39
chore(datasets): Fix doctests (#488)
merelcht Dec 19, 2023
9cab70c
chore(datasets): Fix delta + incremental dataset docstrings (#489)
merelcht Dec 20, 2023
4027516
chore(airflow): Post 0.19 cleanup (#478)
ankatiyar Dec 20, 2023
16f218e
build(airflow): Release 0.8.0 (#491)
ankatiyar Dec 20, 2023
8ad0ecd
fix: telemetry metadata (#495)
DimedS Dec 21, 2023
17aa70e
fix: Update tests on kedro-docker for 0.5.0 release. (#496)
lrcouto Dec 21, 2023
f3ba239
build: Release kedro-docker 0.5.0 (#497)
lrcouto Dec 21, 2023
c5dfa43
chore(datasets): Update partitioned dataset docstring (#502)
merelcht Jan 3, 2024
6504217
Fix GeotiffDataset import + casing
merelcht Jan 4, 2024
3dc954c
Fix lint
merelcht Jan 4, 2024
fb3fb5c
fix(datasets): Relax pandas.HDFDataSet dependencies which are broken …
Galileo-Galilei Jan 4, 2024
b7cf4b3
fix: airflow metadata (#498)
AhdraMeraliQB Jan 4, 2024
8cdb8db
chore(airflow): Bump `apache-airflow` version (#511)
ankatiyar Jan 11, 2024
90c037c
ci(datasets): Unpin dask (#522)
ankatiyar Jan 22, 2024
695cdd3
feat(datasets): Add `MatlabDataset` to `kedro-datasets` (#515)
samuel-lee-sj Jan 22, 2024
3978a00
ci(airflow): Pin `Flask-Session` version (#521)
ankatiyar Jan 23, 2024
2f58c1e
feat: `kedro-airflow` group in memory nodes (#241)
sbrugman Jan 29, 2024
8e7123d
ci(datasets): Update pyproject.toml to pin Kedro 0.19 for kedro-datas…
noklam Jan 29, 2024
9c51583
feat(airflow): include environment name in DAG filename (#492)
sbrugman Jan 31, 2024
df12088
feat(datasets): Enable search-as-you type on Kedro-datasets docs (#532)
rashidakanchwala Feb 1, 2024
9339a46
fix(datasets): Debug and fix `kedro-datasets` nightly build failures …
SajidAlamQB Feb 8, 2024
f6218b9
feat(datasets): Dataset Preview Refactor (#504)
rashidakanchwala Feb 8, 2024
448e7aa
fix(datasets): Drop pyarrow constraint when using snowpark (#538)
felipemonroy Feb 8, 2024
1a5d0a1
docs: Update kedro-telemetry docs on which data is collected (#546)
DimedS Feb 12, 2024
5029f30
ci(docker): Trying to fix e2e tests (#548)
ankatiyar Feb 12, 2024
d023b5c
chore: bump actions versions (#539)
ankatiyar Feb 13, 2024
0d95de5
docs(telemetry): Direct readers to Kedro documentation for further in…
astrojuanlu Feb 15, 2024
9b7aecc
fix: kedro-telemetry masking (#552)
DimedS Feb 15, 2024
1998edb
fix: telemetry data and add example_pipeline (#557)
DimedS Feb 23, 2024
217bea3
feat(airflow): make sure `kedro airflow create` is deterministic (#525)
noklam Feb 27, 2024
1377b20
feat(telemetry): Add is_ci_env environment to reported project teleme…
AhdraMeraliQB Feb 27, 2024
551f86b
chore(datasets): Update setup.py moto dependency (#537)
noklam Feb 27, 2024
3b0bd53
feat(datasets): allow additional parameters for sqlalchemy engine whe…
mjspier Feb 28, 2024
6884f6b
build(datasets): Release `kedro-datasets` 2.1.0 (#575)
ankatiyar Feb 28, 2024
267d35d
feat(datasets): Add NetCDFDataSet class (#360)
riley-brady Feb 28, 2024
b5b82c6
Fix Geotiff dataset docstring
merelcht Feb 28, 2024
a864a4f
Improve test coverage
merelcht Feb 29, 2024
d54de28
build(telemetry): Release 0.3.2 (#588)
ankatiyar Feb 29, 2024
b1f16c9
chore(datasets): Normalise optional requirements names and move them …
ankatiyar Feb 29, 2024
cd83be8
ci(datasets): Accelerate CI using `uv` (#569)
astrojuanlu Mar 1, 2024
25a23e8
ci(docker): Only run one pipeline with the `ParallelRunner` in the e2…
ankatiyar Mar 4, 2024
23b6393
fix(datasets): sql_dataset load_args:params must be a tuple (MSSQL on…
andrewcao1 Mar 4, 2024
5256c83
build(datasets): fix typos in definition of extras (#593)
deepyaman Mar 4, 2024
7e992a0
fix(airflow): Fix incorrect node names (#594)
ankatiyar Mar 5, 2024
4e0f798
fix(datasets): Various fixes (#608)
astrojuanlu Mar 15, 2024
3dc6731
feat(datasets): Add kwargs for huggingface.HFDataset (#580)
eromerobilbomatica Mar 16, 2024
1f4c8f8
feat(telemetry): Unique User IDs in kedro-telemetry - merge only for …
DimedS Mar 18, 2024
5a6041d
build(telemetry): Add python 3.12 support (#615)
ankatiyar Mar 19, 2024
f7c6f5f
feat(airflow): add --tags option for node filtering in airflow create…
DimedS Mar 19, 2024
aa487d8
fix(datasets): make connection_args optional (#586)
jerome-asselin-buspatrol Mar 20, 2024
34a7731
chore(telemetry): Remove dummy e2e test (#622)
ankatiyar Mar 21, 2024
17d91b9
refactor(airflow): Remove `bootstrap_project` (#599)
ankatiyar Mar 21, 2024
9c1bfa3
build(datasets): Add Python 3.12 support (#617)
ankatiyar Mar 27, 2024
2e96c9f
feat(datasets): Extend preview mechanism (#595)
SajidAlamQB Apr 3, 2024
0edf586
fix: Add `--telemetry` flag to to kedro-telemetry CLI masking tests. …
lrcouto Apr 8, 2024
9b1c920
fix: Drop support for `pandas < 2.0`, `pyspark < 3.0` on `kedro-datas…
lrcouto Apr 10, 2024
e456b69
feat(datasets): add dataset to load/save with Ibis (#560)
deepyaman Apr 10, 2024
fb22a25
fix(datasets): Restore support for pandas 1 and PySpark 2 (#646)
astrojuanlu Apr 10, 2024
4d7da9e
feat: Release kedro-datasets version 3.0.0 (#644)
lrcouto Apr 10, 2024
7571403
docs: Add `experimental` directory (#635)
merelcht Apr 11, 2024
c38ecd7
docs: Add description of experimental and core datasets to contributi…
merelcht Apr 16, 2024
dc599e1
build(airflow): Add python 3.12 support (#614)
ankatiyar Apr 23, 2024
4aefa76
build(docker): Add Python 3.12 support (#616)
ankatiyar Apr 23, 2024
05aadfe
chore: Add `mypy` setup to Makefile (#638)
ankatiyar Apr 23, 2024
b8bd15f
build(telemetry): Release 0.4.0 (#648)
DimedS Apr 24, 2024
cac93e6
build(docker): Release 0.6.0 (#659)
ankatiyar Apr 24, 2024
88b627d
docs(datasets): add `ibis.TableDataset` to toctree (#657)
deepyaman Apr 25, 2024
5828a1d
chore(datasets): add `mypy` setup for datasets (#647)
ankatiyar Apr 30, 2024
850fcd7
docs(datasets): update config for Ibis 9.0 changes (#667)
deepyaman May 7, 2024
6e52420
chore: delete accidentally-committed coverage file (#665)
deepyaman May 7, 2024
67f9916
docs(datasets): Use `kedro-sphinx-theme` (#670)
ankatiyar May 8, 2024
38d6f01
build: use new `exclude_also` over `exclude_lines` (#666)
deepyaman May 9, 2024
56d3e4d
fix(airflow): Fix nodes grouping (#664)
ElenaKhaustova May 10, 2024
5905872
build(datasets): remove arbitrary bound for `s3fs` (#668)
deepyaman May 13, 2024
1e22c89
build(datasets): Make sure experimental datasets are packaged as part…
merelcht May 14, 2024
b2fd842
docs(datasets): add the extra install instructions (#669)
deepyaman May 15, 2024
41736e9
Move geotiff dataset + tests to experimental
merelcht May 17, 2024
90d2295
Fix pyproj.toml
merelcht May 17, 2024
27d52dd
Move experimental dataset to bottom of docs rst
merelcht May 17, 2024
d89f764
Separate experimental API docs
merelcht May 17, 2024
c98fd71
Delete accidental __init__ file
merelcht May 21, 2024
89867d9
build(docker): Unpin pip for `kedro-docker` (#680)
merelcht May 17, 2024
4ce267b
docs(datasets): Add setup for building API docs for experimental data…
merelcht May 21, 2024
82ddeb8
Fix lint
merelcht May 21, 2024
88a77d2
feat(datasets): `NetCDFDataset` support for `engine="h5netcdf"` [#620…
charlesbmi May 22, 2024
43f3932
docs(telemetry): Update badges (#690)
noklam May 23, 2024
cf8291d
docs(airflow): update badge (#693)
noklam May 23, 2024
15451dc
build(datasets):Make sure experimental dependencies are installed for…
merelcht May 23, 2024
823e63b
renaming to rioxarray and RasterDataset WIP
tgoelles May 24, 2024
5989f44
renaming to rioxarray and raster_dataset
tgoelles May 24, 2024
1fdf5ac
renaming WIP
tgoelles May 24, 2024
2c0a504
better docstring
tgoelles May 24, 2024
8c53ed6
rewrite tests WIP
tgoelles May 29, 2024
12b16f6
check for existing CRS on load
tgoelles May 29, 2024
f1823c5
testing with geotiff only WIP
tgoelles Jun 3, 2024
e9c16d9
Add sanity checks for data dimensions and coordinate reference system
tgoelles Jun 3, 2024
b0fbc08
multiband support WIP
tgoelles Jun 4, 2024
86599d3
better tests WIP
tgoelles Jun 4, 2024
8e53b4e
test no crs
tgoelles Jun 4, 2024
d2f6685
test no band
tgoelles Jun 4, 2024
321aa97
better no data handling
tgoelles Jun 4, 2024
5330eed
checking for tif file format
tgoelles Jun 4, 2024
6df47a9
renamed to geotiff
tgoelles Jun 4, 2024
974ab1a
better docstring and deleted clutter
tgoelles Jun 4, 2024
b81c694
formatted and docstring update
tgoelles Jun 4, 2024
156b2f4
build(datasets): Update badges for kedro-datasets (#689)
noklam May 23, 2024
8ad4870
fix(airflow): fix the link to correct deployment manual (#697)
DimedS May 24, 2024
c90fa6c
build(airflow): Release 0.9.0 (#698)
SajidAlamQB May 28, 2024
2560ebf
build(datasets): Release kedro-datasets 3.0.1 (#704)
SajidAlamQB May 29, 2024
cc856f3
feat(datasets): Add limited `langchain` support for Anthropic, Cohere…
ianwhale Jun 3, 2024
e193107
docs(datasets): Add langchain datasets to API docs (#711)
merelcht Jun 3, 2024
41455ec
fix(telemetry): Single project identifier (#701)
ElenaKhaustova Jun 4, 2024
07cb647
fixed more linting issues
tgoelles Jun 5, 2024
22073f7
fixed naming
tgoelles Jun 5, 2024
2d80027
end of line fix
tgoelles Jun 5, 2024
5959ef4
added support for tags
tgoelles Jun 6, 2024
649a76e
renamed to GeoTIFF
tgoelles Jun 6, 2024
04aeb91
fixed confusing docstring
tgoelles Jun 6, 2024
a4c00c7
deleted unused import
tgoelles Jun 6, 2024
c898af8
Merge branch 'main' into rioxarray
tgoelles Jun 6, 2024
889ec71
fixed linting issues
tgoelles Jun 6, 2024
985ed6e
added rioxarray requirements
tgoelles Jun 7, 2024
c48a7ac
run black
tgoelles Jun 7, 2024
47f5b01
Merge branch 'main' into rioxarray
ankatiyar Jun 10, 2024
3fd61e5
Update release notes properly
ankatiyar Jun 11, 2024
a0a5047
Merge branch 'main' into rioxarray
ankatiyar Jun 11, 2024
e170146
Merge branch 'main' into rioxarray
merelcht Jun 28, 2024
cd9ceb4
Merge branch 'main' into rioxarray
astrojuanlu Jun 28, 2024
959723c
Use get_filepath_str instead
noklam Jul 2, 2024
8f3c540
Revert "Use get_filepath_str instead"
noklam Jul 2, 2024
1e5f6a5
Empty
astrojuanlu Jul 4, 2024
6f34def
Merge branch 'main' into rioxarray
noklam Jul 5, 2024
bd08ab2
Merge branch 'main' into rioxarray
noklam Jul 5, 2024
6 changes: 6 additions & 0 deletions kedro-datasets/RELEASE.md
@@ -9,10 +9,13 @@
| `langchain.ChatCohereDataset` | A dataset for loading a ChatCohere langchain model. | `kedro_datasets_experimental.langchain` |
| `langchain.OpenAIEmbeddingsDataset` | A dataset for loading a OpenAIEmbeddings langchain model. | `kedro_datasets_experimental.langchain` |
| `langchain.ChatOpenAIDataset` | A dataset for loading a ChatOpenAI langchain model. | `kedro_datasets_experimental.langchain` |
| `rioxarray.GeoTIFFDataset` | A dataset for loading and saving GeoTIFF raster data. | `kedro_datasets_experimental.rioxarray` |
| `netcdf.NetCDFDataset` | A dataset for loading and saving "*.nc" files. | `kedro_datasets_experimental.netcdf` |

* `netcdf.NetCDFDataset` moved from `kedro_datasets` to `kedro_datasets_experimental`.

* Added the following new core datasets:

| Type | Description | Location |
|-------------------------------------|-----------------------------------------------------------|-----------------------------------------|
| `dask.CSVDataset` | A dataset for loading CSV files using `dask`. | `kedro_datasets.dask` |
@@ -22,6 +25,9 @@
## Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:
* [Ian Whalen](https://github.com/ianwhale)
* [Charles Guan](https://github.com/charlesbmi)
* [Thomas Gölles](https://github.com/tgoelles)
* [Lukas Innig](https://github.com/derluke)
* [Michael Sexton](https://github.com/michaelsexton)

@@ -16,3 +16,4 @@ kedro_datasets_experimental
kedro_datasets_experimental.langchain.ChatOpenAIDataset
kedro_datasets_experimental.langchain.OpenAIEmbeddingsDataset
kedro_datasets_experimental.netcdf.NetCDFDataset
kedro_datasets_experimental.rioxarray.GeoTIFFDataset
13 changes: 13 additions & 0 deletions kedro-datasets/kedro_datasets_experimental/rioxarray/__init__.py
@@ -0,0 +1,13 @@
"""``AbstractDataset`` implementation to load/save data from/to geospatial raster files."""
from __future__ import annotations

from typing import Any

import lazy_loader as lazy

# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
GeoTIFFDataset: Any

__getattr__, __dir__, __all__ = lazy.attach(
    __name__, submod_attrs={"geotiff_dataset": ["GeoTIFFDataset"]}
)
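
The `lazy.attach` call above defers importing `geotiff_dataset` (and with it rasterio, rioxarray and xarray) until `GeoTIFFDataset` is first accessed. The following is a minimal sketch of the underlying mechanism, a module-level `__getattr__` as described in PEP 562. It is not `lazy_loader`'s actual implementation; `attach_lazy` and `demo_pkg` are hypothetical names used only for illustration.

```python
import importlib
import types


def attach_lazy(module: types.ModuleType, attr_to_module: dict) -> None:
    """Install a PEP 562 ``__getattr__`` on ``module`` (hypothetical helper).

    ``attr_to_module`` maps an attribute name to the module that defines it.
    The defining module is imported only when the attribute is first requested.
    """

    def _lazy_getattr(name: str):
        if name not in attr_to_module:
            raise AttributeError(
                f"module {module.__name__!r} has no attribute {name!r}"
            )
        value = getattr(importlib.import_module(attr_to_module[name]), name)
        # Cache the result so later lookups bypass __getattr__ entirely.
        setattr(module, name, value)
        return value

    module.__getattr__ = _lazy_getattr


# A throwaway module object stands in for a real package here.
demo = types.ModuleType("demo_pkg")
attach_lazy(demo, {"OrderedDict": "collections"})
```

The practical effect for this PR is that importing `kedro_datasets_experimental.rioxarray` stays cheap, and a missing optional dependency only surfaces when the dataset class is actually used.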
@@ -0,0 +1,209 @@
"""GeoTIFFDataset loads geospatial raster data and saves it to a local GeoTIFF file. The
underlying functionality is supported by rioxarray and xarray. Loading a raster data file
returns an xarray.DataArray object.
"""
import logging
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any

import fsspec
import rasterio
import rioxarray as rxr
import xarray
from kedro.io import AbstractVersionedDataset, DatasetError
from kedro.io.core import Version, get_filepath_str, get_protocol_and_path
from rasterio.crs import CRS
from rasterio.transform import from_bounds

logger = logging.getLogger(__name__)

SUPPORTED_DIMS = [("band", "x", "y"), ("x", "y")]
DEFAULT_NO_DATA_VALUE = -9999
SUPPORTED_FILE_FORMATS = [".tif", ".tiff"]


class GeoTIFFDataset(AbstractVersionedDataset[xarray.DataArray, xarray.DataArray]):
    """``GeoTIFFDataset`` loads and saves raster data files and reads them as xarray
    DataArrays. The underlying functionality is supported by rioxarray, rasterio and xarray.

    Reading and writing of single-band and multiband GeoTIFF data is supported. Sanity
    checks ensure that a coordinate reference system (CRS) is present.
    Supported dimensions are ("band", "x", "y") and ("x", "y"); an xarray.DataArray with
    other dimensions cannot be saved to a GeoTIFF file.
    Have a look at ``netcdf.NetCDFDataset`` if that is what you need.

    Example usage for the YAML API:

    .. code-block:: yaml

        sentinel_data:
          type: rioxarray.GeoTIFFDataset
          filepath: sentinel_data.tif

    Example usage for the
    `Python API <https://kedro.readthedocs.io/en/stable/data/\
    advanced_data_catalog_usage.html>`_:

    .. code-block:: pycon

        >>> from kedro_datasets_experimental.rioxarray import GeoTIFFDataset
        >>> import xarray as xr
        >>> import numpy as np
        >>>
        >>> data = xr.DataArray(
        ...     np.random.randn(2, 3, 2),
        ...     dims=("band", "y", "x"),
        ...     coords={"band": [1, 2], "y": [0.5, 1.5, 2.5], "x": [0.5, 1.5]}
        ... )
        >>> data_crs = data.rio.write_crs("epsg:4326")
        >>> data_spatial_dims = data_crs.rio.set_spatial_dims("x", "y")
        >>> dataset = GeoTIFFDataset(filepath="test.tif")
        >>> dataset.save(data_spatial_dims)
        >>> reloaded = dataset.load()
        >>> xr.testing.assert_allclose(data_spatial_dims, reloaded, rtol=1e-5)

    """

    DEFAULT_LOAD_ARGS: dict[str, Any] = {}
    DEFAULT_SAVE_ARGS: dict[str, Any] = {}

    def __init__(  # noqa: PLR0913
        self,
        *,
        filepath: str,
        load_args: dict[str, Any] | None = None,
        save_args: dict[str, Any] | None = None,
        version: Version | None = None,
        metadata: dict[str, Any] | None = None,
    ):
        """Creates a new instance of ``GeoTIFFDataset`` pointing to a concrete
        geospatial raster data file.


        Args:
            filepath: Filepath in POSIX format to a raster data file.
                The prefix should be any protocol supported by ``fsspec``.
            load_args: rioxarray options for loading raster data files.
                Here you can find all available arguments:
                https://corteva.github.io/rioxarray/html/rioxarray.html#rioxarray-open-rasterio
                All defaults are preserved.
            save_args: Options passed to rioxarray for data without the band dimension,
                and to rasterio otherwise.
            version: If specified, should be an instance of
                ``kedro.io.core.Version``. If its ``load`` attribute is
                None, the latest version will be loaded. If its ``save``
                attribute is None, save version will be autogenerated.
            metadata: Any arbitrary metadata.
                This is ignored by Kedro, but may be consumed by users or external plugins.
        """
        protocol, path = get_protocol_and_path(filepath, version)
        self._protocol = protocol
        self._fs = fsspec.filesystem(self._protocol)
        self.metadata = metadata

        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=self._fs.glob,
        )

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

    def _describe(self) -> dict[str, Any]:
        return {
            "filepath": self._filepath,
            "protocol": self._protocol,
            "load_args": self._load_args,
            "save_args": self._save_args,
            "version": self._version,
        }

    def _load(self) -> xarray.DataArray:
        load_path = self._get_load_path().as_posix()
        with rasterio.open(load_path) as data:
            tags = data.tags()
        data = rxr.open_rasterio(load_path, **self._load_args)
Comment on lines +126 to +129
Member

I tried using the dataset with a remote path and it didn't work:

In [17]: ds = GeoTIFFDataset(filepath="https://download.osgeo.org/geotiff/samples/GeogToWGS84GeoKey/GeogToWGS84GeoKey5.tif")

In [18]: ds._load()
---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
File rasterio/_base.pyx:310, in rasterio._base.DatasetBase.__init__()

File rasterio/_base.pyx:221, in rasterio._base.open_dataset()

File rasterio/_err.pyx:221, in rasterio._err.exc_wrap_pointer()

CPLE_OpenFailedError: download.osgeo.org/geotiff/samples/GeogToWGS84GeoKey/GeogToWGS84GeoKey5.tif: No such file or directory

During handling of the above exception, another exception occurred:

RasterioIOError                           Traceback (most recent call last)
Cell In[18], line 1
----> 1 ds._load()

File ~/Projects/QuantumBlackLabs/Kedro/kedro-plugins/kedro-datasets/kedro_datasets_experimental/rioxarray/geotiff_dataset.py:127, in GeoTIFFDataset._load(self)
    125 def _load(self) -> xarray.DataArray:
    126     load_path = self._get_load_path().as_posix()
--> 127     with rasterio.open(load_path) as data:
    128         tags = data.tags()
    129     data = rxr.open_rasterio(load_path, **self._load_args)

File ~/Projects/QuantumBlackLabs/Kedro/kedro/.venv/lib/python3.11/site-packages/rasterio/env.py:451, in ensure_env_with_credentials.<locals>.wrapper(*args, **kwds)
    448     session = DummySession()
    450 with env_ctor(session=session):
--> 451     return f(*args, **kwds)

File ~/Projects/QuantumBlackLabs/Kedro/kedro/.venv/lib/python3.11/site-packages/rasterio/__init__.py:304, in open(fp, mode, driver, width, height, count, crs, transform, dtype, nodata, sharing, **kwargs)
    301 path = _parse_path(raw_dataset_path)
    303 if mode == "r":
--> 304     dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
    305 elif mode == "r+":
    306     dataset = get_writer_for_path(path, driver=driver)(
    307         path, mode, driver=driver, sharing=sharing, **kwargs
    308     )

File rasterio/_base.pyx:312, in rasterio._base.DatasetBase.__init__()

RasterioIOError: download.osgeo.org/geotiff/samples/GeogToWGS84GeoKey/GeogToWGS84GeoKey5.tif: No such file or directory

But rasterio knows how to deal with remote paths:

In [15]: with rasterio.open("https://download.osgeo.org/geotiff/samples/GeogToWGS84GeoKey/GeogToWGS84GeoKey5.tif") as data:
    ...:     tags = data.tags()
    ...: 

In [16]: tags
Out[16]: 
{'AREA_OR_POINT': 'Area',
 'TIFFTAG_ARTIST': '',
 'TIFFTAG_DATETIME': '2008:03:01 10:28:18',
 'TIFFTAG_RESOLUTIONUNIT': '2 (pixels/inch)',
 'TIFFTAG_SOFTWARE': 'Paint Shop Pro 8.0',
 'TIFFTAG_XRESOLUTION': '300',
 'TIFFTAG_YRESOLUTION': '300'}

so maybe there's something wrong in how the load path is handled here.

Contributor

@astrojuanlu Use get_filepath_str instead

The reason is that self._get_load_path doesn't handle protocols correctly. I suggest we merge this now and handle this separately as I think it probably affect more than 1 dataset
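
To make the failure above concrete, here is a minimal sketch of what appears to happen to an HTTP(S) URL: the protocol is split off when the dataset is constructed, and a plain `.as_posix()` on the stored path never puts it back, which matches the bare `download.osgeo.org/...` path in the traceback. `split_protocol` and `filepath_str` are hypothetical stand-ins written for this illustration; they approximate, but are not, Kedro's `get_protocol_and_path` and `get_filepath_str`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

HTTP_PROTOCOLS = ("http", "https")


def split_protocol(filepath: str) -> tuple:
    """Hypothetical stand-in: split a location into (protocol, path)."""
    parts = urlsplit(filepath)
    if parts.scheme in HTTP_PROTOCOLS:
        # For HTTP(S) the host becomes part of the stored path.
        return parts.scheme, parts.netloc + parts.path
    return "file", filepath


def filepath_str(path: PurePosixPath, protocol: str) -> str:
    """Hypothetical stand-in: re-attach the protocol for HTTP(S) paths."""
    text = path.as_posix()
    if protocol in HTTP_PROTOCOLS:
        return f"{protocol}://{text}"
    return text


url = "https://download.osgeo.org/geotiff/samples/GeogToWGS84GeoKey/GeogToWGS84GeoKey5.tif"
protocol, path = split_protocol(url)

# What _load passes on today: the protocol is gone, so the reader
# treats the string as a local path.
bare = PurePosixPath(path).as_posix()

# What a protocol-aware helper would pass on instead.
restored = filepath_str(PurePosixPath(path), protocol)
```

Under these assumptions `bare` is the host-prefixed path without a scheme, while `restored` is the original URL again, which is the form `rasterio.open` handled successfully in the session above.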

Member

> I suggest we merge this now and handle this separately as I think it probably affect more than 1 dataset

I agree!

        data.attrs.update(tags)
        self._sanity_check(data)
        logger.info(f"found coordinate reference system {data.rio.crs}")
        return data

    def _save(self, data: xarray.DataArray) -> None:
        self._sanity_check(data)
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        if not save_path.endswith(tuple(SUPPORTED_FILE_FORMATS)):
            raise ValueError(
                f"Unsupported file format. Supported formats are: {SUPPORTED_FILE_FORMATS}"
            )
        if "band" in data.dims:
            self._save_multiband(data, save_path)
        else:
            data.rio.to_raster(save_path, **self._save_args)
        self._fs.invalidate_cache(save_path)
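
A note on the extension check in `_save`: `str.endswith` accepts a tuple of suffixes and compares case-sensitively, so a path ending in `.TIF` would be rejected. The sketch below isolates that check; `has_supported_suffix_ci` is a hypothetical case-insensitive variant shown only for comparison and is not part of this PR.

```python
SUPPORTED_FILE_FORMATS = [".tif", ".tiff"]


def has_supported_suffix(path: str) -> bool:
    """Same test as in ``_save``: case-sensitive suffix match."""
    return path.endswith(tuple(SUPPORTED_FILE_FORMATS))


def has_supported_suffix_ci(path: str) -> bool:
    """Hypothetical variant that lower-cases the path before matching."""
    return path.lower().endswith(tuple(SUPPORTED_FILE_FORMATS))
```

Whether upper-case extensions should be accepted is a design choice for the maintainers; the sketch only shows what the current check does.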

    def _exists(self) -> bool:
        try:
            load_path = get_filepath_str(self._get_load_path(), self._protocol)
        except DatasetError:
            return False

        return self._fs.exists(load_path)

    def _release(self) -> None:
        super()._release()
        self._invalidate_cache()

    def _invalidate_cache(self) -> None:
        """Invalidate underlying filesystem caches."""
        filepath = get_filepath_str(self._filepath, self._protocol)
        self._fs.invalidate_cache(filepath)

    def _save_multiband(self, data: xarray.DataArray, save_path: str):
        """Save multiband raster data to a GeoTIFF file."""
        bands_data = [data.sel(band=band) for band in data.band.values]
        transform = from_bounds(
            west=data.x.min(),
            south=data.y.min(),
            east=data.x.max(),
            north=data.y.max(),
            width=data[0].shape[1],
            height=data[0].shape[0],
        )

        nodata_value = (
            data.rio.nodata if data.rio.nodata is not None else DEFAULT_NO_DATA_VALUE
        )
        crs = data.rio.crs

        meta = {
            "driver": "GTiff",
            "height": bands_data[0].shape[0],
            "width": bands_data[0].shape[1],
            "count": len(bands_data),
            "dtype": str(bands_data[0].dtype),
            "crs": crs,
            "transform": transform,
            "nodata": nodata_value,
        }
        with rasterio.open(save_path, "w", **meta) as dst:
            for idx, band in enumerate(bands_data, start=1):
                dst.write(band.data, idx, **self._save_args)
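
For readers unfamiliar with `from_bounds`, the sketch below reproduces its arithmetic in plain Python: the bounding box and raster size give a per-pixel resolution, and the resulting affine coefficients map a (column, row) index to an (x, y) coordinate with the origin at the north-west corner. `from_bounds_coefficients` and `pixel_to_xy` are hypothetical helpers for illustration; the real function returns an `Affine` object.

```python
def from_bounds_coefficients(west, south, east, north, width, height):
    """Return affine coefficients (a, b, c, d, e, f) for a north-up raster.

    x = a * col + b * row + c
    y = d * col + e * row + f
    """
    x_res = (east - west) / width
    y_res = (north - south) / height
    # Rows count downwards from the northern edge, hence the negative y step.
    return (x_res, 0.0, west, 0.0, -y_res, north)


def pixel_to_xy(coeffs, col, row):
    """Apply the affine coefficients to a (col, row) pixel corner."""
    a, b, c, d, e, f = coeffs
    return (a * col + b * row + c, d * col + e * row + f)


# A 2 x 3 pixel raster covering x in [0, 2] and y in [0, 3].
coeffs = from_bounds_coefficients(
    west=0.0, south=0.0, east=2.0, north=3.0, width=2, height=3
)
top_left = pixel_to_xy(coeffs, 0, 0)
bottom_right = pixel_to_xy(coeffs, 2, 3)
```

One point that may be worth verifying separately: the bounds are meant to be the outer edges of the raster, whereas `data.x` and `data.y` usually hold cell-centre coordinates, so bounds taken from their minimum and maximum could be narrower than the true extent by half a cell on each side. This is an observation from the arithmetic, not a confirmed defect.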

    def _sanity_check(self, data: xarray.DataArray) -> None:
        """Perform sanity checks on the data to ensure it meets the requirements."""
        if not isinstance(data, xarray.DataArray):
            raise NotImplementedError(
                "Currently only supporting xarray.DataArray while saving raster data."
            )

        if not isinstance(data.rio.crs, CRS):
            raise ValueError("Dataset lacks a coordinate reference system.")

        if all(set(data.dims) != set(dims) for dims in SUPPORTED_DIMS):
            raise ValueError(
                f"Data has unsupported dimensions: {data.dims}. Supported dimensions are: {SUPPORTED_DIMS}"
            )
Empty file.
Binary file not shown.