perf(datasets): lazily load datasets in init files (#277)

* perf(datasets): lazily load datasets in init files (api)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (pandas)
  Signed-off-by: Deepyaman Datta <[email protected]>
* fix(datasets): fix no name in module in api/pandas
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (biosequence)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (dask)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (databricks)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (email)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (geopandas)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (holoviews)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (json)
  Signed-off-by: Deepyaman Datta <[email protected]>
* fix(datasets): resolve "too few public attributes"
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (matplotlib)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (networkx)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (pickle)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (pillow)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (plotly)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (polars)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (redis)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (snowflake)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (spark)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (svmlight)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (tensorflow)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (text)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (tracking)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (video)
  Signed-off-by: Deepyaman Datta <[email protected]>
* perf(datasets): lazily load datasets in init files (yaml)
  Signed-off-by: Deepyaman Datta <[email protected]>
* Update RELEASE.md

---------

Signed-off-by: Deepyaman Datta <[email protected]>
kedro-org · Jul 31, 2023 · 3aad425
1 parent fd4b2be
commit 3aad425
Showing 28 changed files with 240 additions and 169 deletions.
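
The `RELEASE.md` entries in this commit describe two effects of the change: a dataset submodule is imported only when one of its classes is first accessed, and a missing dependency now surfaces as `ModuleNotFoundError` instead of being swallowed. The sketch below illustrates that behaviour with a hand-written module-level `__getattr__` (PEP 562) rather than `lazy_loader` itself, so it runs with the standard library alone. The package name `lazy_demo_pkg`, the dependency name `missing_sql_toolkit_xyz`, and the stub classes are hypothetical and exist only for this illustration; the real `lazy.attach` implementation may differ in detail.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical __init__.py: a hand-rolled stand-in for what the diff obtains
# from lazy.attach(__name__, submod_attrs={...}).
INIT = '''
import importlib

_SUBMOD_ATTRS = {
    "csv_dataset": ["CSVDataSet"],
    "sql_dataset": ["SQLTableDataSet"],
}
_ATTR_TO_SUBMOD = {
    attr: submod for submod, attrs in _SUBMOD_ATTRS.items() for attr in attrs
}
__all__ = sorted(_ATTR_TO_SUBMOD)


def __getattr__(name):
    # PEP 562: only called when normal attribute lookup on the module fails.
    if name in _ATTR_TO_SUBMOD:
        submod = importlib.import_module(f"{__name__}.{_ATTR_TO_SUBMOD[name]}")
        return getattr(submod, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")


def __dir__():
    return __all__
'''

# Write a throwaway package to a temporary directory.
tmp_root = Path(tempfile.mkdtemp())
pkg_dir = tmp_root / "lazy_demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text(INIT)
(pkg_dir / "csv_dataset.py").write_text("class CSVDataSet:\n    pass\n")
# This submodule depends on a package that is deliberately not installed.
(pkg_dir / "sql_dataset.py").write_text(
    "import missing_sql_toolkit_xyz\n\n\nclass SQLTableDataSet:\n    pass\n"
)

sys.path.insert(0, str(tmp_root))
importlib.invalidate_caches()

pkg = importlib.import_module("lazy_demo_pkg")

# Importing the package alone loads no dataset submodule.
loaded_before = "lazy_demo_pkg.csv_dataset" in sys.modules

# First attribute access triggers the import of csv_dataset only.
csv_cls = pkg.CSVDataSet
loaded_after = "lazy_demo_pkg.csv_dataset" in sys.modules
sql_loaded_early = "lazy_demo_pkg.sql_dataset" in sys.modules

# Accessing the class with the missing dependency raises the underlying error.
try:
    pkg.SQLTableDataSet
except ModuleNotFoundError as exc:
    error_message = str(exc)
```

With the previous `with suppress(ImportError)` pattern, the failed submodule import was silently discarded at package import time, which is why the old error only reported that the name could not be imported from the package.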
diff --git a/kedro-datasets/RELEASE.md b/kedro-datasets/RELEASE.md
@@ -1,9 +1,13 @@
 # Upcoming Release:
 
 ## Major features and improvements
-* Added automatic inference of file format for `pillow.ImageDataSet` to be passed to `save()`
+* Implemented lazy loading of dataset subpackages and classes.
+    * Suppose that SQLAlchemy, a Python SQL toolkit, is installed in your Python environment. With this change, the SQLAlchemy library will not be loaded (for `pandas.SQLQueryDataSet` or `pandas.SQLTableDataSet`) if you load a different pandas dataset (e.g. `pandas.CSVDataSet`).
+* Added automatic inference of file format for `pillow.ImageDataSet` to be passed to `save()`.
 
 ## Bug fixes and other changes
+* Improved error messages for missing dataset dependencies.
+    * Suppose that SQLAlchemy, a Python SQL toolkit, is not installed in your Python environment. Previously, `from kedro_datasets.pandas import SQLQueryDataSet` or `from kedro_datasets.pandas import SQLTableDataSet` would result in `ImportError: cannot import name 'SQLTableDataSet' from 'kedro_datasets.pandas'`. Now, the same imports raise the more helpful and intuitive `ModuleNotFoundError: No module named 'sqlalchemy'`.
 
 ## Community contributions
 Many thanks to the following Kedroids for contributing PRs to this release:
@@ -12,7 +16,7 @@ Many thanks to the following Kedroids for contributing PRs to this release:
 
 # Release 1.4.2
 ## Bug fixes and other changes
-* Fixed documentations of `GeoJSONDataSet` and `SparkStreamingDataSet`
+* Fixed documentations of `GeoJSONDataSet` and `SparkStreamingDataSet`.
 * Fixed problematic docstrings causing Read the Docs builds on Kedro to fail.
 
 # Release 1.4.1:
@@ -32,16 +36,16 @@ Many thanks to the following Kedroids for contributing PRs to this release:
 ## Major features and improvements
 * Added pandas 2.0 support.
 * Added SQLAlchemy 2.0 support (and dropped support for versions below 1.4).
-* Added a save method to the APIDataSet
+* Added a save method to `APIDataSet`.
 * Reduced constructor arguments for `APIDataSet` by replacing most arguments with a single constructor argument `load_args`. This makes it more consistent with other Kedro DataSets and the underlying `requests` API, and automatically enables the full configuration domain: stream, certificates, proxies, and more.
-* Relaxed Kedro version pin to `>=0.16`
+* Relaxed Kedro version pin to `>=0.16`.
 * Added `metadata` attribute to all existing datasets. This is ignored by Kedro, but may be consumed by users or external plugins.
 * Added `ManagedTableDataSet` for managed delta tables on Databricks.
 
 ## Bug fixes and other changes
 * Relaxed `delta-spark` upper bound to allow compatibility with Spark 3.1.x and 3.2.x.
 * Upgraded required `polars` version to 0.17.
-* Renamed `TensorFlowModelDataset` to `TensorFlowModelDataSet` to be consistent with all other plugins in kedro-datasets.
+* Renamed `TensorFlowModelDataset` to `TensorFlowModelDataSet` to be consistent with all other plugins in Kedro-Datasets.
 
 ## Community contributions
 Many thanks to the following Kedroids for contributing PRs to this release:
@@ -102,11 +106,11 @@ Datasets are Kedro’s way of dealing with input and output in a data and machin
 The datasets have always been part of the core Kedro Framework project inside `kedro.extras`. In Kedro `0.19.0`, we will remove datasets from Kedro to reduce breaking changes associated with dataset dependencies. Instead, users will need to use the datasets from the `kedro-datasets` repository instead.
 
 ## Major features and improvements
-* Changed `pandas.ParquetDataSet` to load data using pandas instead of parquet
+* Changed `pandas.ParquetDataSet` to load data using pandas instead of parquet.
 
 # Release 0.1.0:
 
-The initial release of `kedro-datasets`.
+The initial release of Kedro-Datasets.
 
 ## Thanks to our main contributors
 

diff --git a/kedro-datasets/kedro_datasets/api/__init__.py b/kedro-datasets/kedro_datasets/api/__init__.py
@@ -2,10 +2,13 @@
 and returns them into either as string or json Dict.
 It uses the python requests library: https://requests.readthedocs.io/en/latest/
 """
+from typing import Any
 
-__all__ = ["APIDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+APIDataSet: Any
 
-with suppress(ImportError):
-    from .api_dataset import APIDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"api_dataset": ["APIDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/biosequence/__init__.py b/kedro-datasets/kedro_datasets/biosequence/__init__.py
@@ -1,8 +1,11 @@
 """``AbstractDataSet`` implementation to read/write from/to a sequence file."""
+from typing import Any
 
-__all__ = ["BioSequenceDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+BioSequenceDataSet: Any
 
-with suppress(ImportError):
-    from .biosequence_dataset import BioSequenceDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"biosequence_dataset": ["BioSequenceDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/dask/__init__.py b/kedro-datasets/kedro_datasets/dask/__init__.py
@@ -1,8 +1,11 @@
 """Provides I/O modules using dask dataframe."""
+from typing import Any
 
-__all__ = ["ParquetDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+ParquetDataSet: Any
 
-with suppress(ImportError):
-    from .parquet_dataset import ParquetDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"parquet_dataset": ["ParquetDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/databricks/__init__.py b/kedro-datasets/kedro_datasets/databricks/__init__.py
@@ -1,8 +1,11 @@
 """Provides interface to Unity Catalog Tables."""
+from typing import Any
 
-__all__ = ["ManagedTableDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+ManagedTableDataSet: Any
 
-with suppress(ImportError):
-    from .managed_table_dataset import ManagedTableDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"managed_table_dataset": ["ManagedTableDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/email/__init__.py b/kedro-datasets/kedro_datasets/email/__init__.py
@@ -1,8 +1,11 @@
 """``AbstractDataSet`` implementations for managing email messages."""
+from typing import Any
 
-__all__ = ["EmailMessageDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+EmailMessageDataSet: Any
 
-with suppress(ImportError):
-    from .message_dataset import EmailMessageDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"message_dataset": ["EmailMessageDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/geopandas/__init__.py b/kedro-datasets/kedro_datasets/geopandas/__init__.py
@@ -1,8 +1,11 @@
-"""``GeoJSONDataSet`` is an ``AbstractVersionedDataSet`` to save and load GeoJSON files.
-"""
-__all__ = ["GeoJSONDataSet"]
+"""``GeoJSONDataSet`` is an ``AbstractVersionedDataSet`` to save and load GeoJSON files."""
+from typing import Any
 
-from contextlib import suppress
+import lazy_loader as lazy
 
-with suppress(ImportError):
-    from .geojson_dataset import GeoJSONDataSet
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+GeoJSONDataSet: Any
+
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"geojson_dataset": ["GeoJSONDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/holoviews/__init__.py b/kedro-datasets/kedro_datasets/holoviews/__init__.py
@@ -1,8 +1,11 @@
 """``AbstractDataSet`` implementation to save Holoviews objects as image files."""
+from typing import Any
 
-__all__ = ["HoloviewsWriter"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+HoloviewsWriter: Any
 
-with suppress(ImportError):
-    from .holoviews_writer import HoloviewsWriter
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"holoviews_writer": ["HoloviewsWriter"]}
+)
diff --git a/kedro-datasets/kedro_datasets/json/__init__.py b/kedro-datasets/kedro_datasets/json/__init__.py
@@ -1,8 +1,11 @@
 """``AbstractDataSet`` implementation to load/save data from/to a JSON file."""
+from typing import Any
 
-__all__ = ["JSONDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+JSONDataSet: Any
 
-with suppress(ImportError):
-    from .json_dataset import JSONDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"json_dataset": ["JSONDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/matplotlib/__init__.py b/kedro-datasets/kedro_datasets/matplotlib/__init__.py
@@ -1,8 +1,10 @@
 """``AbstractDataSet`` implementation to save matplotlib objects as image files."""
+from typing import Any
 
-__all__ = ["MatplotlibWriter"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+MatplotlibWriter: Any
 
-with suppress(ImportError):
-    from .matplotlib_writer import MatplotlibWriter
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"matplotlib_writer": ["MatplotlibWriter"]}
+)
diff --git a/kedro-datasets/kedro_datasets/networkx/__init__.py b/kedro-datasets/kedro_datasets/networkx/__init__.py
@@ -1,15 +1,19 @@
-"""``AbstractDataSet`` implementation to save and load NetworkX graphs in JSON
-, GraphML and GML formats using ``NetworkX``."""
+"""``AbstractDataSet`` implementation to save and load NetworkX graphs in JSON,
+GraphML and GML formats using ``NetworkX``."""
+from typing import Any
 
-__all__ = ["GMLDataSet", "GraphMLDataSet", "JSONDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+GMLDataSet: Any
+GraphMLDataSet: Any
+JSONDataSet: Any
 
-with suppress(ImportError):
-    from .gml_dataset import GMLDataSet
-
-with suppress(ImportError):
-    from .graphml_dataset import GraphMLDataSet
-
-with suppress(ImportError):
-    from .json_dataset import JSONDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__,
+    submod_attrs={
+        "gml_dataset": ["GMLDataSet"],
+        "graphml_dataset": ["GraphMLDataSet"],
+        "json_dataset": ["JSONDataSet"],
+    },
+)
diff --git a/kedro-datasets/kedro_datasets/pandas/__init__.py b/kedro-datasets/kedro_datasets/pandas/__init__.py
@@ -1,42 +1,36 @@
 """``AbstractDataSet`` implementations that produce pandas DataFrames."""
+from typing import Any
 
-__all__ = [
-    "CSVDataSet",
-    "DeltaTableDataSet",
-    "ExcelDataSet",
-    "FeatherDataSet",
-    "GBQTableDataSet",
-    "GBQQueryDataSet",
-    "HDFDataSet",
-    "JSONDataSet",
-    "ParquetDataSet",
-    "SQLQueryDataSet",
-    "SQLTableDataSet",
-    "XMLDataSet",
-    "GenericDataSet",
-]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+CSVDataSet: Any
+DeltaTableDataSet: Any
+ExcelDataSet: Any
+FeatherDataSet: Any
+GBQQueryDataSet: Any
+GBQTableDataSet: Any
+GenericDataSet: Any
+HDFDataSet: Any
+JSONDataSet: Any
+ParquetDataSet: Any
+SQLQueryDataSet: Any
+SQLTableDataSet: Any
+XMLDataSet: Any
 
-with suppress(ImportError):
-    from .csv_dataset import CSVDataSet
-with suppress(ImportError):
-    from .deltatable_dataset import DeltaTableDataSet
-with suppress(ImportError):
-    from .excel_dataset import ExcelDataSet
-with suppress(ImportError):
-    from .feather_dataset import FeatherDataSet
-with suppress(ImportError):
-    from .gbq_dataset import GBQQueryDataSet, GBQTableDataSet
-with suppress(ImportError):
-    from .hdf_dataset import HDFDataSet
-with suppress(ImportError):
-    from .json_dataset import JSONDataSet
-with suppress(ImportError):
-    from .parquet_dataset import ParquetDataSet
-with suppress(ImportError):
-    from .sql_dataset import SQLQueryDataSet, SQLTableDataSet
-with suppress(ImportError):
-    from .xml_dataset import XMLDataSet
-with suppress(ImportError):
-    from .generic_dataset import GenericDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__,
+    submod_attrs={
+        "csv_dataset": ["CSVDataSet"],
+        "deltatable_dataset": ["DeltaTableDataSet"],
+        "excel_dataset": ["ExcelDataSet"],
+        "feather_dataset": ["FeatherDataSet"],
+        "gbq_dataset": ["GBQQueryDataSet", "GBQTableDataSet"],
+        "generic_dataset": ["GenericDataSet"],
+        "hdf_dataset": ["HDFDataSet"],
+        "json_dataset": ["JSONDataSet"],
+        "parquet_dataset": ["ParquetDataSet"],
+        "sql_dataset": ["SQLQueryDataSet", "SQLTableDataSet"],
+        "xml_dataset": ["XMLDataSet"],
+    },
+)
diff --git a/kedro-datasets/kedro_datasets/pickle/__init__.py b/kedro-datasets/kedro_datasets/pickle/__init__.py
@@ -1,8 +1,11 @@
 """``AbstractDataSet`` implementation to load/save data from/to a Pickle file."""
+from typing import Any
 
-__all__ = ["PickleDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+PickleDataSet: Any
 
-with suppress(ImportError):
-    from .pickle_dataset import PickleDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"pickle_dataset": ["PickleDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/pillow/__init__.py b/kedro-datasets/kedro_datasets/pillow/__init__.py
@@ -1,8 +1,11 @@
 """``AbstractDataSet`` implementation to load/save image data."""
+from typing import Any
 
-__all__ = ["ImageDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+ImageDataSet: Any
 
-with suppress(ImportError):
-    from .image_dataset import ImageDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"image_dataset": ["ImageDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/plotly/__init__.py b/kedro-datasets/kedro_datasets/plotly/__init__.py
@@ -1,11 +1,14 @@
 """``AbstractDataSet`` implementations to load/save a plotly figure from/to a JSON
 file."""
+from typing import Any
 
-__all__ = ["PlotlyDataSet", "JSONDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+JSONDataSet: Any
+PlotlyDataSet: Any
 
-with suppress(ImportError):
-    from .plotly_dataset import PlotlyDataSet
-with suppress(ImportError):
-    from .json_dataset import JSONDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__,
+    submod_attrs={"json_dataset": ["JSONDataSet"], "plotly_dataset": ["PlotlyDataSet"]},
+)
diff --git a/kedro-datasets/kedro_datasets/polars/__init__.py b/kedro-datasets/kedro_datasets/polars/__init__.py
@@ -1,8 +1,11 @@
 """``AbstractDataSet`` implementations that produce pandas DataFrames."""
+from typing import Any
 
-__all__ = ["CSVDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+CSVDataSet: Any
 
-with suppress(ImportError):
-    from .csv_dataset import CSVDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"csv_dataset": ["CSVDataSet"]}
+)
diff --git a/kedro-datasets/kedro_datasets/redis/__init__.py b/kedro-datasets/kedro_datasets/redis/__init__.py
@@ -1,8 +1,11 @@
 """``AbstractDataSet`` implementation to load/save data from/to a redis db."""
+from typing import Any
 
-__all__ = ["PickleDataSet"]
+import lazy_loader as lazy
 
-from contextlib import suppress
+# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+PickleDataSet: Any
 
-with suppress(ImportError):
-    from .redis_dataset import PickleDataSet
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"redis_dataset": ["PickleDataSet"]}
+)