Both `pip install kedro` and `conda install -c conda-forge kedro` install the core Kedro module, which includes the CLI tool, project template, pipeline abstraction, framework, and support for configuration.
When you create a project, you then introduce additional dependencies for the tasks it performs.
When you create a new Kedro project, Kedro generates a `requirements.txt` file in the root directory of the project. The file contains the core dependencies and those related to the tools you choose to include in the project. Specifying the project's exact dependencies in a `requirements.txt` file makes it easier to run the project in the future, and avoids version conflicts downstream.
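For illustration, a freshly generated `requirements.txt` might look roughly like the sketch below; the exact packages and version specifiers depend on your Kedro version and the tools you selected when creating the project:

```text
# Illustrative only -- the generated file varies with your Kedro version and tool choices
kedro~=0.19.0
ipython>=8.10      # included if you chose notebook/IPython tooling
pytest~=7.2        # included if you chose the testing tools
ruff~=0.1.8        # included if you chose the linting tools
```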
When someone clones your project, they can install the project-specific dependencies by navigating to the root directory of the project and running the following command:

```bash
pip install -r requirements.txt
```
The Data Catalog is your way of interacting with different data types in Kedro. You can use `kedro-datasets` to interact with the data used in your projects. Depending on the datasets that you use in your Data Catalog, you might need to include additional dependencies in your `requirements.txt`. The modular dependencies in this category include `pandas`, `numpy`, `pyspark`, `matplotlib`, `pillow`, `dask`, and more.
Data types are broken into groups, e.g. `pandas`, `spark` and `pickle`. Each group has a collection of data types, e.g. `pandas.CSVDataset`, `pandas.ParquetDataset` and more. You can install the dependencies for an entire group as follows:

```bash
pip install "kedro-datasets[<group>]"
```
This installs Kedro and the dependencies related to the data type group. For example, a workflow might depend on the data types in `pandas`; run `pip install "kedro-datasets[pandas]"` to install Kedro and the dependencies for the data types in the `pandas` group.
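If your project uses datasets from more than one group, you can list several extras in a single command using standard pip extras syntax, for example:

```bash
pip install "kedro-datasets[pandas,spark]"
```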
To limit installation to dependencies specific to a data type:

```bash
pip install "kedro-datasets[<group>-<dataset>]"
```

For example, your workflow might require the `pandas.ExcelDataset`, so to install its dependencies, run `pip install "kedro-datasets[pandas-exceldataset]"`.
From `kedro-datasets` version 3.0.0 onwards, the names of the optional dataset-level dependencies have been normalised to follow [PEP 685](https://peps.python.org/pep-0685/). The '.' character has been replaced with a '-' character and the names are in lowercase. For example, if you had `kedro-datasets[pandas.ExcelDataset]` in your requirements file, it would have to be changed to `kedro-datasets[pandas-exceldataset]`.
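In practice, rather than installing dataset extras ad hoc, you would typically record them in the project's `requirements.txt` so that collaborators pick up the same dataset dependencies. A minimal sketch, with hypothetical version specifiers:

```text
kedro~=0.19.0
kedro-datasets[pandas-exceldataset]>=3.0
```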
To ensure that the project dependencies and the transitive dependencies are pinned to specific versions, use `pip-tools` to compile the `requirements.txt` file into a `requirements.lock` file.
To install `pip-tools` in your virtual environment, run the following command:

```bash
pip install pip-tools
```
To add or remove dependencies in a project, edit the `requirements.txt` file, then run the following:

```bash
pip-compile <project_root>/requirements.txt --output-file <project_root>/requirements.lock
```
This compiles the requirements listed in the `requirements.txt` file into a `requirements.lock` file that specifies a list of pinned project dependencies (those with a strict version). You can also use this command with additional CLI arguments, such as `--generate-hashes` to use `pip`'s hash-checking mode or `--upgrade-package` to update specific packages to the latest or specific versions.
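For example, the following invocations (paths are illustrative) generate a lock file with hashes, and upgrade a single package while leaving the other pins untouched:

```bash
# Produce a lock file whose entries include --hash values for pip's hash-checking mode
pip-compile --generate-hashes requirements.txt --output-file requirements.lock

# Re-resolve only pandas to the latest allowed version, keeping other pins unchanged
pip-compile --upgrade-package pandas requirements.txt --output-file requirements.lock
```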
Check out the `pip-tools` documentation for more information.
The `requirements.txt` file contains "source" requirements, while `requirements.lock` contains the compiled version of those and requires no manual updates. If you need to update the dependencies, update the `requirements.txt` file and re-run the `pip-compile` command.
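Putting this together, a typical update cycle might look like the following sketch (adapt the paths to your project layout):

```bash
# 1. Edit requirements.txt to add, remove, or relax a dependency
# 2. Re-compile the pinned lock file
pip-compile requirements.txt --output-file requirements.lock
# 3. Install the pinned dependency set into your environment
pip install -r requirements.lock
```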