A dbt package for standardizing data sets. You can use it to build a feature store in your data warehouse without external libraries such as Spark's MLlib or Python's scikit-learn.
The package contains a set of macros that mirror the functionality of the scikit-learn preprocessing module. Originally they were developed as part of the 2019 Medium article Feature Engineering in Snowflake.
They have been tested in Snowflake, Redshift, BigQuery, SQL Server and PostgreSQL 13.2. The test case expectations were built using scikit-learn (see *.py in integration_tests/data/sql), so you can expect behavioural parity with it.
The macros are:
scikit-learn function | macro name | Snowflake | BigQuery | Redshift | MSSQL | PostgreSQL | Example |
---|---|---|---|---|---|---|---|
KBinsDiscretizer | k_bins_discretizer | Y | Y | Y | Y | Y | |
LabelEncoder | label_encoder | Y | Y | Y | Y | Y | |
MaxAbsScaler | max_abs_scaler | Y | Y | Y | Y | Y | |
MinMaxScaler | min_max_scaler | Y | Y | Y | Y | Y | |
Normalizer | normalizer | Y | Y | Y | Y | Y | |
OneHotEncoder | one_hot_encoder | Y | Y | Y | Y | Y | |
QuantileTransformer | quantile_transformer | Y | Y | N | N | Y | |
RobustScaler | robust_scaler | Y | Y | Y | Y | Y | |
StandardScaler | standard_scaler | Y | Y | Y | N | Y | |
* 2D charts taken from scikit-learn.org, GIFs are my own
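As a rough illustration of what two of the transformations in the table compute, here is a minimal plain-Python sketch of the arithmetic behind scikit-learn's MinMaxScaler (with its default 0 to 1 range) and StandardScaler. This is not the package's code: the macros generate SQL that runs in your warehouse, and the function names `min_max_scale` and `standard_scale` are made up for this example.

```python
from statistics import fmean, pstdev


def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has no spread; return zeros instead of dividing by zero.
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def standard_scale(values):
    """Centre values on zero mean and scale to unit variance."""
    mean = fmean(values)
    # Population standard deviation (ddof=0), which is what StandardScaler uses.
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


sample = [1.0, 2.0, 3.0, 4.0, 5.0]
print(min_max_scale(sample))   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standard_scale(sample))
```

In a warehouse the minimum, maximum, mean and standard deviation are computed over the whole source table, so each output row depends on aggregate statistics of the column and not only on its own value.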
To use this in your dbt project, create or modify packages.yml to include:
packages:
  - package: "omnata-labs/dbt_ml_preprocessing"
    version: [">=1.0.2"]
(replace the version number with the latest release)
Then run:
dbt deps
to import the package.
dbt-ml-preprocessing version 1.2.0 is the first version to support (and require) dbt 1.0.0.
If you are not ready to upgrade to dbt 1.0.0, please use dbt-ml-preprocessing version 1.0.2.
To read the macro documentation and see examples, generate your docs (dbt docs generate, then dbt docs serve). The macro documentation appears in the Projects tree under dbt_ml_preprocessing.