Templates for Custom Tasks

In this directory there are three folders: transforms, estimators, and pipelines.

Each of these folders contains code snippets, collectively known as "tasks", that can be uploaded to DataRobot and incorporated into a blueprint. DataRobot then runs this custom code as part of a blueprint during both model training and model inference (i.e. scoring on new data). To make sure DataRobot can execute this code correctly, there are a few requirements:

  1. There must be a custom.py or custom.R file (depending on which language you use)
  2. The custom.py or custom.R file must contain one or more hooks. Note: the exact hooks you use depend on both the task type (estimator or transform) and the language (R or Python). DataRobot calls these hooks during either model training or model inference (i.e. scoring with new data) and passes the appropriate data and artifacts (e.g. a serialized model) to each hook as parameters so that they are available to your custom code. A minimal sketch of such a file appears after this list.
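For illustration only, here is a minimal sketch of what a Python custom.py for a regression estimator might contain. The fit signature follows the convention used by these templates, but the model choice and artifact file name are assumptions, not code from any specific template below.

```python
# custom.py -- minimal illustrative sketch, not one of the templates in this repo.
# DataRobot calls fit() during training and passes the training data plus an
# output directory where the trained artifact must be written.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.linear_model import Ridge  # illustrative model choice


def fit(X: pd.DataFrame, y: pd.Series, output_dir: str, **kwargs) -> None:
    """Train a model and serialize it into output_dir."""
    model = Ridge()
    model.fit(X, y)
    with open(Path(output_dir) / "artifact.pkl", "wb") as f:
        pickle.dump(model, f)
```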

There are also several recommended best practices, although these are not requirements:

  1. You can include a model-metadata.yaml file. This defines which problems your custom task can be used on (e.g. regression, classification, etc.) and what input it will accept (e.g. sparse vs. dense input). See the detailed model-metadata documentation or the examples linked below (a rough sketch also appears after this list)
  2. You can also include multiple supporting files as needed, e.g. helper files to clean up your code. You can see several examples of this in the folders linked below
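As a rough illustration only, a model-metadata.yaml for a transform task might look like the sketch below. The exact field names and allowed values are defined in the model-metadata documentation, so treat this as an approximation rather than an authoritative schema.

```yaml
# Illustrative sketch only -- consult the model-metadata documentation for the
# authoritative schema and allowed values.
name: my-example-transform    # hypothetical task name
type: training
targetType: transform         # e.g. regression, binary, multiclass, or transform
typeSchema:
  input_requirements:
    - field: data_types
      condition: IN
      value:
        - NUM
        - CAT
```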

Once uploaded, each of these three custom task types (transforms, estimators, pipelines) will appear as a single box in the Blueprint. Please refer to the Composable ML documentation to learn more about how tasks work.

The primary differences between the three custom task types are their output and their scope:

  • transforms - transform the input data, e.g. one hot encoding, numeric scaling, etc. Their output is always a dataframe to be used by other transforms or estimators. Note that the output also includes headers / column names.
  • estimators - predict (i.e. estimate) a new value using the input data, e.g. Logistic Regression or SVM. In DataRobot the final task, i.e. box, in any Blueprint must be an estimator. Note: a single Blueprint can, and often does, contain multiple estimators.
  • pipelines - allow a user to create a single task, i.e. box, in the Blueprint that incorporates multiple transforms and/or estimators. This is useful if you have a fully developed model pipeline with preprocessing and just want to upload the entire functionality to DataRobot (see the sketch after this list). Blueprints that use pipelines often have only one task, which is connected to all the input data types and handles both preprocessing and prediction. One advantage of pipelines is that you can then download the entire trained model from DataRobot as one file. The disadvantage is that the component transforms / estimators in a pipeline can't be used independently by other blueprints, e.g. if you create a custom missing-value imputation that you want every blueprint to use.
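To make the pipeline idea concrete, here is a minimal sketch of a pipeline task's custom.py, assuming sklearn and a regression target; the preprocessing steps, model, and artifact name are illustrative assumptions rather than the contents of any template below.

```python
# custom.py for a hypothetical pipeline task -- illustrative sketch only.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def fit(X: pd.DataFrame, y: pd.Series, output_dir: str, **kwargs) -> None:
    """Preprocess numeric and categorical columns, then fit a Ridge regressor."""
    numeric = Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())])
    categorical = OneHotEncoder(handle_unknown="ignore")
    preprocessing = ColumnTransformer([
        ("num", numeric, make_column_selector(dtype_include="number")),
        ("cat", categorical, make_column_selector(dtype_exclude="number")),
    ])
    pipeline = Pipeline([("prep", preprocessing), ("model", Ridge())])
    pipeline.fit(X, y)
    with open(Path(output_dir) / "artifact.pkl", "wb") as f:
        pickle.dump(pipeline, f)
```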

Note: We recommend that you use transforms and estimators instead of pipelines when possible, to promote reusable tasks. Transforms and estimators can often be used across multiple blueprints, e.g. a categorical encoding transform might be added to both a linear model and a neural network. Pipelines, in contrast, are more difficult to reuse because they often tightly couple the transforms and estimators.

To summarize: the key difference between transforms, estimators, and pipelines is the "hooks", i.e. the functions that DataRobot calls automatically. Both transforms and estimators support the init (mostly for tasks written in R to load libraries), fit, and load_model (used if the model is serialized in a non-standard format) hooks. The difference is that transforms also support the transform hook (to transform input data), while estimators support the score hook (to generate predictions at inference time). As you would expect, pipelines support all of the above hooks because they can incorporate both transforms and estimators.
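For example, a Python transform's custom.py might pair a fit hook with a transform hook roughly as sketched below (a median imputer, purely for illustration; the hook signatures follow the convention these templates use, but the details are assumptions).

```python
# custom.py for a hypothetical missing-value imputation transform -- sketch only.
import pickle
from pathlib import Path

import pandas as pd


def fit(X: pd.DataFrame, y: pd.Series, output_dir: str, **kwargs) -> None:
    """Learn per-column medians and serialize them as the task's artifact."""
    medians = X.median(numeric_only=True)
    with open(Path(output_dir) / "artifact.pkl", "wb") as f:
        pickle.dump(medians, f)


def transform(data: pd.DataFrame, transformer) -> pd.DataFrame:
    """Fill missing values using the stored medians; the output keeps column names."""
    return data.fillna(transformer)
```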

Note: Some estimator and pipeline examples below do not have score hooks. This is to demonstrate that DataRobot can automatically apply the correct scoring functionality if your estimator or pipeline uses the default sklearn, pytorch, keras, or xgboost scoring functions. For example, if you have an sklearn multiclass estimator such as a DecisionTreeClassifier and don't have a score hook defined, DataRobot will automatically call model.predict_proba() and output the results.
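As a sketch of that behavior (assuming sklearn and a classification target), the custom.py below defines only a fit hook; because the artifact is a plain sklearn classifier, the built-in sklearn scoring described above is applied at inference time. The model parameters and artifact name are illustrative.

```python
# custom.py with no score hook -- illustrative sketch only.
# The artifact is a plain sklearn classifier, so DataRobot's built-in sklearn
# scoring (model.predict_proba) can be used at inference time.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def fit(X: pd.DataFrame, y: pd.Series, output_dir: str, **kwargs) -> None:
    model = DecisionTreeClassifier(max_depth=5)
    model.fit(X, y)
    with open(Path(output_dir) / "artifact.pkl", "wb") as f:
        pickle.dump(model, f)
```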

If a template is tagged as "Verified", this means we have automated tests that guarantee it integrates correctly with the DataRobot platform.

Data format

When working with structured models, DRUM supports data as files in CSV, sparse (MTX), or Arrow format.
DRUM does not sanitize missing or unusual column names (e.g. names containing parentheses, slashes, or other special symbols).
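Because DRUM leaves column names untouched, you may want to normalize awkward names yourself before training or scoring. A small, purely illustrative helper (the function name, regex, and file name are assumptions) could look like this:

```python
import re

import pandas as pd


def sanitize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace characters such as parentheses and slashes, and name blank columns."""
    cleaned = df.copy()
    cleaned.columns = [
        re.sub(r"[^0-9a-zA-Z_]+", "_", str(col)).strip("_") or f"column_{i}"
        for i, col in enumerate(cleaned.columns)
    ]
    return cleaned


# Example usage (file name is illustrative):
# df = sanitize_columns(pd.read_csv("training_data.csv"))
```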

Transforms

  1. python_missing_values - Good starter Python transform. It uses built-in pandas functionality to impute missing values
  2. r_transform_simple - Good starter R transform. It imputes missing values with a median
  3. python3_sklearn_transform - Verified - This is a python template implementing a preprocessing-only SKLearn pipeline, handling categorical and numeric variables using an external helper file
  4. r_transform_recipe - Demonstrates how to use a caret-style recipe in an R transform
  5. custom_class_python - Shows how to define your own transformation class without relying on external libraries
  6. python3_image_transform - Shows how to handle input and output of image data type with DataRobot
  7. python3_bool_to_int_transform - A simple transform that converts boolean true/false to integer 1/0

Estimators

  1. python_regression - Implements a linear regressor trained with SGD, with hyperparameter definitions
  2. r_regression - Implements a GLM regressor in R
  3. r_sparse_regression - Demonstrates how to use sparse input with R
  4. python_binary_classification - Shows how to create an estimator for binary classification problems, with hyperparameters
  5. r_binary_classification - Implements a GLM binary classifier in R
  6. python_multiclass_classification - Implements a linear classifier with SGD training
  7. python_anomaly - Shows how to create an anomaly estimator with hyperparameters
  8. r_anomaly_detection - Shows how to create an anomaly estimator in R
  9. python_regression_with_custom_class - Uses a custom python class, CustomCalibrator, which implements fit and score from scratch with no external libraries
  10. python_calibrator - Shows how to create an estimator task for doing prediction calibration, which is usually done in DataRobot as an additional estimator step after the main estimator
  11. python_huggingface_vit - Shows the workflow of fine-tuning a Hugging Face model, with return values set up to work properly with DataRobot.

Pipelines

Except where noted, our pipelines are verified within our functional test framework.

  1. python3_sklearn_regression - Preprocessing with numeric, categorical and text, then SVD, with a Ridge regression estimator at the end
  2. r_lang - This R pipeline can support either binary classification or regression out of the box.
  3. python3_xgboost - An XGBoost pipeline; DRUM provides a predictor that knows how to score any XGBoost model
  4. python3_sklearn_binary - Preprocessing with numeric, categorical and text, then SVD, with a linear model estimator at the end
  5. python3_sklearn_multiclass - Handles text, numeric and categorical inputs. Relies on the sklearn drop-in environment and predictor.
  6. python3_anomaly_detection - Provides an uncalibrated Sklearn anomaly pipeline
  7. python3_calibrated_anomaly_detection - Provides a calibrated Sklearn anomaly pipeline
  8. python3_sklearn_with_custom_classes - This pipeline shows you how to build an estimator independent from any library [Not Verified]
  9. python3_sparse - A pipeline that intakes sparse data (csr_matrix) from an MTX file
  10. python3_pytorch_regression - Provides a pytorch pipeline. Also uses the transform hook
  11. python3_pytorch - Provides a pytorch pipeline. Also uses the transform hook
  12. python3_pytorch_multiclass - Parses the passed in class_labels file. Also uses transform, and also relies on the internal PyTorch prediction implementation
  13. python3_keras_joblib - Contains the option for both binary classification and regression. Serializes to the h5 format
  14. python3_keras_vizai_joblib - Trains a keras model on base64 encoded images.

There is also a growing repository of reusable tasks that might address your use case. You can find it here: https://github.com/datarobot-community/custom-models/tree/master/custom_tasks