This repo contains a sample AI data preparation and model training pipeline built with Domino Flows. The flow:

- Loads two datasets from different sources
- Merges the data together
- Performs some data preprocessing
- Trains a model using the processed data
To run the flow, execute the following command in a workspace using the Domino Standard Environment:

```
pyflyte run --remote flow.py model_training --data_path_a /mnt/code/data/datasetA.csv --data_path_b /mnt/code/data/datasetB.csv
```
Once you run the command, navigate to **Flows page > Runs pivot > Run name** to monitor the flow execution.

Clicking on the **View graph** button at the top right of the runs view takes you to the Flyte UI, where you can see a graph view of the flow.

The flow is defined in the `flow.py` file. The sample flow contains 5 tasks: two for loading in data, one for merging the data, one for preprocessing it, and one for model training. Each task ultimately triggers a Domino Job and returns its outputs. We'll go through each of the tasks in detail.
## Data Loading

The purpose of the data loading tasks is to load the data into the flow and snapshot it, capturing the data as it exists at that point in time. The code snippet below shows the definitions of the two data loading tasks:
```python
task1 = run_domino_job_task(
    flyte_task_name='Load Data A',
    command='python /mnt/code/scripts/load-data-A.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='data_path', type=str, value=data_path_a)
    ],
    output_specs=[
        Output(name='datasetA', type=FlyteFile[TypeVar('csv')])
    ],
    use_project_defaults_for_omitted=True
)

task2 = run_domino_job_task(
    flyte_task_name='Load Data B',
    command='python /mnt/code/scripts/load-data-B.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='data_path', type=str, value=data_path_b)
    ],
    output_specs=[
        Output(name='datasetB', type=FlyteFile[TypeVar('csv')])
    ],
    use_project_defaults_for_omitted=True
)
```
As you can see, we use the `run_domino_job_task` helper method to create each task with the appropriate parameters. Here is each parameter in detail:
- `flyte_task_name`: The name for the task.
- `command`: The command that will be used in the Domino Job.
- `environment`: The name of the environment you want to use for the task. If not specified, the default environment for the project will be used.
- `hardware_tier_name`: The name of the hardware tier you want to use for the task. If not specified, the default hardware tier for the project will be used.
- `inputs`: A list of inputs that the task depends on. Note how the `data_path` inputs are set to the arguments provided through the command line when starting the flow.
- `output_specs`: A list of outputs that will be produced by the task. Supported input and output types are documented here.
Inputs can be accessed within Domino Jobs at `/workflow/inputs/<NAME OF INPUT>`. In a similar fashion, outputs must be written to `/workflow/outputs/<NAME OF OUTPUT>` inside the Domino Job for them to be tracked and returned by the task. See the scripts inside the `scripts` folder for more details.
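For example, `load-data-A.py` might look something like the sketch below; the exact logic is an assumption for illustration (`load-data-B.py` would be analogous, writing to `datasetB`):

```python
# Hypothetical sketch of scripts/load-data-A.py.
import shutil

# A str input arrives as a file named after the input whose contents
# are the value, so read the path string back out of it.
with open('/workflow/inputs/data_path') as f:
    data_path = f.read().strip()

# Writing to /workflow/outputs/<NAME OF OUTPUT> is what makes the file
# a tracked output of the task.
shutil.copyfile(data_path, '/workflow/outputs/datasetA')
```

Because the output is captured by the flow, each run snapshots the dataset as it existed at execution time.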
## Data Merging

With both datasets loaded in, the next task merges them into a single dataset. The code snippet below shows the definition of the data merging task:
```python
task3 = run_domino_job_task(
    flyte_task_name='Merge Data',
    command='python /mnt/code/scripts/merge-data.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='datasetA', type=FlyteFile[TypeVar('csv')], value=task1['datasetA']),
        Input(name='datasetB', type=FlyteFile[TypeVar('csv')], value=task2['datasetB'])
    ],
    output_specs=[
        Output(name='merged_data', type=FlyteFile[TypeVar('csv')])
    ],
    use_project_defaults_for_omitted=True
)
```
As you can see, the same `run_domino_job_task` function is used again. One thing to note is how the outputs from the data loading tasks are referenced by calling `task1['datasetA']` / `task2['datasetB']` and specified as inputs to the data merging task.
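As an illustration, `merge-data.py` could be as simple as the sketch below; row-wise concatenation is an assumed merge strategy, and the real logic lives in the script:

```python
# Hypothetical sketch of scripts/merge-data.py.
import pandas as pd

# FlyteFile inputs are materialized as files named after the input.
df_a = pd.read_csv('/workflow/inputs/datasetA')
df_b = pd.read_csv('/workflow/inputs/datasetB')

# Stack the rows and write the result to the declared output name.
pd.concat([df_a, df_b], ignore_index=True).to_csv(
    '/workflow/outputs/merged_data', index=False
)
```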
## Data Preparation

The next task takes the merged dataset and performs any necessary data cleaning. The code snippet below shows the definition of the data preprocessing task:
```python
task4 = run_domino_job_task(
    flyte_task_name='Process Data',
    command='python /mnt/code/scripts/process-data.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='data_path', type=FlyteFile[TypeVar('csv')], value=task3['merged_data'])
    ],
    output_specs=[
        Output(name='processed_data', type=FlyteFile[TypeVar('csv')])
    ],
    use_project_defaults_for_omitted=True
)
```
As you can see, it follows a very similar structure to the previous tasks.
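For illustration, a hypothetical `process-data.py` could follow this shape, with placeholder cleaning steps standing in for the real ones:

```python
# Hypothetical sketch of scripts/process-data.py.
import pandas as pd

# Note the input is named 'data_path' in the task definition above.
df = pd.read_csv('/workflow/inputs/data_path')

# Placeholder cleaning steps (assumptions for illustration).
df = df.drop_duplicates().dropna()

df.to_csv('/workflow/outputs/processed_data', index=False)
```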
## Model Training

The final task trains a model using the processed data. The code snippet below shows the definition of the model training task:
```python
task5 = run_domino_job_task(
    flyte_task_name='Train Model',
    command='python /mnt/code/scripts/train-model.py',
    hardware_tier_name='Small',
    inputs=[
        Input(name='processed_data', type=FlyteFile[TypeVar('csv')], value=task4['processed_data']),
        Input(name='epochs', type=int, value=10),
        Input(name='batch_size', type=int, value=32)
    ],
    output_specs=[
        Output(name='model', type=FlyteFile[TypeVar('pkl')])
    ],
    use_project_defaults_for_omitted=True
)
```
The output of the last task is then returned as the final output of the flow.
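To round out the walkthrough, `train-model.py` might follow the sketch below. The model type and the presence of a `target` label column are assumptions for illustration; only the input and output paths follow from the task definition above:

```python
# Hypothetical sketch of scripts/train-model.py.
import pickle

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier  # assumed model type

df = pd.read_csv('/workflow/inputs/processed_data')

# Scalar inputs are read from files named after the input.
epochs = int(open('/workflow/inputs/epochs').read())
batch_size = int(open('/workflow/inputs/batch_size').read())

X = df.drop(columns=['target']).to_numpy()  # 'target' column is an assumption
y = df['target'].to_numpy()
classes = np.unique(y)

# Simple epoch / mini-batch loop driven by the two scalar inputs.
model = SGDClassifier()
for _ in range(epochs):
    for start in range(0, len(X), batch_size):
        batch = slice(start, start + batch_size)
        model.partial_fit(X[batch], y[batch], classes=classes)

# Pickling to the declared 'model' output is what lets the flow return it.
with open('/workflow/outputs/model', 'wb') as f:
    pickle.dump(model, f)
```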
## Environment Requirements

This project only requires the Domino Standard Environment, which comes included with all Domino deployments.

## Hardware Requirements

This project works with the default `small-k8s` hardware tier that comes with all Domino deployments.
## License

This template is licensed under Apache 2.0 and contains the following open source components:

- Flytekit (Apache 2.0)