SageMaker on Flyte: TrainingJob for training with built-in algorithms and basic HPOJob support [Alpha] #120

Merged
merged 88 commits into from
Jul 31, 2020

@bnsblue bnsblue commented Jun 3, 2020

TL;DR

This PR adds the definitions needed for basic support of SageMaker TrainingJob and HPOJob (Hyperparameter Optimization).

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

This PR adds basic support for invoking a SageMaker TrainingJob (built-in algorithm mode) and a SageMaker HPOJob from within Flyte. The following paragraphs demonstrate the supported use cases.

Defining a Simple Training Job

Users can use SageMaker's built-in algorithms without writing any function or logic. They define an SdkBuiltinAlgorithmTrainingJobTask in Flytekit and supply the settings and the spec of the built-in algorithm as follows:

alg_spec = training_job_models.AlgorithmSpecification(
    input_mode=_sdk_sagemaker_types.InputMode.FILE,
    algorithm_name=_sdk_sagemaker_types.AlgorithmName.XGBOOST,
    algorithm_version="0.72",
    metric_definitions=[training_job_models.MetricDefinition(name="Minimize", regex="validation:error")]
)

# Definition of a SageMaker training job using SageMaker's built-in algorithm mode of XGBoost
xgboost_simple_train_task = training_job_task.SdkBuiltinAlgorithmTrainingJobTask(
    training_job_config=training_job_models.TrainingJobConfig(
        instance_type="ml.m4.xlarge",
        instance_count=1,
        volume_size_in_gb=25,
    ),
    algorithm_specification=alg_spec,
    cache_version='v1',
    cacheable=True,
)
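
The metric_definitions entry above pairs a metric name with a regex. As a rough illustration of how such a pair can be used, the sketch below scans log lines with a regex and keeps the last captured value. It is plain Python, not flytekit or SageMaker code: MetricDefinitionSketch, last_metric_value, the sample log lines, and the regex with its capture group are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MetricDefinitionSketch:
    """Hypothetical stand-in for a metric definition: a name plus a regex."""
    name: str
    regex: str


def last_metric_value(definition: MetricDefinitionSketch,
                      log_lines: Iterable[str]) -> Optional[float]:
    """Return the most recent value captured by the regex, or None if nothing matched."""
    pattern = re.compile(definition.regex)
    value = None
    for line in log_lines:
        match = pattern.search(line)
        if match:
            # Group 1 is the numeric part of the metric in this sketch.
            value = float(match.group(1))
    return value


# Made-up log lines; real training logs may be formatted differently.
sample_logs = [
    "[0] train-error:0.310 validation-error:0.352",
    "[1] train-error:0.240 validation-error:0.298",
]
validation_error = MetricDefinitionSketch(
    name="validation:error",
    regex=r"validation-error:([0-9.]+)",
)
latest = last_metric_value(validation_error, sample_logs)
print(latest)
```

A tuning job would typically compare a value like this across training runs to decide which hyperparameters performed best.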

Defining a Simple Hyperparameter Tuning Job

SageMaker-on-Flyte also supports easy chaining between a TrainingJob task and an HPO job. After defining a TrainingJob task, users may want to apply hyperparameter tuning to the training while keeping the flexibility to run the TrainingJob task standalone. SdkSimpleHPOJobTask in our SDK makes this straightforward: it accepts the definition of a TrainingJob task as part of its spec.

# Definition of a SageMaker hyperparameter-tuning job.
# Note that it is chained behind the simple TrainingJob task defined above
xgboost_simple_hpo_task = hpo_job_task.SdkSimpleHPOJobTask(
    training_job=xgboost_simple_train_task,
    max_number_of_training_jobs=10,
    max_parallel_training_jobs=5,
    cache_version='2',
    retries=2,
    cacheable=True,
)
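
The chaining described above can be pictured as plain composition: the HPO definition holds a reference to the training job definition, which remains usable on its own. The sketch below uses stand-in dataclasses rather than the flytekit classes; TrainingJobSketch, HPOJobSketch, and min_sequential_rounds are hypothetical names, and the round count is only a lower bound that assumes every slot is filled in parallel.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingJobSketch:
    """Hypothetical stand-in for a training job task definition."""
    instance_type: str
    instance_count: int


@dataclass(frozen=True)
class HPOJobSketch:
    """Hypothetical stand-in for an HPO task that wraps a training job."""
    training_job: TrainingJobSketch
    max_number_of_training_jobs: int
    max_parallel_training_jobs: int

    def __post_init__(self) -> None:
        # Both limits must be positive for the tuning job to make sense.
        if self.max_number_of_training_jobs < 1 or self.max_parallel_training_jobs < 1:
            raise ValueError("job limits must be positive")

    def min_sequential_rounds(self) -> int:
        """Lower bound on how many waves of training jobs are needed."""
        return math.ceil(self.max_number_of_training_jobs / self.max_parallel_training_jobs)


train = TrainingJobSketch(instance_type="ml.m4.xlarge", instance_count=1)
hpo = HPOJobSketch(
    training_job=train,
    max_number_of_training_jobs=10,
    max_parallel_training_jobs=5,
)
rounds = hpo.min_sequential_rounds()
print(rounds)
```

With the values from the example (10 jobs, 5 in parallel), the tuning job needs at least two waves of training jobs, and `train` can still be used without `hpo`.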

Invoking Training Job Tasks and Hyperparameter Tuning Job Tasks

Invoking Training Job Tasks and HPO Job Tasks from inside a Flyte workflow works much like invoking any other type of task: you supply the required inputs to the task definition. For Training Job Tasks and HPO Job Tasks, the required inputs are pre-defined inputs that we think are needed for every training job. That's why you don't see these inputs declared in the task definition -- we added them for you in our SDK.

@workflow_class
class SageMakerSimpleWorkflow(object):
    static_hyperparameters = Input(Types.Generic, required=True, help="Sample hyperparameter input")

    my_simple_training_task_exec = xgboost_simple_train_task(
        train="s3://path/to/train/data",
        validation="s3://path/to/validation/data",
        static_hyperparameters=static_hyperparameters,
        stopping_condition=StoppingCondition(
            max_runtime_in_seconds=43200,  # 12 hours
        ).to_flyte_idl(),
    )
    ...
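
One way to picture the pre-defined inputs mentioned above is a task object that carries a fixed input list and checks it at call time. The sketch below is not the flytekit implementation: PredefinedInputTaskSketch is a hypothetical class, the input names are read off the keyword arguments in the workflow example, and the hyperparameter value is made up.

```python
from typing import Any, Dict

# Assumed set of pre-defined inputs, taken from the workflow example above.
PREDEFINED_INPUTS = ("train", "validation", "static_hyperparameters", "stopping_condition")


class PredefinedInputTaskSketch:
    """Hypothetical task whose inputs are fixed by the SDK, not declared by the user."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.required_inputs = PREDEFINED_INPUTS

    def __call__(self, **inputs: Any) -> Dict[str, Any]:
        # Reject calls that omit a required input or pass an unknown one.
        missing = sorted(set(self.required_inputs) - set(inputs))
        unexpected = sorted(set(inputs) - set(self.required_inputs))
        if missing or unexpected:
            raise TypeError(f"{self.name}: missing={missing}, unexpected={unexpected}")
        return {"task": self.name, "inputs": dict(inputs)}


train_task = PredefinedInputTaskSketch("xgboost_simple_train_task")
node = train_task(
    train="s3://path/to/train/data",
    validation="s3://path/to/validation/data",
    static_hyperparameters={"num_round": "10"},  # made-up hyperparameter
    stopping_condition={"max_runtime_in_seconds": 43200},
)
print(sorted(node["inputs"]))
```

Calling `train_task(train="...")` alone raises a TypeError listing the three missing inputs, which mirrors the idea that every training job needs the full set.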

Flyte's Single Task Execution capability also makes it easy to invoke an SdkBuiltinAlgorithmTrainingJobTask or an SdkSimpleHPOJobTask on its own. Users do not need a workflow to launch the SageMaker tasks; they can define the tasks, then register and launch them standalone, which enables fast iteration.

xgboost_simple_train_task = training_job_task.SdkBuiltinAlgorithmTrainingJobTask( ... )

xgboost_simple_hpo_task = hpo_job_task.SdkSimpleHPOJobTask(
    training_job=xgboost_simple_train_task,
    ...
)

Tracking Issue

flyteorg/flyte#255

Follow-up issue

flyteorg/flyte#431

@bnsblue bnsblue marked this pull request as draft June 3, 2020 17:23
codecov-commenter commented Jun 3, 2020

Codecov Report

Merging #120 into master will increase coverage by 0.02%.
The diff coverage is 81.42%.


@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
+ Coverage   80.83%   80.86%   +0.02%     
==========================================
  Files         219      225       +6     
  Lines       14313    14678     +365     
  Branches     1195     1205      +10     
==========================================
+ Hits        11570    11869     +299     
- Misses       2460     2526      +66     
  Partials      283      283              

Impacted Files                                         Coverage Δ
flytekit/plugins/__init__.py                           100.00% <ø> (ø)
flytekit/sdk/tasks.py                                  71.59% <28.57%> (-3.72%) ⬇️
flytekit/models/sagemaker/parameter_ranges.py          53.75% <53.75%> (ø)
flytekit/models/sagemaker/hpo_job.py                   71.42% <71.42%> (ø)
...ytekit/common/tasks/sagemaker/training_job_task.py  92.85% <92.85%> (ø)
flytekit/models/sagemaker/training_job.py              95.45% <95.45%> (ø)
flytekit/__init__.py                                   100.00% <100.00%> (ø)
flytekit/common/constants.py                           100.00% <100.00%> (ø)
flytekit/common/tasks/sagemaker/hpo_job_task.py        100.00% <100.00%> (ø)
...ts/flytekit/unit/sdk/tasks/test_sagemaker_tasks.py  100.00% <100.00%> (ø)
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a32dd2d...7dbac57.

@bnsblue bnsblue mentioned this pull request Jun 9, 2020
8 tasks
@bnsblue bnsblue changed the title from "Adding SageMaker trainingjob and hpojob" to "SageMaker on Flyte: TrainingJob and HPOJob [Alpha]" Jun 23, 2020
@bnsblue bnsblue changed the title from "SageMaker on Flyte: TrainingJob and HPOJob [Alpha]" to "SageMaker on Flyte: TrainingJob and HPOJob support [Alpha]" Jun 23, 2020
@wild-endeavor wild-endeavor left a comment

+1 - will approve after the final version is in.

from flytekit.common.constants import SdkTaskType


class SdkSimpleTrainingJobTask(_sdk_task.SdkTask):

What does Simple mean?

is this meant to be built in?

@bnsblue bnsblue Jul 29, 2020

Yes -- it means using the built-in algorithm mode, where users don't write their own decorated function.

kumare3 commented Jul 29, 2020

This looks awesome, two major comments

  1. Support for other formats (currently you can keep it at CSV)
  2. Docs to be added

@kumare3 kumare3 self-requested a review July 29, 2020 20:48
kumare3 commented Jul 29, 2020

Looks good to me, but we should get plugin merged before this

@bnsblue bnsblue merged commit c2e8424 into master Jul 31, 2020
max-hoffman pushed a commit to dolthub/flytekit that referenced this pull request May 11, 2021
… and basic HPOJob support [Alpha] (flyteorg#120)

* adding trainingjob model and sagemaker task

* adding models for sagemaker proto messages

* add new line at eof

* adding common trainingjob task

* redo flytekit changes to comply with new interface and proto definition

* Fix a logic bug in training job model. Adding SdkSimpleTrainingJobTask type

* Add a comment

* Add SdkSimpleHPOJobTask

* Remove the embedding of the underlying trainingjob's output from the hpojob's interface

* fix a typo

* add new line at eof

* adding custom training job sdk type

* add code for tranlating an enum in hpo_job model; fix hpo_job_task sdk task

* missing a colon

* add the missing input stopping_condition for training job tasks

* bump flyteidl version

* bump to a beta version

* fixing unit tests

* fixing unit tests

* replacing interface types

* change

* fixed training job unit test

* fix hpo job task interface and hide task type from users

* fix hpo job task interface

* fix hpo models

* fix serialization of the underlying trainingjob of a hpo job

* Expose training job as a parameter

* Working!

* replacing hyphens with underscores

* updated

* bug fix

* Sagemaker nb

* Sagemaker HPO

* remove .demo directory

* register and launch standalone trainingjob task

* Merge

* adding unit test for SdkSimpleHPOJobTask

* fixing unit tests

* preventing installing numpy==1.19.0 which introduces a breaking change for unit tests

* fix semver

* make changes corresponding to flyteidl changes (renaming hpo to hyperparameter tuning)

* bump beta version

* Delete config.yaml

* make changes to reflect changes in flyteidl

* make task name consistent

* add missing properties for hyperparameter models

* add missing type hints and remove unused imports

* remove unused sdk sagemaker dir

* remove unused test file

* revert numpy semver

* remove type hints for self because CI is using python 3.6.3 while __future__.annotations requires python 3.7

* complete docstrings for hpo job task

* fix unit test

* adding input_file_type (wip)

* add input file type support

* add docs

* reflecting the renamed type and field

* reflecting remove of libsvm content type

* reflecting remove of libsvm content type

* Give metric_definitions a None as the default value because built-in algorithm does not allow custom metrics

* nix a print statement

* nix custom training job for the current release

* rename SdkSimpleTrainingJobTask to SdkBuiltinAlgorithmTrainingJobTask

* revert setup.py dependency

Co-authored-by: Yee Hing Tong <[email protected]>
Co-authored-by: Ketan Umare <[email protected]>
Co-authored-by: Haytham AbuelFutuh <[email protected]>
5 participants