Identify the action items for adapting custom training in HPO #467

Open
1 of 6 tasks
bnsblue opened this issue Aug 10, 2020 · 5 comments
Labels
enhancement New feature or request stale

Comments

@bnsblue
Contributor

bnsblue commented Aug 10, 2020

  • Adapting the underlying training job's interface (I/O)
  • In Flytekit, the HyperparameterTuningJobTask should copy the container from its underlying CustomTrainingJobTask
  • In Flytekit, the HyperparameterTuningJobTask should copy the timeout value from its underlying CustomTrainingJobTask
  • In Flytekit, the outputs of the underlying CustomTrainingJobTask should be embedded into the output interface of the HyperparameterTuningJobTask
  • In Flytekit, we need to prepare an alternative pyflyte-execute script that takes the SageMaker-generated HPO values and merges them with the content of inputs.pb
  • In Flyteplugins, modify the SageMaker HyperparameterTuningJob flyteplugin to correctly recognize HPO jobs with different types of underlying training jobs and adapt accordingly to generate the correct CRD
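
The three "copy" items above (container, timeout, outputs) can be sketched roughly as follows. This is only an illustration: `TrainingTaskSpec`, `HPOTaskSpec`, and `wrap_training_task` are hypothetical stand-ins and not Flytekit classes or functions, and the real task definitions carry far more than these three fields.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TrainingTaskSpec:
    """Hypothetical stand-in for a CustomTrainingJobTask definition."""
    container_image: str
    timeout_seconds: Optional[int]
    outputs: Dict[str, str] = field(default_factory=dict)  # output name -> type name


@dataclass
class HPOTaskSpec:
    """Hypothetical stand-in for a HyperparameterTuningJobTask definition."""
    container_image: str
    timeout_seconds: Optional[int]
    outputs: Dict[str, str]


def wrap_training_task(training: TrainingTaskSpec, hpo_outputs: Dict[str, str]) -> HPOTaskSpec:
    """Build an HPO spec that inherits container, timeout, and outputs."""
    # Refuse to silently overwrite an output that both tasks define.
    clashes = set(training.outputs) & set(hpo_outputs)
    if clashes:
        raise ValueError(f"output names defined twice: {sorted(clashes)}")
    return HPOTaskSpec(
        container_image=training.container_image,
        timeout_seconds=training.timeout_seconds,
        outputs={**hpo_outputs, **training.outputs},
    )
```

Failing on a name clash rather than picking a winner is a design choice made for this sketch; the actual merge rule is still to be decided.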
@bnsblue bnsblue self-assigned this Aug 10, 2020
@bnsblue bnsblue added this to the 0.7.0 milestone Aug 10, 2020
@bnsblue bnsblue added the enhancement New feature or request label Aug 10, 2020
@bnsblue
Contributor Author

bnsblue commented Aug 10, 2020

I created an HPOJob with the following CRD:

apiVersion: sagemaker.aws.amazon.com/v1
kind: HyperparameterTuningJob
metadata:
  name: sm-custom-hpo
spec:
  region: us-east-1
  tags:
    - key: test-key
      value: test-value
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    hyperParameterTuningJobObjective:
      type: Minimize
      metricName: validation:error
    resourceLimits:
      maxNumberOfTrainingJobs: 10
      maxParallelTrainingJobs: 5
    parameterRanges:
      integerParameterRanges:
      - name: num_round
        minValue: '10'
        maxValue: '20'
        scalingType: Linear
      continuousParameterRanges: []
      categoricalParameterRanges: []
    trainingJobEarlyStoppingType: Auto
  trainingJobDefinition:
    staticHyperParameters:
      - name: __FLYTE_ENTRYPOINT_SELECTOR__
        value: "SAGEMAKER"
      - name: base_score
        value: '0.5'
      - name: booster
        value: gbtree
      - name: csv_weights
        value: '0'
      - name: dsplit
        value: row
      - name: grow_policy
        value: depthwise
      - name: lambda_bias
        value: '0.0'
      - name: max_bin
        value: '256'
      - name: max_leaves
        value: '0'
      - name: normalize_type
        value: tree
      - name: objective
        value: reg:linear
      - name: one_drop
        value: '0'
      - name: prob_buffer_row
        value: '1.0'
      - name: process_type
        value: default
      - name: rate_drop
        value: '0.0'
      - name: refresh_leaf
        value: '1'
      - name: sample_type
        value: uniform
      - name: scale_pos_weight
        value: '1.0'
      - name: silent
        value: '0'
      - name: sketch_eps
        value: '0.03'
      - name: skip_drop
        value: '0.0'
      - name: tree_method
        value: auto
      - name: tweedie_variance_power
        value: '1.5'
      - name: updater
        value: grow_colmaker,prune
    algorithmSpecification:
      trainingImage: <image>
      trainingInputMode: File
      metricDefinitions:
      - name: validation:error
        regex: 'validation error'
    roleArn: <roleARN>
    inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: <s3_path>
          s3DataDistributionType: FullyReplicated
      contentType: text/csv
      compressionType: None
      recordWrapperType: None
      inputMode: File
    - channelName: validation
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: <s3_path>
          s3DataDistributionType: FullyReplicated
      contentType: text/csv
      compressionType: None
      recordWrapperType: None
      inputMode: File
    outputDataConfig:
      s3OutputPath: s3://my-bucket/xgboost
    resourceConfig:
      instanceType: ml.m4.xlarge
      instanceCount: 1
      volumeSizeInGB: 25
    stoppingCondition:
      maxRuntimeInSeconds: 3600
    enableNetworkIsolation: true
    enableInterContainerTrafficEncryption: false

This is the log of one of the underlying training jobs: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FTrainingJobs/log-events/9c6aeeed28af4dd48d0ff14af57ec168-005-1db03577$252Falgo-1-1597092293

The log shows that SageMaker still passes hyperparameters to the underlying training job in the same way:

SM_USER_ARGS=["--__FLYTE_ENTRYPOINT_SELECTOR__","SAGEMAKER","--base_score","0.5","--booster","gbtree","--csv_weights","0","--dsplit","row","--grow_policy","depthwise","--lambda_bias","0.0","--max_bin","256","--max_leaves","0","--normalize_type","tree","--num_round","14","--objective","reg:linear","--one_drop","0","--prob_buffer_row","1.0","--process_type","default","--rate_drop","0.0","--refresh_leaf","1","--sample_type","uniform","--scale_pos_weight","1.0","--silent","0","--sketch_eps","0.03","--skip_drop","0.0","--tree_method","auto","--tweedie_variance_power","1.5","--updater","grow_colmaker,prune"]
...
SM_HP___FLYTE_ENTRYPOINT_SELECTOR__=SAGEMAKER
...

Invoking script with the following command:
/usr/bin/python3 flyte_entrypoint_selector.py --__FLYTE_ENTRYPOINT_SELECTOR__ SAGEMAKER --base_score 0.5 --booster gbtree --csv_weights 0 --dsplit row --grow_policy depthwise --lambda_bias 0.0 --max_bin 256 --max_leaves 0 --normalize_type tree --num_round 14 --objective reg:linear --one_drop 0 --prob_buffer_row 1.0 --process_type default --rate_drop 0.0 --refresh_leaf 1 --sample_type uniform --scale_pos_weight 1.0 --silent 0 --sketch_eps 0.03 --skip_drop 0.0 --tree_method auto --tweedie_variance_power 1.5 --updater grow_colmaker,prune

This confirms that the training job underlying an HPO job uses the same interface as a standalone training job.

Now we have to figure out how to pass them to the user's Python function. Are hyperparameters inputs? If so, will we have to download and manipulate the inputs?
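
One possible shape for that step, assuming hyperparameters are treated as inputs: parse the `SM_USER_ARGS` list seen in the log into a map, then let the tuner-chosen values override the matching task inputs. Plain dicts stand in for the decoded content of inputs.pb here, and `parse_user_args` and `merge_inputs` are hypothetical names, not existing Flytekit functions.

```python
import json
from typing import Dict, List


def parse_user_args(raw: str) -> Dict[str, str]:
    """Turn a JSON list like ["--a", "1", "--b", "2"] into {"a": "1", "b": "2"}."""
    tokens: List[str] = json.loads(raw)
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating --name value tokens")
    params: Dict[str, str] = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"unexpected token {flag!r}")
        params[flag[2:]] = value
    return params


def merge_inputs(task_inputs: Dict[str, str], hyperparameters: Dict[str, str]) -> Dict[str, str]:
    """Override task inputs with tuner-chosen values; ignore unknown names."""
    merged = dict(task_inputs)
    for name, value in hyperparameters.items():
        if name in merged:
            merged[name] = value
    return merged
```

Names that are not task inputs, such as `__FLYTE_ENTRYPOINT_SELECTOR__`, are dropped by `merge_inputs`. Values stay strings in this sketch; converting them to the input's declared type would be a separate step.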

SageMaker writes a map of hyperparameter names and values to /opt/ml/input/config/hyperparameters.json inside the container. Its wrapper script parses that file and passes the hyperparameters to the user script as command-line arguments.
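
A minimal sketch of that file-to-arguments step, assuming the file holds a flat JSON object of name/value pairs. This is not SageMaker's actual wrapper code; `hyperparameters_to_argv` is a hypothetical helper, and sorting by name is only there to make the output deterministic.

```python
import json
from pathlib import Path
from typing import List

# Path taken from the comment above; it only exists inside the training container.
HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")


def hyperparameters_to_argv(path: Path = HYPERPARAMETERS_PATH) -> List[str]:
    """Flatten {"name": "value"} into ["--name", "value", ...], sorted by name."""
    with path.open() as f:
        hyperparameters = json.load(f)
    argv: List[str] = []
    for name in sorted(hyperparameters):
        argv += [f"--{name}", str(hyperparameters[name])]
    return argv
```

An alternative pyflyte-execute script could read the same file directly instead of re-parsing the command line.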

@bnsblue
Contributor Author

bnsblue commented Aug 10, 2020

I found a potential problem while running the experiment: in HPO with custom training jobs, if the metrics are not well defined, the training job can hang and never end.

‼️ This may lead to potential operational difficulty and extra cost if not handled carefully.
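
One way to catch a poorly defined metric before submitting the job might be a local check like the one below. It rests on an assumption: that the metric value is read from the first capture group of the regex in `metricDefinitions`. Under that assumption, the `validation error` regex in the CRD above has no capture group and would never yield a value. `extract_metric` and the sample log line are hypothetical.

```python
import re
from typing import Optional


def extract_metric(regex: str, log_line: str) -> Optional[float]:
    """Return the value captured by the first group, or None if nothing is captured."""
    pattern = re.compile(regex)
    if pattern.groups < 1:
        return None  # no capture group, so there is no value to extract
    match = pattern.search(log_line)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except (TypeError, ValueError):
        return None  # the group was empty or did not hold a number


# Hypothetical log line; the real format depends on the training script.
sample = "validation error: 0.12"
print(extract_metric("validation error", sample))
print(extract_metric(r"validation error: ([0-9.]+)", sample))
```

Whether a missing metric is what actually makes the job hang is not established in this thread; the check only flags definitions that cannot produce a number.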

@kumare3 kumare3 modified the milestones: 0.7.0, 0.8.0 Sep 2, 2020
@anandswaminathan anandswaminathan modified the milestones: 0.8.0, 0.9.0 Sep 30, 2020
@EngHabu EngHabu modified the milestones: 0.9.0, 0.10.0 Nov 4, 2020
@EngHabu EngHabu removed this from the 0.10.0 milestone Jan 11, 2021
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 6, 2022
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Aug 9, 2023
@github-actions

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Aug 26, 2023
@github-actions

github-actions bot commented Sep 2, 2023

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot closed this as not planned Sep 2, 2023
@eapolinario eapolinario reopened this Nov 2, 2023
@github-actions github-actions bot removed the stale label Nov 3, 2023
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Apr 30, 2024
eapolinario pushed a commit to eapolinario/flyte that referenced this issue Apr 30, 2024
austin362667 pushed a commit to austin362667/flyte that referenced this issue May 7, 2024
robert-ulbrich-mercedes-benz pushed a commit to robert-ulbrich-mercedes-benz/flyte that referenced this issue Jul 2, 2024

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Jul 30, 2024