Identify the action items for adapting custom training in HPO #467

bnsblue · 2020-08-10T18:08:27Z

Adapting the underlying training job's interface (I/O)
In Flytekit, the HyperparameterTuningJobTask should copy the container from its underlying CustomTrainingJobTask
In Flytekit, the HyperparameterTuningJobTask should copy the timeout value from its underlying CustomTrainingJobTask
In Flytekit, the outputs of the underlying CustomTrainingJobTask should be embedded into the output interface of HyperparameterTuningJobTask
In Flytekit, we need to prepare an alternative pyflyte-execute script that takes the SageMaker-generated HPO values and merges them with the contents of inputs.pb
In Flyteplugins, modify the SageMaker HyperparameterTuningJob plugin to correctly recognize HPO jobs with different types of underlying training jobs and adapt accordingly to generate the correct CRD
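
The merge step described in the action items above could look roughly like the sketch below. This is only an illustration under stated assumptions: plain Python dicts stand in for the deserialized contents of inputs.pb, the function name merge_hpo_values is hypothetical, and no Flytekit API is used. It assumes SageMaker hands over every value as a string, so each value is coerced to the type of the input it overrides.

```python
from typing import Any, Dict


def merge_hpo_values(inputs: Dict[str, Any], hpo_values: Dict[str, str]) -> Dict[str, Any]:
    """Overlay SageMaker-provided string values onto typed task inputs.

    `inputs` is a stand-in for the deserialized inputs.pb (an assumption);
    `hpo_values` is the name -> string map chosen by the tuner.
    """
    merged = dict(inputs)
    for name, raw in hpo_values.items():
        if name not in inputs:
            # Skip keys that are not task inputs, e.g. the entrypoint selector.
            continue
        target_type = type(inputs[name])
        if target_type is bool:
            # bool("0") would be True, so booleans need explicit parsing.
            merged[name] = raw.strip().lower() in ("1", "true")
        else:
            merged[name] = target_type(raw)
    return merged
```

Keys that do not correspond to a declared input are dropped rather than added, so helper values such as __FLYTE_ENTRYPOINT_SELECTOR__ never reach the user's function.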

bnsblue · 2020-08-10T21:10:21Z

I created an HPOJob with the following CRD:

apiVersion: sagemaker.aws.amazon.com/v1
kind: HyperparameterTuningJob
metadata:
  name: sm-custom-hpo
spec:
  region: us-east-1
  tags:
    - key: test-key
      value: test-value
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    hyperParameterTuningJobObjective:
      type: Minimize
      metricName: validation:error
    resourceLimits:
      maxNumberOfTrainingJobs: 10
      maxParallelTrainingJobs: 5
    parameterRanges:
      integerParameterRanges:
      - name: num_round
        minValue: '10'
        maxValue: '20'
        scalingType: Linear
      continuousParameterRanges: []
      categoricalParameterRanges: []
    trainingJobEarlyStoppingType: Auto
  trainingJobDefinition:
    staticHyperParameters:
      - name: __FLYTE_ENTRYPOINT_SELECTOR__
        value: "SAGEMAKER"
      - name: base_score
        value: '0.5'
      - name: booster
        value: gbtree
      - name: csv_weights
        value: '0'
      - name: dsplit
        value: row
      - name: grow_policy
        value: depthwise
      - name: lambda_bias
        value: '0.0'
      - name: max_bin
        value: '256'
      - name: max_leaves
        value: '0'
      - name: normalize_type
        value: tree
      - name: objective
        value: reg:linear
      - name: one_drop
        value: '0'
      - name: prob_buffer_row
        value: '1.0'
      - name: process_type
        value: default
      - name: rate_drop
        value: '0.0'
      - name: refresh_leaf
        value: '1'
      - name: sample_type
        value: uniform
      - name: scale_pos_weight
        value: '1.0'
      - name: silent
        value: '0'
      - name: sketch_eps
        value: '0.03'
      - name: skip_drop
        value: '0.0'
      - name: tree_method
        value: auto
      - name: tweedie_variance_power
        value: '1.5'
      - name: updater
        value: grow_colmaker,prune
    algorithmSpecification:
      trainingImage: <image>
      trainingInputMode: File
      metricDefinitions:
      - name: validation:error
        regex: 'validation error'
    roleArn: <roleARN>
    inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: <s3_path>
          s3DataDistributionType: FullyReplicated
      contentType: text/csv
      compressionType: None
      recordWrapperType: None
      inputMode: File
    - channelName: validation
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: <s3_path>
          s3DataDistributionType: FullyReplicated
      contentType: text/csv
      compressionType: None
      recordWrapperType: None
      inputMode: File
    outputDataConfig:
      s3OutputPath: s3://my-bucket/xgboost
    resourceConfig:
      instanceType: ml.m4.xlarge
      instanceCount: 1
      volumeSizeInGB: 25
    stoppingCondition:
      maxRuntimeInSeconds: 3600
    enableNetworkIsolation: true
    enableInterContainerTrafficEncryption: false

This is the log of one of the underlying training jobs: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FTrainingJobs/log-events/9c6aeeed28af4dd48d0ff14af57ec168-005-1db03577$252Falgo-1-1597092293

The log shows that SageMaker still passes hyperparameters to the underlying training job in the same way:

SM_USER_ARGS=["--__FLYTE_ENTRYPOINT_SELECTOR__","SAGEMAKER","--base_score","0.5","--booster","gbtree","--csv_weights","0","--dsplit","row","--grow_policy","depthwise","--lambda_bias","0.0","--max_bin","256","--max_leaves","0","--normalize_type","tree","--num_round","14","--objective","reg:linear","--one_drop","0","--prob_buffer_row","1.0","--process_type","default","--rate_drop","0.0","--refresh_leaf","1","--sample_type","uniform","--scale_pos_weight","1.0","--silent","0","--sketch_eps","0.03","--skip_drop","0.0","--tree_method","auto","--tweedie_variance_power","1.5","--updater","grow_colmaker,prune"]
...
SM_HP___FLYTE_ENTRYPOINT_SELECTOR__=SAGEMAKER
...

Invoking script with the following command:
/usr/bin/python3 flyte_entrypoint_selector.py --__FLYTE_ENTRYPOINT_SELECTOR__ SAGEMAKER --base_score 0.5 --booster gbtree --csv_weights 0 --dsplit row --grow_policy depthwise --lambda_bias 0.0 --max_bin 256 --max_leaves 0 --normalize_type tree --num_round 14 --objective reg:linear --one_drop 0 --prob_buffer_row 1.0 --process_type default --rate_drop 0.0 --refresh_leaf 1 --sample_type uniform --scale_pos_weight 1.0 --silent 0 --sketch_eps 0.03 --skip_drop 0.0 --tree_method auto --tweedie_variance_power 1.5 --updater grow_colmaker,prune

This confirms that the training job underlying an HPO job uses the same interface as a standalone training job.
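
Since the entrypoint receives the hyperparameters as flat `--name value` pairs, as in the command line logged above, a minimal sketch of turning them back into a map might look like this. The function name parse_hyperparameter_args is hypothetical, and the sketch assumes every flag is followed by exactly one value, which matches the logged command but is not guaranteed by anything in this thread.

```python
from typing import Dict, List


def parse_hyperparameter_args(argv: List[str]) -> Dict[str, str]:
    """Turn a flat ['--name', 'value', ...] list into a name -> value map."""
    if len(argv) % 2 != 0:
        raise ValueError("expected an even number of tokens (--name value pairs)")
    params: Dict[str, str] = {}
    # Walk the list two tokens at a time: flag, then its value.
    for flag, value in zip(argv[0::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected a --name flag, got {flag!r}")
        params[flag[2:]] = value
    return params
```

Values stay as strings here; any type conversion is left to the code that knows the expected input types.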

OK, now we have to figure out how to pass them to the user's Python function. Are hyperparameters inputs? If so, will we have to download and manipulate the inputs?

SageMaker writes a summarized map of hyperparameter names to values at /opt/ml/input/config/hyperparameters.json inside the container. Its wrapper script parses that file and passes the hyperparameters to the user script as command-line arguments.
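
As a rough illustration of that behavior (not SageMaker's actual wrapper code), the sketch below reads the hyperparameter file and flattens it into `--name value` tokens. The function name hyperparameters_to_args is hypothetical, and sorting by name is an assumption made only because the logged command above happens to be in alphabetical order.

```python
import json
from pathlib import Path
from typing import List

# Location of the hyperparameter map inside the training container.
HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")


def hyperparameters_to_args(path: Path = HYPERPARAMETERS_PATH) -> List[str]:
    """Read the hyperparameter map and flatten it to --name value tokens."""
    with path.open() as f:
        hyperparameters = json.load(f)
    args: List[str] = []
    # Sorted order is an assumption, chosen to mirror the logged command.
    for name in sorted(hyperparameters):
        args.extend([f"--{name}", str(hyperparameters[name])])
    return args
```

An alternative entrypoint could read the same file directly instead of re-parsing the command line, which avoids depending on how the wrapper orders or quotes the arguments.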

bnsblue · 2020-08-10T21:15:54Z

I found a potential problem while running the experiment:
In HPO with custom training jobs, if the metrics are not well defined, the training job can hang and never finish.

‼️ This may lead to operational difficulty and extra cost if not handled carefully.
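
One way to catch a badly defined metric before launching a job is to test the regex locally against a sample log line. The sketch below is an illustration only: the function name extract_metric and the sample log format are hypothetical, and the idea that a missing capture group contributes to the hang is an assumption, not something confirmed in this thread. Note that the regex 'validation error' in the CRD above matches text but captures no value.

```python
import re
from typing import Optional


def extract_metric(regex: str, log_line: str) -> Optional[float]:
    """Return the number captured by `regex`, or None if nothing usable is captured."""
    pattern = re.compile(regex)
    if pattern.groups < 1:
        # No capture group: the pattern can match but cannot yield a value.
        return None
    match = pattern.search(log_line)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except (TypeError, ValueError):
        # The group did not participate or did not capture a number.
        return None
```

A pre-submission check could reject any metric definition for which this returns None on a representative log line. Keeping maxRuntimeInSeconds in the stoppingCondition, as the CRD above already does, bounds the cost if a job still hangs.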

github-actions · 2023-08-26T00:36:56Z

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions · 2023-09-02T00:37:34Z

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions · 2024-07-30T00:07:36Z

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

bnsblue self-assigned this Aug 10, 2020

bnsblue added this to the 0.7.0 milestone Aug 10, 2020

bnsblue added the enhancement New feature or request label Aug 10, 2020

kumare3 modified the milestones: 0.7.0, 0.8.0 Sep 2, 2020

anandswaminathan modified the milestones: 0.8.0, 0.9.0 Sep 30, 2020

EngHabu modified the milestones: 0.9.0, 0.10.0 Nov 4, 2020

EngHabu removed this from the 0.10.0 milestone Jan 11, 2021

kumare3 unassigned bnsblue Apr 23, 2021

eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022

Update to top-line imports (flyteorg#467)

65fa070

Signed-off-by: Yee Hing Tong <[email protected]>

eapolinario pushed a commit to eapolinario/flyte that referenced this issue Dec 20, 2022

[Mapping][TaskInfo] V.2 - Update Task details to allow check informat…

747b100

…ion for child task execution (flyteorg#467)

github-actions bot added the stale label Aug 26, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 2, 2023

eapolinario reopened this Nov 2, 2023

github-actions bot removed the stale label Nov 3, 2023

eapolinario pushed a commit to eapolinario/flyte that referenced this issue Apr 30, 2024

Update admin endpoint to use port 30080 in config.yaml (flyteorg#467)

848ab01

Signed-off-by: troychiu <[email protected]>

eapolinario pushed a commit to eapolinario/flyte that referenced this issue Apr 30, 2024

Update admin endpoint to use port 30080 in config.yaml (flyteorg#467)

44c55f6

Signed-off-by: troychiu <[email protected]>

austin362667 pushed a commit to austin362667/flyte that referenced this issue May 7, 2024

Update admin endpoint to use port 30080 in config.yaml (flyteorg#467)

84d7dca

Signed-off-by: troychiu <[email protected]>

robert-ulbrich-mercedes-benz pushed a commit to robert-ulbrich-mercedes-benz/flyte that referenced this issue Jul 2, 2024

Update admin endpoint to use port 30080 in config.yaml (flyteorg#467)

055ace7

Signed-off-by: troychiu <[email protected]>

github-actions bot added the stale label Jul 30, 2024
