[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models #2230

helenxie-bit · 2024-08-20T14:22:46Z

What this PR does / why we need it:
This PR fixes the error encountered when training with the `train` API in a CPU environment. It updates the trainer's base image version and adds the `num_labels` attribute to `HuggingFaceModelParams`, which is used when downloading pretrained models.
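
To illustrate why an optional `num_labels` is useful when loading a pretrained model, here is a minimal sketch. It is not the code from this PR: `ModelParams` and `from_pretrained_kwargs` are hypothetical names, and the sketch assumes `num_labels` is optional and is forwarded to a `from_pretrained`-style loader only when set.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ModelParams:
    """Hypothetical, simplified stand-in for HuggingFaceModelParams."""

    model_name: str
    # Assumption: None means "not specified by the user".
    num_labels: Optional[int] = None


def from_pretrained_kwargs(params: ModelParams) -> Dict[str, Any]:
    """Build keyword arguments for a from_pretrained-style loader."""
    kwargs: Dict[str, Any] = {
        "pretrained_model_name_or_path": params.model_name,
    }
    # Forward num_labels only when it was set, so models without a
    # classification head keep their own defaults.
    if params.num_labels is not None:
        kwargs["num_labels"] = params.num_labels
    return kwargs


if __name__ == "__main__":
    print(from_pretrained_kwargs(ModelParams("bert-base-cased", num_labels=5)))
    print(from_pretrained_kwargs(ModelParams("bert-base-cased")))
```

In this sketch a classification model receives the label count explicitly, while other model types are loaded without the extra argument.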

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2228

Checklist:

Signed-off-by: helenxie-bit <[email protected]>

coveralls · 2024-08-21T03:11:05Z

Pull Request Test Coverage Report for Build 10586866599

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
6 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-1.7%) to 31.745%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller.v1/mpi/mpijob_controller.go | 6 | 80.48% |

Totals
Change from base Build 10408646954:	-1.7%
Covered Lines:	3943
Relevant Lines:	12421

💛 - Coveralls
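
The overall percentage in the report can be checked against its totals, assuming it is simply covered lines divided by relevant lines. The function name `coverage_percent` below is illustrative and is not part of Coveralls.

```python
def coverage_percent(covered_lines: int, relevant_lines: int) -> float:
    """Return line coverage as a percentage of relevant lines."""
    if relevant_lines <= 0:
        raise ValueError("relevant_lines must be positive")
    return 100.0 * covered_lines / relevant_lines


if __name__ == "__main__":
    # Totals taken from the report above.
    overall = coverage_percent(3943, 12421)
    print(f"{overall:.3f}%")  # prints 31.745%
```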

helenxie-bit · 2024-08-21T04:00:04Z

@andreyvelich PTAL 👀, thanks!

andreyvelich

Thanks for this @helenxie-bit!
Please update the example as well: https://github.com/kubeflow/training-operator/blob/c0406d43b407ac86ec134eae6a3d19bba55ad1df/examples/pytorch/text-classification/Fine-Tune-BERT-LLM.ipynb

/assign @deepanker13 @johnugeorge

sdk/python/kubeflow/trainer/hf_llm_training.py

sdk/python/kubeflow/storage_initializer/hugging_face.py

Signed-off-by: helenxie-bit <[email protected]>

sdk/python/kubeflow/trainer/hf_llm_training.py

Signed-off-by: helenxie-bit <[email protected]>

helenxie-bit · 2024-08-27T23:13:23Z

@andreyvelich I have made the adjustment. Please review when you have time. Thanks!

andreyvelich

Thanks for the update @helenxie-bit!
/lgtm
/approve

google-oss-prow · 2024-08-28T16:23:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~sdk/python/OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…m_labels" for downloading pretrained models (kubeflow#2230)

* fix trainer error
  Signed-off-by: helenxie-bit <[email protected]>
* rerun tests
  Signed-off-by: helenxie-bit <[email protected]>
* update the process of num_labels in trainer
  Signed-off-by: helenxie-bit <[email protected]>
* rerun tests
  Signed-off-by: helenxie-bit <[email protected]>
* adjust the default value of 'num_labels'
  Signed-off-by: helenxie-bit <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>

…m_labels" for downloading pretrained models (kubeflow#2230)

* fix trainer error
  Signed-off-by: helenxie-bit <[email protected]>
* rerun tests
  Signed-off-by: helenxie-bit <[email protected]>
* update the process of num_labels in trainer
  Signed-off-by: helenxie-bit <[email protected]>
* rerun tests
  Signed-off-by: helenxie-bit <[email protected]>
* adjust the default value of 'num_labels'
  Signed-off-by: helenxie-bit <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Update `huggingface_hub` Version in the storage initializer to fix ImportError (#2180)
  Signed-off-by: helenxie-bit <[email protected]>
  Signed-off-by: Andrey Velichkevich <[email protected]>
* [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230)
  * fix trainer error
    Signed-off-by: helenxie-bit <[email protected]>
  * rerun tests
    Signed-off-by: helenxie-bit <[email protected]>
  * update the process of num_labels in trainer
    Signed-off-by: helenxie-bit <[email protected]>
  * rerun tests
    Signed-off-by: helenxie-bit <[email protected]>
  * adjust the default value of 'num_labels'
    Signed-off-by: helenxie-bit <[email protected]>

  ---------

  Signed-off-by: helenxie-bit <[email protected]>
  Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Hezhi Xie <[email protected]>
Co-authored-by: Hezhi (Helen) Xie <[email protected]>

fix trainer error

5b0f796

Signed-off-by: helenxie-bit <[email protected]>

google-oss-prow bot requested review from jinchihe and kuizhiqing August 20, 2024 14:22

google-oss-prow bot added the size/M label Aug 20, 2024

rerun tests

c0406d4

Signed-off-by: helenxie-bit <[email protected]>

andreyvelich reviewed Aug 21, 2024

sdk/python/kubeflow/trainer/hf_llm_training.py (outdated, resolved)

sdk/python/kubeflow/storage_initializer/hugging_face.py (resolved)

google-oss-prow bot assigned deepanker13 and johnugeorge Aug 21, 2024

helenxie-bit added 2 commits August 22, 2024 03:11

update the process of num_labels in trainer

b9dd592

Signed-off-by: helenxie-bit <[email protected]>

rerun tests

56f112b

Signed-off-by: helenxie-bit <[email protected]>

andreyvelich reviewed Aug 26, 2024

sdk/python/kubeflow/trainer/hf_llm_training.py (outdated, resolved)

adjust the default value of 'num_labels'

f7ef520

Signed-off-by: helenxie-bit <[email protected]>

andreyvelich reviewed Aug 28, 2024

google-oss-prow bot assigned andreyvelich Aug 28, 2024

google-oss-prow bot added the lgtm label Aug 28, 2024

google-oss-prow bot added the approved label Aug 28, 2024

google-oss-prow bot merged commit e9766d1 into kubeflow:master Aug 28, 2024
39 checks passed

tenzen-y mentioned this pull request Aug 29, 2024

[Release] Training operator 1.8.1 release #2241

Closed

4 tasks

andreyvelich mentioned this pull request Aug 29, 2024

Cherry pick of #2180 #2230 into v1.8-branch #2242

Merged

helenxie-bit mentioned this pull request Sep 2, 2024

[GSoC] Project 4: Hyperparameter Optimization API in Katib for LLMs kubeflow/katib#2339

Open

6 tasks
