[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models #2230

Merged: 5 commits, Aug 28, 2024

Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
This PR fixes the errors encountered when training with the train API in a CPU environment. It updates the version of the trainer's base image and adds the "num_labels" attribute to HuggingFaceModelParams, which is used when downloading pretrained models.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2228
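
For readers unfamiliar with the "target label out of bounds" failure named in the linked issue, the following is a minimal sketch of the idea behind passing "num_labels" through to model loading. It is not the SDK's code: `ModelParams`, `from_pretrained_kwargs`, and `check_labels` are hypothetical names, the `None` default is an assumption (the default chosen in this PR is not shown in the conversation), and the sketch only assumes that the resulting keyword arguments would be handed to a `from_pretrained`-style loader that accepts `num_labels`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Sequence


@dataclass
class ModelParams:
    """Hypothetical stand-in for HuggingFaceModelParams (not the SDK class)."""

    model_uri: str
    # Assumed default for illustration only; the PR's actual default is not shown here.
    num_labels: Optional[int] = None


def from_pretrained_kwargs(params: ModelParams) -> Dict[str, Any]:
    """Build keyword arguments for a from_pretrained-style call.

    num_labels is forwarded only when it is set, so models without a
    classification head are left untouched.
    """
    kwargs: Dict[str, Any] = {}
    if params.num_labels is not None:
        kwargs["num_labels"] = params.num_labels
    return kwargs


def check_labels(labels: Sequence[int], num_labels: int) -> None:
    """Raise early if any target label falls outside [0, num_labels)."""
    bad = sorted({y for y in labels if not 0 <= y < num_labels})
    if bad:
        raise ValueError(f"labels {bad} out of bounds for num_labels={num_labels}")
```

With a check like `check_labels`, a dataset whose labels exceed the configured head size fails with a readable message before training starts, rather than inside the loss computation.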

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@coveralls

coveralls commented Aug 21, 2024

Pull Request Test Coverage Report for Build 10586866599

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 6 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-1.7%) to 31.745%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 6 | 80.48%

Totals (Coverage Status)
Change from base Build 10408646954: -1.7%
Covered Lines: 3943
Relevant Lines: 12421

💛 - Coveralls

@helenxie-bit
Contributor Author

@andreyvelich PTAL 👀, thanks!

Member

@andreyvelich andreyvelich left a comment

sdk/python/kubeflow/trainer/hf_llm_training.py (outdated)
@helenxie-bit
Contributor Author

@andreyvelich I have made the adjustment. Please review when you have time. Thanks!

Member

@andreyvelich andreyvelich left a comment

Thanks for the update @helenxie-bit!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit e9766d1 into kubeflow:master Aug 28, 2024
39 checks passed
andreyvelich pushed a commit to andreyvelich/training-operator that referenced this pull request Aug 29, 2024
…m_labels" for downloading pretrained models (kubeflow#2230)

* fix trainer error

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* update the process of num_labels in trainer

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* adjust the default value of 'num_labels'

Signed-off-by: helenxie-bit <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>
google-oss-prow bot pushed a commit that referenced this pull request Aug 29, 2024
* Update `huggingface_hub` Version in the storage initializer to fix ImportError (#2180)

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230)

* fix trainer error

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* update the process of num_labels in trainer

Signed-off-by: helenxie-bit <[email protected]>

* rerun tests

Signed-off-by: helenxie-bit <[email protected]>

* adjust the default value of 'num_labels'

Signed-off-by: helenxie-bit <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Hezhi Xie <[email protected]>
Co-authored-by: Hezhi (Helen) Xie <[email protected]>
Development

Successfully merging this pull request may close these issues.

[SDK] Training failure in CPU Environment: AttributeError of "torch.cpu" and target label out of bounds
5 participants