Adding Training image needed for train api #1963
Conversation
Pull Request Test Coverage Report for Build 7493861403
💛 - Coveralls
sdk/python/kubeflow/training/training_container/hf_llm_training.py (resolved review comments)
@andreyvelich @tenzen-y if it is good to go, can we merge this?
Force-pushed from 854d6fd to 4642b9e (compare)
otherwise lgtm
Then, can you update the following line?
platforms: ${{ matrix.platforms }}
platforms: linux/amd64,linux/arm64,linux/ppc64le
@deepanker13 Thanks!
/lgtm
/assign @andreyvelich
Thank you @deepanker13!
I left a few comments
@@ -0,0 +1,18 @@
# Use an official PyTorch runtime as a parent image
FROM nvcr.io/nvidia/pytorch:23.12-py3
Do we need to use the PyTorch image from NVIDIA for this trainer?
Would it be better to use the official PyTorch image, similar to what we use in the SDK?
docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
as suggested by @tenzen-y
#1963 (comment)
I see. @tenzen-y Do you know if PyTorch has an official image that we can use that is supported on all platforms?
@andreyvelich If I remember correctly, PyTorch doesn't provide multi-architecture images with GPU support, so we need to use the NVIDIA official images.
def setup_peft_model(model, lora_config):
    # Set up the PEFT model
    lora_config = LoraConfig(**json.loads(lora_config))
    print(lora_config)
    model = get_peft_model(model, lora_config)
    return model
Are we always going to have a PEFT config for this trainer?
@johnugeorge @deepanker13
lora_config can be omitted by the user; this is handled by setting an empty LoraConfig as the default value in the data class.
Sounds good, @deepanker13. Should we verify whether lora_config is set?
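To illustrate the check being discussed, here is a minimal, dependency-free sketch of guarding the PEFT setup on whether a LoRA config is actually provided. The helper name and the dict stand-in for the model are hypothetical; the real trainer would call `LoraConfig`/`get_peft_model` from the peft library instead:

```python
import json

def maybe_setup_peft_model(model, lora_config_json=None):
    # Hypothetical guard: skip PEFT entirely when no LoRA config is given.
    if not lora_config_json:
        return model  # base model unchanged
    config = json.loads(lora_config_json)
    # The real trainer would do:
    #   model = get_peft_model(model, LoraConfig(**config))
    # Here we wrap the model in a dict to keep the sketch dependency-free.
    return {"base": model, "lora": config}
```

With this shape, an empty default in the data class naturally falls through to the "no PEFT" branch.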
parser.add_argument("--transformer_type", help="model transformer type")
parser.add_argument("--model_dir", help="directory containing model")
parser.add_argument("--dataset_dir", help="directory containing dataset")
parser.add_argument("--dataset_name", help="dataset name")
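For reference, the flags quoted above can be exercised with a self-contained argparse sketch. The argument names are taken from the diff; the sample values passed to `parse_args` are illustrative, not from the PR:

```python
import argparse

def build_parser():
    # Mirrors the trainer's CLI flags as quoted in the diff above.
    parser = argparse.ArgumentParser(description="HF LLM trainer (sketch)")
    parser.add_argument("--transformer_type", help="model transformer type")
    parser.add_argument("--model_dir", help="directory containing model")
    parser.add_argument("--dataset_dir", help="directory containing dataset")
    parser.add_argument("--dataset_name", help="dataset name")
    return parser

# Illustrative invocation; unsupplied flags default to None.
args = build_parser().parse_args(
    ["--dataset_dir", "/workspace/datasets", "--dataset_name", "ultrachat_10k"]
)
```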
Do we add the dataset_name argument for users who want to use this Trainer without the SDK client?
I am asking because in the SDK client we always download the dataset in the storage initializer and store it in the Trainer volume, so we don't need to provide a name.
In the same dataset_dir there can be multiple datasets, right?
But can we use the train API to download more than one dataset? E.g. in your example, you just download the ultrachat_10k dataset.
Yes, if I run it with a different dataset_name, it will work fine.
@andreyvelich
Yeah, but for every API execution you create a new PyTorchJob, and a new Trainer image will be spun up. So the dataset always represents a single name, doesn't it?
examples/sdk/train_api.py
client.train(
    name="hf-test",
    num_workers=2,
    num_procs_per_worker=0,
Why is this value 0?
For CPU-only training.
Hmm, but can torchrun be used with CPUs? E.g. maybe I want to run torchrun --nproc-per-node=2 where I use 2 CPUs per node.
cc @johnugeorge
Yes, it can run on CPUs.
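A quick way to see why this works: torch.distributed supports CPU-only training via the gloo backend, while nccl requires GPUs. Below is a hedged, pure-Python sketch of how a launcher might pick the backend and assemble the torchrun command; the helper names are hypothetical, not from the PR:

```python
def pick_backend(has_cuda: bool) -> str:
    # nccl needs GPUs; gloo works on CPU-only nodes.
    return "nccl" if has_cuda else "gloo"

def torchrun_cmd(script: str, nproc_per_node: int) -> list[str]:
    # With CPUs, --nproc-per-node simply forks that many worker processes
    # on the node; each becomes a rank in the process group.
    return ["torchrun", f"--nproc-per-node={nproc_per_node}", script]
```

So `num_procs_per_worker=0` here presumably just means "let the launcher default apply" rather than "no processes".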
…lly, adding jupyter notebook
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks.
Force-pushed from 7f76de3 to f520329 (compare)
That's amazing, thank you @deepanker13!
/lgtm
/assign @johnugeorge
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deepanker13, johnugeorge. The full list of commands accepted by this bot can be found here. The pull request process is described here.
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Partially Fixes Train/Fine-tune API Proposal for LLMs #1945
Checklist: