Adding Training image needed for train api #1963
Conversation
Pull Request Test Coverage Report for Build 7493861403
💛 - Coveralls
sdk/python/kubeflow/training/training_container/hf_llm_training.py (resolved review comments)
@andreyvelich @tenzen-y if it is good to go, can we merge this?
Force-pushed from 854d6fd to 4642b9e (compare)
otherwise lgtm
Then, can you update the following line?
platforms: ${{ matrix.platforms }}
platforms: linux/amd64,linux/arm64,linux/ppc64le
@deepanker13 Thanks!
/lgtm
/assign @andreyvelich
Thank you @deepanker13!
I left a few comments
@@ -0,0 +1,18 @@
# Use an official PyTorch runtime as a parent image
FROM nvcr.io/nvidia/pytorch:23.12-py3
Do we need to use the PyTorch image from NVIDIA for this trainer?
Would it be better to use the official PyTorch image, similar to what we use in the SDK?
docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
as suggested by @tenzen-y
#1963 (comment)
I see. @tenzen-y Do you know if PyTorch has an official image that we can use that is supported on all platforms?
@andreyvelich If I remember correctly, PyTorch doesn't provide multi-architecture images with GPU support, so we need to use the NVIDIA official images.
def setup_peft_model(model, lora_config):
    # Set up the PEFT model
    lora_config = LoraConfig(**json.loads(lora_config))
    print(lora_config)
    model = get_peft_model(model, lora_config)
    return model
Are we always going to have a PEFT config for this trainer?
@johnugeorge @deepanker13
lora_config can be omitted by the user; this is handled by setting an empty LoraConfig as the default value in the data class.
Sounds good, @deepanker13. Should we verify whether lora_config is set?
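To illustrate the check being discussed, here is a minimal, dependency-free sketch of guarding the PEFT setup on whether a LoRA config is actually provided. The helper name and the dict stand-in for the model are hypothetical; the real trainer would call `LoraConfig`/`get_peft_model` from the peft library instead:

```python
import json

def maybe_setup_peft_model(model, lora_config_json=None):
    # Hypothetical guard: skip PEFT entirely when no LoRA config is given.
    if not lora_config_json:
        return model  # base model unchanged
    config = json.loads(lora_config_json)
    # The real trainer would do:
    #   model = get_peft_model(model, LoraConfig(**config))
    # Here we wrap the model in a dict to keep the sketch dependency-free.
    return {"base": model, "lora": config}
```

With this shape, an empty default in the data class naturally falls through to the "no PEFT" branch.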
parser.add_argument("--transformer_type", help="model transformer type")
parser.add_argument("--model_dir", help="directory containing model")
parser.add_argument("--dataset_dir", help="directory containing dataset")
parser.add_argument("--dataset_name", help="dataset name")
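For reference, the flags quoted above can be exercised with a self-contained argparse sketch. The argument names are taken from the diff; the sample values passed to `parse_args` are illustrative, not from the PR:

```python
import argparse

def build_parser():
    # Mirrors the trainer's CLI flags as quoted in the diff above.
    parser = argparse.ArgumentParser(description="HF LLM trainer (sketch)")
    parser.add_argument("--transformer_type", help="model transformer type")
    parser.add_argument("--model_dir", help="directory containing model")
    parser.add_argument("--dataset_dir", help="directory containing dataset")
    parser.add_argument("--dataset_name", help="dataset name")
    return parser

# Illustrative invocation; unsupplied flags default to None.
args = build_parser().parse_args(
    ["--dataset_dir", "/workspace/datasets", "--dataset_name", "ultrachat_10k"]
)
```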
Do we add the dataset_name argument for users who want to use this Trainer without the SDK client?
I am asking because in the SDK client we always download the dataset in the storage initializer and store it in the Trainer volume, so we don't need to provide a name.
In the same dataset_dir there can be multiple datasets, right?
But can we use the train API to download more than one dataset? E.g. in your example, you just download the ultrachat_10k dataset.
Yes, if I run it with a different dataset_name, it will work fine.
@andreyvelich
Yeah, but for every API execution you create a new PyTorchJob, and a new Trainer image will be spun up. So the dataset always represents a single name, doesn't it?
examples/sdk/train_api.py
client.train(
    name="hf-test",
    num_workers=2,
    num_procs_per_worker=0,
Why is this value 0?
For CPU-only training.
Hmm, but can torchrun be used with CPUs? E.g. maybe I want to run torchrun --nproc-per-node=2 where I use 2 CPUs per node.
cc @johnugeorge
Yes, it can run on CPUs.
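A quick way to see why this works: torch.distributed supports CPU-only training via the gloo backend, while nccl requires GPUs. Below is a hedged, pure-Python sketch of how a launcher might pick the backend and assemble the torchrun command; the helper names are hypothetical, not from the PR:

```python
def pick_backend(has_cuda: bool) -> str:
    # nccl needs GPUs; gloo works on CPU-only nodes.
    return "nccl" if has_cuda else "gloo"

def torchrun_cmd(script: str, nproc_per_node: int) -> list[str]:
    # With CPUs, --nproc-per-node simply forks that many worker processes
    # on the node; each becomes a rank in the process group.
    return ["torchrun", f"--nproc-per-node={nproc_per_node}", script]
```

So `num_procs_per_worker=0` here presumably just means "let the launcher default apply" rather than "no processes".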
…lly, adding jupyter notebook
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks.
Force-pushed from 7f76de3 to f520329 (compare)
That's amazing, thank you @deepanker13!
/lgtm
/assign @johnugeorge
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deepanker13, johnugeorge. The full list of commands accepted by this bot can be found here. The pull request process is described here.
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Partially Fixes Train/Fine-tune API Proposal for LLMs #1945
Checklist: