Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure proxy and server images are only provided with --use-pathways. #100

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

RoshaniN
Copy link
Collaborator

@RoshaniN RoshaniN commented Mar 27, 2024

Fixes / Features

  • Remove default values for proxy-server and server images.
  • Ensure user provides both proxy-server-image and server-image when --use-pathways is set and vice-versa.
  • Validate that the proxy-server-image and server-image exist on GCR.
  • Remove a Pathways flag.

Testing / Documentation

Testing details.
Additionally, I did some manual tests to see that the user gets the right error message for -

[XPK] --proxy-server-image and --server-image need to be provided with --use-pathways.

Also manually tested that invalid docker images are detected and user is guided with an error message.

[XPK] Failed to validate your docker image, check the docker image. You should be able to navigate to the URL gcr.io/hello_world in cloud-tpu-multipod-dev
[XPK] Please check the proxy-server-image or server-image as advised!
  • [ y ] Tests pass
  • [ y ] Appropriate changes to documentation are included in the PR

@RoshaniN RoshaniN force-pushed the roshanin-check-use-pathways branch from 9a719a3 to 24f8d5b Compare March 27, 2024 20:20
@sadikneipp sadikneipp mentioned this pull request Mar 27, 2024
@RoshaniN RoshaniN force-pushed the roshanin-check-use-pathways branch from 570f4ae to 86bdd91 Compare March 27, 2024 20:55
--proxy-server-image=<Pathways proxy server image> \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='bash /usr/pathways/ifrt/maxtext_entrypoint.sh base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if you have a simpler example. This seems like a complicated README case example to maintain.

Like in the non-pathways workload command example we just print hello world. Is that possible here too?

Copy link
Collaborator Author

@RoshaniN RoshaniN Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the question, Victor! Let me check if other dependencies are up-to-date. If so, I can simplify it to more like "python3 Maxtext...", however, we really want the workload to actually train!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however, we really want the workload to actually train!

Agreed this is the ultimately goal but we should go one step at a time. How about we have both a simple workload and a MaxText example?

I want users to be able to verify they have xpk + system working before trying to train.

Comment on lines +343 to +344
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--base-docker-image flow is more recommended for xpk users so that local changes can be iterated on. Is that possible for pathways?

Copy link
Collaborator Author

@RoshaniN RoshaniN Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the docker-image provided by the users is a fixed docker image, but if the others deps (mentioned above) are merged, I can update docker-image to base-docker-image! I would also prefer that!

xpk.py Outdated Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
xpk.py Show resolved Hide resolved
@@ -4857,17 +4896,15 @@ def directory_path_type(value):
workload_pathways_workload_arguments.add_argument(
'--proxy-server-image',
type=str,
default='gcr.io/cloud-tpu-v2-images/pathways/pathways-demo:proxy_server',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmk if you do want to keep the ability to have a default and we can think through how this would look!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is super important, as folks would need to replace it during iteration!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

during iteration

Can you go into this more? will customers need to replace this often? curious if we should make it simpler if so.

@RoshaniN RoshaniN force-pushed the roshanin-check-use-pathways branch from 86bdd91 to 24b9a11 Compare March 27, 2024 22:30
@RoshaniN RoshaniN marked this pull request as draft March 28, 2024 17:50
@RoshaniN
Copy link
Collaborator Author

RoshaniN commented Mar 28, 2024

I will be punting this PR for later. With the way parsers are configured, it is not straightfoward to ensure --use-pathways is provided when any of the other Pathways args are provided.

I will resume when the requirements are clearer and this edge case takes priority.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants