
docs: docs changes for searcher context removal #10182

Merged: 9 commits merged into main from scr-docs on Nov 1, 2024

Conversation

azhou-determined
Contributor

@azhou-determined azhou-determined commented Oct 31, 2024

Ticket

Description

A few docs updates from the searcher context removal.

Test Plan

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@azhou-determined azhou-determined requested review from a team as code owners October 31, 2024 22:14
@cla-bot cla-bot bot added the cla-signed label Oct 31, 2024
@determined-ci determined-ci requested a review from a team October 31, 2024 22:14
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 31, 2024

netlify bot commented Oct 31, 2024

Deploy Preview for determined-ui canceled.

Name                  Link
🔨 Latest commit       b9d4927
🔍 Latest deploy log   https://app.netlify.com/sites/determined-ui/deploys/67255e339ce80a0008ac3616


codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 41.66667% with 7 lines in your changes missing coverage. Please review.

Project coverage is 54.25%. Comparing base (f9ac6bc) to head (b9d4927).
Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
...determined/pytorch/deepspeed/_deepspeed_context.py 55.55% 4 Missing ⚠️
master/pkg/searcher/searcher.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10182      +/-   ##
==========================================
- Coverage   54.26%   54.25%   -0.01%     
==========================================
  Files        1259     1259              
  Lines      157284   157293       +9     
  Branches     3643     3643              
==========================================
- Hits        85357    85347      -10     
- Misses      71794    71813      +19     
  Partials      133      133              
Flag Coverage Δ
backend 45.90% <0.00%> (-0.03%) ⬇️
harness 71.13% <55.55%> (-0.01%) ⬇️
web 54.30% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
harness/determined/pytorch/_trainer_utils.py 57.50% <ø> (ø)
harness/determined/pytorch/deepspeed/_trainer.py 49.54% <ø> (ø)
master/pkg/searcher/searcher.go 46.72% <0.00%> (ø)
...determined/pytorch/deepspeed/_deepspeed_context.py 62.68% <55.55%> (-0.34%) ⬇️

... and 5 files with indirect coverage changes

@azhou-determined azhou-determined requested review from MikhailKardash and removed request for corban-beaird October 31, 2024 22:34
Comment on lines 477 to 478
+ # Set flag used by internal PyTorch training loop
+ os.environ["DET_MANUAL_INIT_DISTRIBUTED"] = "true"
Member

I'm sorry, what?

If we can't fix this before the release ships, then I vote we remove this section and add it back later when we fix it.

This is basically a bug, and we're documenting it here rather than fixing it.

Contributor Author

well, this is copied from pytorch-ug, where we document similar behavior, which isn't a bug, but a hackily-supported use case.

i'm pretty sure this isn't needed here though: either we're in local mode, where we init the dist backend ourselves and pass it into the ds.init(), or we're on cluster and it's None.

so i removed it, cc: @MikhailKardash
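
For illustration, a minimal sketch of those two cases. This assumes ``det_ds.init()`` accepts a ``distributed`` argument the way ``pytorch.init()`` does, and that ``core.DistributedContext.from_deepspeed()`` is available; treat it as a sketch, not a confirmed API:

    import deepspeed
    from determined import core
    from determined.pytorch import deepspeed as det_ds

    # Case 1, on cluster: Determined supplies the distributed context itself.
    with det_ds.init() as train_context:
        ...

    # Case 2, local mode: initialize the backend ourselves and pass it in.
    deepspeed.init_distributed()  # sets up torch.distributed
    distributed = core.DistributedContext.from_deepspeed()
    with det_ds.init(distributed=distributed) as train_context:
        ...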

Member

If we are documenting internal flags, that is a bug. It's also an oxymoron, since documenting it means it's not really internal, just a weirdly bad public API.

Contributor Author

"it's a feature, not a bug!"

lol, IIRC at the time of landing pytorch trainer, we decided to do this because we wanted to document a shiny new local training capability but didn't have time to figure out a good way to do it.

you're right that it's a bad public API. i've put it on my todo list, and i'll fix it next week. it's not related to this PR or this feature, though.

Comment on lines 586 to 588
If your training code needs to read some values from the experiment configuration,
``pytorch.deepspeed.init()`` accepts an ``exp_conf`` argument which allows calling
``context.get_experiment_config()`` from ``DeepSpeedTrialContext``.
Member

The experiment config isn't safe for users to read from, because we automatically shim it from time to time.

The fact that it's part of the context is like, ancient legacy.

Instead, tell people to access info.trial.user_data for the data field or info.trial.hparams for hyperparameters.
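
For instance, a minimal sketch of the access pattern suggested above (``det.get_cluster_info()`` returns ``None`` when not running on a cluster, hence the guard):

    import determined as det

    info = det.get_cluster_info()
    assert info is not None, "must be run on a Determined cluster"

    hparams = info.trial.hparams      # hyperparameters for this trial
    user_data = info.trial.user_data  # the experiment config's `data` field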

integrates Keras training code with Determined through a single :ref:`Keras Callback
<api-keras-ug>`.

- API: introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that
Contributor

One of the changes to DeepSpeedTrialContext was to remove det.EnvContext from it. This used to be accessible via context.env. I mentioned this as a potential breaking change for some users, who may have to move to the context.get_experiment_config() and context.get_hparams() methods.

Contributor

Actually, should probably just tell users it's gone completely and they should read config from the cluster info.

Member

This shouldn't technically be breaking because it was never documented, so it was not technically public.

But it's so old that it wasn't prefixed with _, because we weren't good about doing that back then.

Member

If it wasn't documented before, we don't need to document that it is gone, imo.

Contributor

The reason I reference it is that our examples had code that used it (gpt_neox before the Trainer API rewrites, for example).

@determined-ci determined-ci requested a review from a team October 31, 2024 22:47
@@ -319,25 +319,6 @@ While debugging, the logger will display lines highlighted in blue for easy iden
Validation Policy
*******************

.. _experiment-config-min-validation-period:
Member

uh, I should probably tell you that @tara-det-ai told me we should mark it as deprecated but not delete it from the docs.

This is contrary to my preference, which is to let users look in old docs if they want to use old features.

But also it makes sense in this case because we don't have any deprecation warnings anywhere in the system for these fields; users would only know by looking at the docs.

Also this means we need to revert some of my docs deletions from the searcher-context-removal branch, which I realize now that I forgot to do 😞. It would be the removals from this file, I think.

That said, the other removals, like "use min_validation_period to control validations", should stand. But the actual reference here should not be removed.

Contributor Author

welp. ok then. i added them all back.

Member

thanks, sorry.

@@ -46,7 +46,7 @@ def __init__(self, context: det_ds.DeepSpeedTrialContext,
        self.discriminator = self.context.wrap_model_engine(discriminator)
        self.fixed_noise = self.context.to_device(
            torch.randn(
                self.context.train_micro_batch_size_per_gpu, self.hparams["noise_length"], 1, 1
Member

was this a bug, or is this a breaking change to the API?

Contributor Author

both? it was a bug because it was a breaking change. added to release notes.

Member

This is the only reason we are breaking existing DeepSpeedTrials? Like, we're deprecating old paths, yes, but we're only actually breaking user code for this?

Can we include a @property-style getter that makes this not a breaking change?

something like:

    @property
    def train_micro_batch_size_per_gpu(self) -> int:
        # Emit a deprecation warning, then delegate to the new getter.
        warnings.warn(
            "DeepSpeedTrialContext.train_micro_batch_size_per_gpu has been deprecated in "
            "Determined 0.38.0; please use the context.get_train_micro_batch_size_per_gpu() "
            "getter instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.get_train_micro_batch_size_per_gpu()

@determined-ci determined-ci requested a review from a team November 1, 2024 22:46
Comment on lines 15 to 17
- DeepSpeed: the ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes
on ``DeepSpeedContext`` have been replaced with ``get_train_micro_batch_size_per_gpu()`` and
``get_num_micro_batches_per_slot()``.
Member

no longer breaking
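
For reference, a rough before/after sketch of the migration the quoted note describes, where ``context`` is a ``DeepSpeedTrialContext``:

    # Before: attribute access (now deprecated rather than removed).
    mbs = context.train_micro_batch_size_per_gpu
    num_micro = context.num_micro_batches_per_slot

    # After: the new getter methods.
    mbs = context.get_train_micro_batch_size_per_gpu()
    num_micro = context.get_num_micro_batches_per_slot()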

Member
@rb-determined-ai rb-determined-ai left a comment

awesome!

@determined-ci determined-ci requested a review from a team November 1, 2024 23:00
@azhou-determined azhou-determined merged commit f02872a into main Nov 1, 2024
79 of 97 checks passed
@azhou-determined azhou-determined deleted the scr-docs branch November 1, 2024 23:25
github-actions bot pushed a commit that referenced this pull request Nov 1, 2024
update docs and add release note for searcher context removal in 0.38.0

(cherry picked from commit f02872a)
Labels: backport release-0.38.0, cla-signed, documentation